Human population genetics and phenotype data
The SNP rs1738074 in the 5’ UTR of the human TAGAP gene has been identified as a genetic risk factor for a few diseases. Use Ensembl to answer the following questions:
- In which transcripts is this SNP found?
- What is the least frequent genotype for this SNP in the Yoruba (YRI) population from the 1000 Genomes phase 3?
- What is the ancestral allele? Is it conserved in the 91 eutherian mammals EPO-Extended?
- With which diseases is this SNP associated? Are there any known risk (or associated) alleles?
- Please note there is more than one way to get this answer. Either go to the Variation table of the human TAGAP gene and use the Consequence filter to only include 5’ UTR variants, or search Ensembl for rs1738074 directly. Once you’re in the Variant tab, click on Genes and regulation in the menu.
This SNP is found in four transcripts of TAGAP. It is also an intron_variant in one lncRNA transcript of TAGAP-AS1.
- Click on Population genetics in the left-hand panel, or click on Explore this variant in the left-hand panel and click the Population genetics icon.
In Yoruba (YRI), the least frequent genotype is CC, at a frequency of 5.6%.
- Click on Phylogenetic context in the left-hand panel.
The ancestral allele is T and it’s inferred from the alignment in primates.
Click on Select an alignment which will open a pop-up menu. Open Multiple alignments and select 91 eutherian mammals EPO-Extended. Click on Apply at the bottom of the menu to save your settings.
A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence are displayed. The T allele is conserved in all but two of the eutherian mammals displayed.
- Click Phenotype data in the left-hand panel.
This variant is associated with multiple sclerosis, celiac disease and white blood cell count. Risk alleles are reported for all three phenotypes, along with the corresponding P values. The allele A is associated with celiac disease, although the alleles reported by Ensembl are T/C. Ensembl reports alleles on the forward strand, which suggests that A was reported on the reverse strand in the original paper. Similarly, one of the alleles reported for multiple sclerosis is G.
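The strand difference described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `to_opposite_strand` is made up for illustration, and the code simply takes the reverse complement of each allele to show how A and G on the reverse strand correspond to T and C on the forward strand.

```python
# Map each nucleotide to its complement on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_opposite_strand(allele: str) -> str:
    """Return the reverse complement of an allele string (hypothetical helper)."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


# Alleles as reported in the original papers (assumed to be on the reverse strand).
reported = ["A", "G"]
forward = [to_opposite_strand(allele) for allele in reported]
print(forward)  # ['T', 'C'], matching the T/C alleles shown by Ensembl
```

For single-base alleles the reversal makes no difference, but it keeps the helper correct for longer alleles such as indels.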
Exploring VNTR in human
Variable number tandem repeats (VNTRs) show high variation in the number of repeats in the population and are commonly used in forensics (DNA fingerprinting) and to study genetic diversity.
(a) Go to the region from 3074666 to 3075100 bp on human chromosome 4. Which gene does it overlap? Which exon of this gene falls in this region?
(b) Configure this page to turn on Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF) tracks in this view. Can you see any repeats in this exon? What tools were used to annotate the repeats according to the track information?
(c) Zoom in on the (CAG)n to see its sequence. How many CAG repeats can you see in the human reference assembly? Does this track overlap any phenotype-associated variants? What is the identifier of this variant?
(d) Go to the variant tab of the phenotype-associated variant. What is the consequence ontology of this variant? Does the reference allele match the number of repeats you have just counted? What are the shortest and longest alleles?
(a) Select Search: Human and type 4:3074666-3075100 in the text box (or alternatively type human 4:3074666-3075100 in the text box). Click Go.
Click on the golden transcript falling in this region. You can see it’s exon 1 of 67 of the huntingtin gene (HTT).
(b) Click Configure this page in the side menu then select: Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF).
There are three tandem repeats in this exon, and two simple repeats (low): (CAG)n and (CCG)n. Click on the track names to find out more about the tools used for annotation: RepeatMasker and Tandem Repeats Finder.
(c) Draw with your mouse a box around the (CAG)n repeat. Click on Jump to region in the pop-up menu.
There are 19 CAG repeats in the human reference sequence overlapping rs71180116 indicated by a pink bar in the All phenotype-associated - short variants (SNPs and indels) track.
(d) Click on the rs71180116 ID to go to the variant tab. You can see in the summary page that this variant is classified as an inframe insertion. Either click + to show all of the alleles in the summary page or go to the Genes and regulation table. This variant has many alternative alleles which differ in the number of repeats. The first allele in the expanded Alleles section of the summary page or the first allele in the Codons column in the Genes and regulation table is the reference allele. It is composed of 19 CAG repeats just as in the Region in detail view. The shortest allele has 7 repeats, the longest has 55 repeats.
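Counting repeats by eye is error-prone, so the count from the Region in detail view could be double-checked in code once the sequence has been exported. The sketch below is an illustration only: `longest_repeat_run` is a hypothetical helper, and the toy sequence is invented and is not the real HTT exon 1 sequence.

```python
import re


def longest_repeat_run(sequence: str, unit: str = "CAG") -> int:
    """Return the number of copies in the longest uninterrupted run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(match.group(0)) // len(unit)
            for match in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Toy sequence (NOT the real exon): 4 CAG copies, a CAA interruption,
# 2 more CAG copies, then 3 CCG copies.
toy = "ATG" + "CAG" * 4 + "CAA" + "CAG" * 2 + "CCG" * 3
print(longest_repeat_run(toy))         # 4
print(longest_repeat_run(toy, "CCG"))  # 3
```

Applied to a sequence containing the reference allele described above, the same function would be expected to return 19 for the CAG unit.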
Exploring a SNP in mouse
In the paper “Altered metabolic signature in pre-diabetic NOD mice” (PloS One. 2012; 7(4): e35445), Madsen et al. have described several regulatory and coding SNPs, some of them in genes involved in ATP and adenosine metabolism, leading to potentially faulty metabolism of ATP and adenosine. The authors describe that one of the identified SNPs in the murine Entpd2 gene (rs28232063) would lead to increased amounts of available ATP, an immune activator, causing increased cell activation and possibly autoreactive T-cell activation. Use Ensembl to answer the following questions:
- Where is the SNP located (chromosome and coordinates)?
- What is the HGVS-recommended nomenclature for this SNP?
- Why does Ensembl put the G allele first (G/A)?
- Are there differences between the genotypes reported in C57BL/6NJ and NOD/ShiLtJ, according to the Mouse Genomes Project?
- From the Ensembl homepage, select Mouse from the Species search drop-down and enter rs28232063 in the search box.
SNP rs28232063 is located on 2:25288362. In Ensembl, its alleles are provided relative to the forward strand.
- Click on Show under HGVS names to reveal information about HGVS nomenclature.
This SNP has four HGVS names: one at the genomic DNA level (NC_000068.8:g.25288362G>A), two at the transcript level (ENSMUST00000148859.2:n.444-182G>A and ENSMUST00000028328.3:c.446G>A) and one at the protein level (ENSMUSP00000028328.3:p.Arg149Gln).
- In Ensembl, the allele that is present in the reference genome assembly is always put first.
G is the allele of the reference mouse genome strain, C57BL/6J.
- Click on Sample genotypes in the left-hand panel. The table shows genotypes reported for different mouse strains from the Mouse Genomes Project.
There are indeed differences between the genotypes reported in those two different strains. The genotype reported in C57BL/6NJ is G/G whereas in NOD/ShiLtJ the genotype is A/A.
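The structure of the HGVS names listed in this answer can be made explicit with a small parser. This is a rough sketch rather than a full HGVS implementation: `describe_hgvs` is a hypothetical helper that only splits a name into its reference sequence, its level (taken from the one-letter prefix g, c, n or p) and the change description.

```python
from typing import Tuple

# One-letter prefixes used by HGVS nomenclature for the names shown above.
HGVS_LEVELS = {
    "g": "genomic DNA",
    "c": "coding DNA (transcript)",
    "n": "non-coding DNA (transcript)",
    "p": "protein",
}


def describe_hgvs(name: str) -> Tuple[str, str, str]:
    """Split 'reference:prefix.change' into (reference, level, change)."""
    reference, description = name.split(":", 1)
    prefix, change = description.split(".", 1)
    return reference, HGVS_LEVELS[prefix], change


names = [
    "NC_000068.8:g.25288362G>A",
    "ENSMUST00000148859.2:n.444-182G>A",
    "ENSMUST00000028328.3:c.446G>A",
    "ENSMUSP00000028328.3:p.Arg149Gln",
]
for name in names:
    print(describe_hgvs(name))
```

The first split is limited to one colon and the second to one dot, so version suffixes such as `.8` in the reference identifier are left untouched.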
Variation data in tomato
- Go to Ensembl Plants and find the Solyc02g084570.3 gene in Solanum lycopersicum (tomato) and go to its Location tab. Can you see the variation track?
- Zoom in around the last exon of this gene. What are the different types of variants seen in that region? Are any splice region variants mapped in the region? If so, what is/are the coordinate(s)?
- Select Solanum lycopersicum from the Species search drop-down menu and search for Solyc02g084570.3. In the results page, you can click on the coordinates 2:48284598-48288482 to go straight to the Location tab. Scroll down to the Region in detail view. The variation track is shown at the bottom of the view.
If you don’t see the Variation - All sources track, click Configure this page on the left-hand panel, search for the track in the pop-up menu and enable the track by clicking on the square next to the track name. Close the pop-up window and wait for the track to load.
- Zoom in around the last exon of this gene by drawing a box around the region (you can change your mouse action by clicking the Drag/Select icons at the top right-hand corner of the view). Note the gene is on the reverse strand (this is signified by the < sign next to the transcript name, and it is located below the Contigs track), so the last exon will be on the left-hand side of the image. The variation legend is shown at the bottom of the page, telling you what the colours mean.
The types of variants seen in that region are 3’ UTR, missense, synonymous and splice region variants.
Splice region variants are shown in orange. Click on the variants to get additional information on that variant including location. You can zoom into the region if the variant block is too small to click.
The variants are found at 2:48285642 and 2:48285640-48285641. Note that the two variants overlap: one is a SNP and the other is an indel. SNPs are tagged with ambiguity codes (zoom into the region if you cannot see this). You can find a useful IUPAC ambiguity code guide on the bioinformatics.org website. Single-letter ambiguity codes are given when two or more possible nucleotides may be represented at a single base locus.
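The ambiguity codes mentioned above follow the standard IUPAC nucleotide table, which is small enough to write out. The sketch below assumes single-nucleotide alleles only; `ambiguity_code` is a hypothetical helper name and the example allele pairs are arbitrary, not taken from the tomato variants.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CGT"): "B",
    frozenset("AGT"): "D",
    frozenset("ACT"): "H",
    frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_code(alleles):
    """Return the IUPAC code for a collection of single-base alleles."""
    return IUPAC_CODES[frozenset(allele.upper() for allele in alleles)]


print(ambiguity_code(["C", "T"]))  # Y
print(ambiguity_code(["G", "A"]))  # R
```

Using `frozenset` keys means the order of the alleles does not matter, so C/T and T/C give the same code.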
Variation data in Fusarium oxysporum
- How many species in Ensembl Fungi have variation data?
- Select Fusarium oxysporum (FO2) and search for the FOXG_13574T0 gene. One of its upstream variants is SNP tmp_10_6610. What are the possible alleles for this polymorphic position? Which one is on the reference genome?
- What is the most frequent allele at this position?
- Which samples have the genotypes C|T and T|T?
- Go to Ensembl Fungi, click on View full list of all species. You can sort the table by column. Click on the Variation database column to sort the table by species with variation data.
The table shows that 8 fungal species currently have variation databases.
- Click on Fusarium oxysporum in the table and on the species page search for FOXG_13574T0. From the Gene tab, click on Variant table in the left-hand panel. You can use the filter at the top right-hand corner of the table to find tmp_10_6610.
The alleles are C/T, where C is the reference allele.
- Click on tmp_10_6610 in the table to open the Variant tab. Then click on Genotype frequency from the menu on the left-hand side of the page.
The most frequent allele at this position is C with a frequency of 0.850.
- Click on Sample genotypes in the menu on the left.
The table shows that sample 909454 has the C|T genotype and 909455 has the T|T genotype.
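The link between the Sample genotypes table and the allele frequency reported above can be illustrated by counting alleles directly. This is a sketch under stated assumptions: `allele_frequencies` is a hypothetical helper, and the ten genotypes are made up so that the arithmetic lands on 0.85. They are not the actual Ensembl Fungi sample genotypes.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Compute allele frequencies from genotype strings such as 'C|T' or 'C/T'."""
    counts = Counter()
    for genotype in genotypes:
        # Accept both phased ('|') and unphased ('/') separators.
        counts.update(genotype.replace("/", "|").split("|"))
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical genotypes for ten diploid samples (invented for illustration).
genotypes = ["C|C"] * 8 + ["C|T", "T|T"]
freqs = allele_frequencies(genotypes)
print(round(freqs["C"], 3), round(freqs["T"], 3))  # 0.85 0.15
```

Here 17 of the 20 alleles are C (16 from the C|C samples plus one from the C|T sample), which gives 17/20 = 0.85.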