
ESHG: Using the Ensembl Variant Effect Predictor (VEP)

Course Details

Lead Trainer
Aleena Mushtaq
Event Dates
2022-06-11 until 2022-06-14
Location
  Virtual
Description
This workshop will guide you through the Ensembl/GENCODE and MANE annotation process, allowing you to experience the challenges involved in annotation and transcript selection. You will also learn how to use the Ensembl VEP to map genetic variants onto Ensembl/GENCODE transcripts and other Ensembl annotation, including regulatory regions and regions of evolutionary conservation, to determine their likely functional effects, and how to filter the results to prioritise variants.

Materials

CC-BY 4.0 logo

Demos and exercises

Variation

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for HBB in human. Search for HBB and go to the Sequence view.

If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links. You may also wish to add a filter to the variants to allow them to load more quickly; we’ll add Filter variants by evidence status: 1000Genomes.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.

To view all the sequence variants in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in. For display purposes, the table above has already been filtered to only show missense variants.

You can also filter by the different pathogenicity scores and MAF, or click on Filter other columns for filtering by other columns such as Evidence or Class.

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.

You can also see the phenotypes associated with a gene. Click on Phenotype in the left hand menu.

Open the transcript table and go to HBB-201 ENST00000335295, then click on Haplotypes in the left hand menu.

The Haplotypes view in the transcript tab shows you the actual protein and CDS sequences in 1000 Genomes individuals. This is possible because the 1000 Genomes study has phased genotypes, so we know which alleles occur on which of the chromosome pairs. The table lists all the versions of the protein that occur along with their frequencies, including the reference sequence and sequences with one or more alternative alleles.

Click on one of the haplotypes (we’ll go for 18K>*,19del{130}) to find out more about it. Here you will see the frequency in the 1000 Genomes subpopulations, the sequence and the 1000 Genomes individuals where this protein is found.

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar.

Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, presence of a phenotype or by individual genome they were isolated from. You can also turn on genotyping chips.

Let’s have a look at a specific variant. If we zoomed in we could see the variant rs334 in this region, however it’s easier to find if we put rs334 into the search box. Click through to open the Variation tab.

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link on the left.

This page illustrates the genes the variant falls within and the consequences on those genes, including pathogenicity predictors. It also shows data from GTEx on genes that have increased/decreased expression in individuals with this variant, in different tissues. Finally, regulatory features and motifs that the variant falls within are shown.

We can also see the variant in the protein structure by clicking on 3D Protein model.

This is a LiteMol viewer, where you can rotate and zoom in on the structure. The variant location is highlighted, so you can see where it lands within the structure.

Let’s look at population genetics. Click on Population genetics in the left-hand menu.

The population allele frequencies are shown by study, including 1000 Genomes and gnomAD. Where genotype frequencies are available, these are shown in the tables.

There are big differences in allele frequencies between populations. Let’s have a look at the phenotypes associated with this variant to see if they are known to be specific to certain human populations. Click on Phenotype Data in the left-hand menu.

This variant is associated with various phenotypes, including sickle cell and malaria resistance. These phenotype associations come from sources including the GWAS catalog, ClinVar, Orphanet and OMIM. Where available, there are links to the original paper that made the association, the allele that is associated with the phenotype and p-values and other statistics.

Human population genetics and phenotype data

The SNP rs1738074 in the 5’ UTR of the human TAGAP gene has been identified as a genetic risk factor for a few diseases.

(a) In which transcripts is this SNP found?

(b) What is the least frequent genotype for this SNP in the Yoruba (YRI) population from the 1000 Genomes phase 3?

(c) What is the ancestral allele? Is it conserved in the 90 eutherian mammals?

(d) With which diseases is this SNP associated? Are there any known risk (or associated) alleles?

(a) Please note there is more than one way to get this answer. Either go to the Variation Table for the human TAGAP gene, and Filter variants to the 5’UTR, or search Ensembl for rs1738074 directly.

Once you’re in the Variation tab, click on the Genes and regulation link or icon.

This SNP is found in four transcripts of TAGAP. It is also intronic to nine non-coding transcripts and up/downstream to 14 non-coding transcripts.

(b) Click on Population genetics at the left of the variation tab. (Or, click on Explore this variation at the left and click the Population genetics icon.)

In Yoruba (YRI), the least frequent genotype is CC at the frequency of 5.6%.

(c) Click on Phylogenetic context.

The ancestral allele is T and it’s inferred from the alignment in primates.

Select the 91 eutherian mammals EPO-Extended alignment and click on Apply.

A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence are displayed. The T allele is conserved in all but three of the eutherian mammals displayed.

(d) Click Phenotype Data at the left of the Variation page.

This variation is associated with multiple sclerosis, celiac disease and white blood cell count. There are known risk alleles for both multiple sclerosis and celiac disease, and the corresponding p-values are provided. The allele A is associated with celiac disease. Note that the alleles reported by Ensembl are T/C, because Ensembl reports alleles on the forward strand. This suggests that A was reported on the reverse strand in the original paper. Similarly, one of the alleles reported for multiple sclerosis is G.

Exploring a SNP in human

The missense variation rs1801133 in the human MTHFR gene has been linked to elevated levels of homocysteine, an amino acid whose plasma concentration seems to be associated with the risk of cardiovascular diseases, neural tube defects, and loss of cognitive function. This SNP is also referred to as ‘A222V’, ‘Ala222Val’ as well as other HGVS names.

(a) Find the page with information for rs1801133.

(b) Is rs1801133 a Missense variation in all transcripts of the MTHFR gene? What is the amino acid change?

(c) Why are the alleles for this variation in Ensembl given as G/A and not as C/T, as in the literature?

(d) What is the major allele of rs1801133 in different populations?

(e) In which paper(s) is the association between rs1801133 and homocysteine levels described?

(f) According to the data imported from dbSNP, the ancestral allele for rs1801133 is G. Ancestral alleles in dbSNP are based on a comparison between human and chimp. Does the sequence at this same position in other primates confirm that the ancestral allele is G?

(a) Go to the Ensembl homepage (http://www.ensembl.org/).

Type rs1801133 in the Search box, then click Go. Click on rs1801133.

(b) Click on Genes and Regulation in the side menu (or the Genes and Regulation icon).

No, rs1801133 is a missense variant in nine MTHFR transcripts. Please note that this variant is multiallelic, with two alternative alleles. As this table displays one consequence per row, each transcript is listed twice.

The amino acid change is A/V for allele A, and A/G for allele C.

(c) In Ensembl, the alleles of rs1801133 are given as G/A/C because these are the alleles in the forward strand of the genome. In the literature, the alleles are given as C/T/G because the MTHFR gene is located on the reverse strand. The alleles in the actual gene and transcript sequences are C/T/G. In Ensembl, the allele that is present in the reference genome assembly is always put first.
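
The strand conversion described in (c) can be sketched in a few lines of Python. This is only an illustration: the helper name complement_alleles is made up for this example and is not part of any Ensembl tool, and it only handles single-base alleles (longer alleles would also need to be reversed).

```python
# Hypothetical helper (not an Ensembl API): switch an allele string between
# forward- and reverse-strand notation by complementing each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_alleles(allele_string):
    """Complement every single-base allele in a string such as 'G/A/C'."""
    alleles = allele_string.split("/")
    return "/".join(COMPLEMENT[base] for base in alleles)


# Forward-strand alleles of rs1801133 as reported by Ensembl.
print(complement_alleles("G/A/C"))
```

For G/A/C this gives C/T/G, the alleles used in the literature. The same conversion turns the T/C alleles of rs1738074 into A/G.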

(d) Click on Population genetics in the side menu.

In all populations but one, the allele G is the major one. The exception is CLM (Colombian in Medellin; 1000 Genomes).

(e) Click on Phenotype Data in the left hand side menu.

The specific studies where the association was originally described are given in the Phenotype Data table. Links between rs1801133 and homocysteine levels were described in four papers. Click on the PubMed IDs PMID:20031578, PMID:23696881, PMID:30339177 and PMID:23824729 for more details.

(f) Click on Phylogenetic Context in the side menu.

Select Alignment: 9 primates EPO and click Apply.

Gorilla, bonobo, chimp, macaque, gibbon, vervet, crab-eating macaque and mouse lemur all have a G in this position.

Exploring a SNP in mouse

Madsen et al. in the paper ‘Altered metabolic signature in pre-diabetic NOD mice’ (PLoS One. 2012; 7(4): e35445) have described several regulatory and coding SNPs, some of them in genes residing within the previously defined insulin dependent diabetes (IDD) regions. The authors describe that one of the identified SNPs in the murine Xdh gene (rs29522348) would lead to an amino acid substitution and could be damaging, as predicted by SIFT (http://sift.jcvi.org/).

(a) Where is the SNP located (chromosome and coordinates)?

(b) What is the HGVS-recommended nomenclature for this SNP?

(c) Why does Ensembl put the C allele first (C/T)?

(d) Are there differences between the genotypes reported in NOD/LTJ and BALB/cByJ, according to the PERLEGEN panel?

(a) Go to www.ensembl.org, type rs29522348 in the search box. Click on rs29522348 (Mouse Variation).

SNP rs29522348 is located on 17:74231988. In Ensembl, its alleles are provided as in the forward strand.

(b) Click on HGVS names to reveal information about HGVS nomenclature.

This SNP has got five HGVS names, one at the genomic DNA level (NC_000083.7:g.74231988C>T), three at the transcript level (ENSMUST00000024866.6:c.721G>A, ENSMUST00000233162.2:n.738G>A and ENSMUST00000233621.2:c.*284G>A) and one at the protein level (ENSMUSP00000024866.5:p.Val241Ile).
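
The level of each HGVS name can be read from the one-letter prefix after the colon (g. for genomic, c. for coding transcript, n. for non-coding transcript, p. for protein). A minimal Python sketch of that reading follows; the function name hgvs_level and the label wording are assumptions made for this example.

```python
# Illustrative only: split an HGVS name into its reference sequence and the
# level indicated by the prefix that follows the colon.
HGVS_LEVELS = {
    "g": "genomic",
    "c": "coding transcript",
    "n": "non-coding transcript",
    "p": "protein",
}


def hgvs_level(name):
    """Return (reference sequence, level) for an HGVS name."""
    reference, description = name.split(":", 1)
    prefix = description.split(".", 1)[0]
    return reference, HGVS_LEVELS[prefix]


# The five HGVS names listed above for rs29522348.
names = [
    "NC_000083.7:g.74231988C>T",
    "ENSMUST00000024866.6:c.721G>A",
    "ENSMUST00000233162.2:n.738G>A",
    "ENSMUST00000233621.2:c.*284G>A",
    "ENSMUSP00000024866.5:p.Val241Ile",
]
for name in names:
    print(hgvs_level(name))
```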

(c) In Ensembl, the allele that is present in the reference genome assembly is always put first (C is the allele for the reference mouse genome, strain C57BL/6J).

(d) Click on Sample genotypes in the left hand side menu. In the summary of genotypes by population, click on Show for PERLEGEN:MM_PANEL2, or search for the two strain names. There are indeed differences between the genotypes reported in those two different strains. The genotype reported in NOD/LTJ is T|T whereas in BALB/cByJ the genotype is C|C.

Variation data in the tomato (S. lycopersicum) genome

(a) Find the Solyc02g084570.3 gene in tomato and go to its Location tab. Can you see the variation track?

(b) Zoom in around the last exon of this gene. What are the different types of variants seen in that region? What are the locations of any splice region variants mapped in the region?

(a) Search for Solyc02g084570.3 and click on the Location link in the results page. The variation track is shown at the bottom of the region.

(b) Zoom in around the last exon of this gene by drawing a box in the respective region. Please note the gene is on the reverse strand, so the last exon will be on the left hand side of that image.

The variation legend is shown at the bottom of the page, telling you what the colours mean.

The types of variants seen in that region are 3’ UTR variants, missense variants, synonymous variants and splice region variants.

Splice region variants are shown in orange. Click on the variants to get additional information on that variant including location.

The variants are found at 2:48285642 and 2:48285640-48285641.

Exploring VNTR in human

Variable number tandem repeats (VNTRs) show high variation in the number of repeats in the population and are commonly used in forensics (DNA fingerprinting) and to study genetic diversity.

(a) Go to the region from 3074666 to 3075100 bp on human chromosome 4. Which gene does it overlap? Which exon of this gene falls in this region?

(b) Configure this page to turn on Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF) tracks in this view. Can you see any repeats in this exon? What tools were used to annotate the repeats according to the track information?

(c) Zoom in on the polyglutamine (PolyQ) tract or (CAG)n to see its sequence. How many CAG repeats can you see in the human reference assembly? Does this tract overlap any phenotype-associated variants? What is the identifier of this variant?

(d) Go to the variant tab of the phenotype-associated variant. What is the consequence ontology of this variant? Does the reference allele match the number of repeats you have just counted? What are the shortest and longest alleles?

(e) Are there any phenotypes associated with this variant?

(a) Select Search: Human and type 4:3074666-3075100 in the text box (or alternatively type human 4:3074666-3075100 in the text box). Click Go.

Click on the golden transcript falling in this region. You can see it’s exon 1 of 67 of the huntingtin gene (HTT).

(b) Click Configure this page in the side menu then select: Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF).

There are two types of tandem repeats in this exon: polyglutamine (PolyQ) tract or (CAG)n and polyproline (PolyP) tract or (CCG)n; annotated by two different methods. Click on the track names to find more about the tools used for annotation: RepeatMasker and Tandem Repeats Finder.

(c) Draw with your mouse a box around the polyglutamine (PolyQ) tract or (CAG)n. Click on Jump to region in the pop-up menu.

There are 19 CAG repeats in the human reference sequence, overlapping rs71180116, which is indicated by a pink bar in the All phenotype-associated - short variants (SNPs and indels) track.

(d) Click on the rs71180116 ID to go to the variant tab. You can see in the summary page that this variant is classified as an inframe insertion. Either click + to show all of the alleles in the summary page or go to the Genes and regulation table. This variant has many alternative alleles which differ in the number of repeats. The first allele in the expanded Alleles section of the summary page or the first allele in the Codons column in the Genes and regulation table is the reference allele. It is composed of 19 CAG repeats just as in the Region in detail view. The shortest allele has 7 repeats, the longest has 55 repeats.

(e) Click on Phenotype data in the side menu. This variant is associated with Huntington disease, a trinucleotide repeat disorder (polyQ disease) caused by a pathogenic number of CAG repeats (above 36 copies) in a coding region of HTT.
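
Counting the repeats by eye, as in (c), can also be sketched in Python. The example below is only an illustration: the function names are made up, the sequence is a toy string rather than the real HTT exon, and the threshold of 36 copies is the one quoted in the answer above.

```python
import re

# Threshold taken from the text above: more than 36 CAG copies is described
# as pathogenic for HTT.
PATHOGENIC_ABOVE = 36


def longest_cag_run(sequence):
    """Return the number of units in the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def is_expanded(repeat_count, threshold=PATHOGENIC_ABOVE):
    """True when the repeat count is above the pathogenic threshold."""
    return repeat_count > threshold


# Made-up sequence for illustration (not the real HTT exon): a start codon,
# 19 CAG units, then a few other codons.
toy_sequence = "ATG" + "CAG" * 19 + "CAACAGCCG"
count = longest_cag_run(toy_sequence)
print(count, is_expanded(count))
```

Applied to the allele range reported for rs71180116, the shortest allele (7 repeats) and the reference (19 repeats) fall below the threshold, while the longest allele (55 repeats) is above it.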

Variation data in fungi

(a) How many species in Ensembl Fungi have variation data?

(b) Select Fusarium oxysporum and search for FOXG_13574T0 gene. One of its upstream variants is SNP tmp_10_6610. What are the possible alleles for this polymorphic position? Which one is on the reference genome?

(c) What is the most frequent allele at this position?

(d) Which samples have the genotypes C|T and T|T?

(a) Go to Ensembl Fungi, click on View full list of all species. Click on the upward triangle next to the Variation database column to sort the table by species with variation data.

The table shows that we have eight fungi species currently with variation databases.

(b) Click on Fusarium oxysporum in the table and on the species page search for FOXG_13574T0. From the Gene tab, click on Variant table and then scroll down to find tmp_10_6610 or use the table search box to find it.

The alleles are C/T, where C is the reference allele.

(c) Click on tmp_10_6610 in the table to open the Variant tab. Then click on Genotype Frequency from the menu on the left hand side of the page.

The most frequent allele at this position is C with a frequency of 0.850.

(d) Click on Sample genotypes in the left menu.

The table shows that sample 909454 has the C|T genotype and 909455 has the T|T genotype.

VEP

We have identified 10 variants on human chromosome eight in a patient with neurodegenerative symptoms including dystonia.

We will use the Ensembl VEP to determine:

  • Have my variants already been annotated in Ensembl?
  • What genes are affected by my variants?
  • Do any of my variants affect gene regulation?

Go to the front page of Ensembl and click on the Variant Effect Predictor.

This page contains information about the VEP, including links to download the script version of the tool. Click on Launch VEP to open the input form:

The data is in VCF format:
chromosome coordinate id reference alternative

Put the following into the Paste data box:

  • 8 22410000 var1 G A
  • 8 22412087 var2 C G
  • 8 22408305 var3 T C
  • 8 22409953 var4 C T
  • 8 22414115 var5 C T
  • 8 22406904 var6 GA C
  • 8 22366942 var7 G A
  • 8 22408348 var8 C T
  • 8 22404716 var9 GCTGCTGCTGCTGC GCTGCTGC
  • 8 22404808 var10 T C

The VEP will automatically detect that the data is in VCF.
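
If you prepare input like this with a script rather than by hand, it can help to check each line before pasting it. The following Python sketch parses the five whitespace-separated columns used above; the Variant type and the function names are hypothetical helpers for this example, not part of the VEP.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One input line: chromosome, coordinate, id, reference, alternative."""
    chrom: str
    pos: int
    ident: str
    ref: str
    alt: str


def parse_line(line):
    """Parse one whitespace-separated input line into a Variant."""
    chrom, pos, ident, ref, alt = line.split()[:5]
    return Variant(chrom, int(pos), ident, ref, alt)


def is_snv(variant):
    """True when both alleles are a single base."""
    return len(variant.ref) == 1 and len(variant.alt) == 1


# The ten variants from the exercise above.
lines = """8 22410000 var1 G A
8 22412087 var2 C G
8 22408305 var3 T C
8 22409953 var4 C T
8 22414115 var5 C T
8 22406904 var6 GA C
8 22366942 var7 G A
8 22408348 var8 C T
8 22404716 var9 GCTGCTGCTGCTGC GCTGCTGC
8 22404808 var10 T C""".splitlines()

variants = [parse_line(line) for line in lines]
non_snvs = [v.ident for v in variants if not is_snv(v)]
print(len(variants), non_snvs)
```

Here var6 and var9 are the only entries where the reference and alternative alleles are not single bases.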

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotations, Predictions, Filtering options and Advanced options. Let’s open all the menus and take a look.

Hover over the options to see definitions.

We’re going to select some options:

  • HGVS, annotation of variants in terms of the transcripts and proteins they affect, commonly-used by the clinical community
  • gnomAD (exomes) allele frequencies
  • Protein domains
  • Phenotypes

When you’ve selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done; you do not need to refresh the page. You can edit or discard your job at this time. If you have submitted multiple jobs, they will all appear here.

Click View results once your job is done.

In your results you will see a graphical summary of your data, as well as a table of your results.

The results table is enormous and detailed, so we’re going to go through it section by section. The first column is Uploaded variant. If your input data contains IDs, like ours does, the ID is listed here. If your input data is only loci, this column will contain the locus and alleles of the variant. You’ll notice that the variants are not necessarily in the same order as in your input. You’ll also see that there are multiple lines in the table for each variant, with each line representing one transcript or other feature the variant affects.

You can mouse over any column name to get a definition of what is shown.

The next few columns give the information about the feature the variant affects, including the consequence. Where the feature is a transcript, you will see the gene symbol and stable ID and the transcript stable ID and biotype. Where the feature is a regulatory feature, you will get the stable ID and type. For a transcription factor binding motif (labelled as a MotifFeature) you will see just the ID. Most of the IDs are links to take you to the gene, transcript or regulatory feature homepage.

This is followed by details on the effects on transcripts, including the position of the variant in terms of the exon number, cDNA, CDS and protein; the amino acid and codon change; transcript flags such as MANE, which can be used to choose a single transcript for variant reporting; and pathogenicity scores. The pathogenicity scores are shown as numbers with coloured highlights to indicate the prediction, and you can mouse over the scores to get the prediction in words. Two options that we selected in the input form are the HGVS notation, which is shown to the left of the image below and can be used for reporting, and the Domains to the right. The Domains column lists the protein domains found and, where one is available, provides a link to the 3D protein model, which will launch a LiteMol viewer highlighting the variant position.

Where the variant is known, the ID of the existing variant is listed, with a link out to the variant homepage. In this example, only rsIDs from dbSNP are shown, but sometimes you will see IDs from other sources such as COSMIC. The VEP also looks up the variant in the Ensembl database and pulls back the allele frequency (AF in the table), which will give you the 1000 Genomes Global Allele Frequency and allele frequencies from gnomAD. We can also see ClinVar clinical significance and the phenotypes associated with known variants or with the genes affected by the variants, with the variant ID listed for variant associations and the gene ID listed for gene associations, along with the source of the association.

For variants that affect transcription factor binding motifs, there are columns that show the effect on motifs (you may need to click on Show/hide columns at the top left of the table to display these). Here you can see the position of the variant in the motif, if the change increases or decreases the binding affinity of the motif and the transcription factors that bind the motif.

Let’s sort the table by Impact using the little upward triangle next to the column name (you may need to click on Show/hide columns to find it).

Impact is a subjective classification of the severity of the variant consequence, based on agreement with other variant annotators. It groups variant consequences into four categories: high, moderate, low and modifier. There is no high impact variant on our list, but we have some moderate variant classes such as missense or inframe deletion. Let’s filter the table to get rid of low impact variants now.

Above the table is the Filter option, which allows you to filter by any column in the table. You can select a column from the drop-down, then a logic option from the next drop-down, then type in your filter to the following box. We’ll try a filter of Impact, followed by is then moderate, which will give us only the variant consequences that might change protein effectiveness. You’ll notice that as you type moderate, the VEP will make suggestions for an autocomplete.

We are still left with quite a few variants. As we are dealing with a rare disorder, we can filter them by allele frequency, assuming that variants causing rare diseases should be very rare in the population. Choose gnomAD AF as the filter now, followed by < then 0.005, which will leave us with rare variants found below 0.5% frequency in the general population.

Only one variant meets our criteria. It is a benign missense variant reported on three different transcripts of the SLC39A14 gene, including the MANE Select transcript. Note that the allele frequency for some rare or novel variants might be unknown. Let’s take that into account by editing our allele frequency filter. Change the gnomAD AF filter to is not then defined (leave the text box empty to set it to default defined).

We are left with one pathogenic missense variant of unknown frequency reported on the MANE Plus clinical transcript and associated with hypermagnesemia with dystonia, which can potentially explain our clinical case.
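
The logic of these filters can be sketched in Python. Everything in the example is an assumption for illustration: the rows are invented toy data, not the real results for our ten variants, and the field and function names are made up. Only the 0.005 threshold and the moderate impact class come from the steps above.

```python
# Toy rows standing in for VEP result lines. Every value here is invented
# for illustration; None marks a variant with no gnomAD frequency.
rows = [
    {"variant": "toy1", "impact": "MODERATE", "gnomad_af": 0.21},
    {"variant": "toy2", "impact": "MODERATE", "gnomad_af": 0.001},
    {"variant": "toy3", "impact": "MODERATE", "gnomad_af": None},
    {"variant": "toy4", "impact": "MODIFIER", "gnomad_af": None},
]

MAX_AF = 0.005  # 0.5%, the threshold used in the walkthrough


def moderate(rows):
    """Keep rows whose impact is MODERATE."""
    return [r for r in rows if r["impact"] == "MODERATE"]


def rare(rows, max_af=MAX_AF):
    """Keep rows with a known frequency below max_af."""
    return [r for r in rows
            if r["gnomad_af"] is not None and r["gnomad_af"] < max_af]


def undefined_af(rows):
    """Keep rows with no recorded frequency."""
    return [r for r in rows if r["gnomad_af"] is None]


print([r["variant"] for r in rare(moderate(rows))])
print([r["variant"] for r in undefined_af(moderate(rows))])
```

Keeping the two frequency filters separate mirrors the walkthrough, where the < 0.005 filter and the is not defined filter each return a different variant.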

You can export your VEP results in various formats, including VCF. When you export as VCF, you’ll get all the VEP annotation listed under CSQ in the INFO column. After filtering your data, you’ll see that you have the option to export only the filtered data. You can also drop all the genes you’ve found into the Gene BioMart, or all the known variants into the Variation BioMart to export further information about them.
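
If you work with the exported VCF in a script, the CSQ entry can be unpacked using the field order declared in the file header. The Python sketch below assumes the usual layout, where fields are separated by | and one annotation per feature is separated by commas. The header text and INFO value are shortened toy examples, and the function names are made up for this illustration.

```python
def csq_field_names(description):
    """Extract field names from a CSQ header description ('... Format: A|B|C')."""
    return description.split("Format: ", 1)[1].strip().split("|")


def parse_csq(info, field_names):
    """Return one dict per annotation found under CSQ in an INFO string."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "CSQ":
            return [dict(zip(field_names, annotation.split("|")))
                    for annotation in value.split(",")]
    return []


# Toy header description and INFO value (shortened; a real export lists
# many more fields, in the order given by its own header).
description = "Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL"
info = "CSQ=A|missense_variant|MODERATE|GENE1,A|intron_variant|MODIFIER|GENE2"

names = csq_field_names(description)
annotations = parse_csq(info, names)
for annotation in annotations:
    print(annotation["SYMBOL"], annotation["Consequence"], annotation["IMPACT"])
```

Reading the field names from the header rather than hard-coding them keeps the script working when different output options are selected.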

VEP CFTR

Resequencing of the genomic region of the human CFTR (cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)) gene (ENSG00000001626) has revealed the following variants (alleles defined in the forward strand):

  • G/A at 7: 117,530,985
  • T/C at 7: 117,531,038
  • T/C at 7: 117,531,068

Use the VEP tool in Ensembl and choose the options to see SIFT and PolyPhen predictions. Do these variants result in a change in the proteins encoded by any of the Ensembl genes? Which gene? Have the variants already been found?

Go to www.ensembl.org and click on the link Tools at the top of the page. Currently there are nine tools listed in that page. Click on Variant Effect Predictor and enter the three variants as below:

7 117530985 117530985 G/A
7 117531038 117531038 T/C
7 117531068 117531068 T/C

Note: Variation data input can be done in a variety of formats. See more details here http://www.ensembl.org/info/docs/variation/vep/vep_formats.html
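
As a small illustration of how this input relates to the VCF-style input used earlier, the Python sketch below parses the four columns shown above (chromosome, start, end, alleles) and rewrites a single-base substitution as chromosome, coordinate, id, reference, alternative. The function names are hypothetical, and the conversion deliberately refuses anything other than single-base substitutions.

```python
def parse_default_line(line):
    """Parse 'chrom start end ref/alt' as used in the exercise above."""
    chrom, start, end, alleles = line.split()[:4]
    ref, alt = alleles.split("/")
    return {"chrom": chrom, "start": int(start), "end": int(end),
            "ref": ref, "alt": alt}


def to_vcf_fields(record, ident="."):
    """Rewrite a single-base substitution as VCF-style columns.

    Only valid when start equals end and both alleles are one base long.
    """
    single_base = len(record["ref"]) == 1 and len(record["alt"]) == 1
    if record["start"] != record["end"] or not single_base:
        raise ValueError("only single-base substitutions are handled here")
    return [record["chrom"], str(record["start"]), ident,
            record["ref"], record["alt"]]


for line in ["7 117530985 117530985 G/A",
             "7 117531038 117531038 T/C",
             "7 117531068 117531068 T/C"]:
    print(" ".join(to_vcf_fields(parse_default_line(line))))
```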

Click Run.

When your job is listed as Done, click View Results.

You will get a table with the consequence terms from the Sequence Ontology project (http://www.sequenceontology.org/) (e.g. synonymous, missense, downstream, intronic, 5’ UTR, 3’ UTR) provided by VEP for the listed SNPs. You can also upload the VEP results as a track and view them on Location pages in Ensembl. SIFT and PolyPhen are available for missense SNPs only. For two of the entered positions (coordinates 117531038 and 117531068), the variations have been predicted to have missense consequences of varying pathogenicity, both affecting CFTR. All three variants have already been described and are known as rs1800077, rs1800078 and rs35516286 in dbSNP and other sources (databases, literature, etc.).

Viewing structural variants with the VEP

We have details of a genomic deletion in a breast cancer sample in VCF format:

13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738

(a)  How many genes have been affected?

(b)  Does the SV cause deletion of any complete transcripts?

(c)  Display your variant in the Ensembl browser.

(a) Give your data a name, such as Patient deletion.

Paste 13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738 into the Paste data field then hit Run.

13 genes have been affected.

(b) Use the Filters, selecting Consequence is transcript_ablation, Add.

Yes, there is deletion of complete transcripts of PDS5B, N4BP2L1, BRCA2, RNY1P4, IFIT1P1, ATP8A2P2, N4BP2L2, N4BP2L2-IT2 and three genes without official symbols: ENSG00000212293, ENSG00000270008 and ENSG00000277151.

(c) To view your variant in the browser click on the location link in the results table 13: 32307062-32908738. The link will open the Region in detail view in a new tab. If you have given your data a name it will appear automatically in red. If not, you may need to Configure this page and add it under Personal Data.
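
The VCF line used in this exercise can also be unpacked with a short script, for example to build the location string shown in the results table. This is a sketch only; the function name parse_sv and the dictionary keys are made up for this example.

```python
def parse_sv(line):
    """Parse an eight-column VCF line describing a structural variant."""
    chrom, pos, ident, ref, alt, qual, filt, info = line.split()[:8]
    info_fields = dict(item.split("=", 1) for item in info.split(";"))
    return {
        "chrom": chrom,
        "start": int(pos),
        "id": ident,
        "type": info_fields["SVTYPE"],
        "end": int(info_fields["END"]),
    }


sv = parse_sv("13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738")

# Location string in the chromosome:start-end form used by the browser.
region = f"{sv['chrom']}:{sv['start']}-{sv['end']}"
# Distance between the two coordinates, in base pairs.
span = sv["end"] - sv["start"]
print(sv["type"], region, span)
```

The two coordinates are roughly 600 kb apart, which is why the deletion covers several genes.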