Web VEP analysis of variants from a MaveDB dataset
Introduction
The Ensembl Variant Effect Predictor (VEP) is a powerful tool for annotating genomic variants. Ensembl VEP is accessible via web, REST API and command-line options.
In this practical session, we will practice using Ensembl VEP via the web interface to annotate variants in HGVS format. The variants we will use were from an assay of exon 10 in the Homo sapiens BAP1 gene from the MaveMD collection on MaveDB. You can access the data on MaveDB here via the Scores file.
Ensembl VEP tutorials
A general Ensembl VEP tutorial for the web interface is available on the EMBL-EBI Training website to try out or compare to if you need some guidance. You can use the Ensembl VEP web documentation pages to help find and understand the configurations you will need.
Launch the Ensembl VEP web interface via the Ensembl navigation bar or by clicking here.
Exercises
Ex 1. We will make use of variants from the Scores file mentioned above. Download the Scores file for BAP1 exon 10 from MaveDB, and open it in any tool that can view csv files. Note the “hgvs_nt” and the “scores” columns, as we will use these for input and filtering respectively.
Ex 2. Explore the page info at BAP1 exon 10 from MaveDB.
- What does the distribution of scores look like?
- Using the “Clinical view”, do you note any pattern of Pathogenic or benign variants across score bins?
- What specific transcript ID was used for this assay?
Ex 3. Explore that transcript information by searching it on the Ensembl website. What type of transcript is this? The transcript is flagged as the MANE Select transcript of the BAP1 gene. What does the MANE Select flag tell you?
Ex 4. Use the Ensembl VEP web interface to annotate the BAP1 exon 10 variants from Exercise 1. You will need to supply the list of HGVS format variants, either by extracting that column and uploading as a file, or by copy and pasting in the list of variants. Select the following Ensembl VEP options and any others that interest you:
- enable “gnomAD (genomes) allele frequencies”
- enable “Protein Matches”
- enable “AlphaMissense” pathogenicity prediction
- enable “SpliceAI” splicing predictions
- make sure “Gene Symbol” is enabled
- make sure “MANE” is enabled
When you have enabled those options, submit the job with “Run”.
Ex 5. Once the job completes, explore your Ensembl VEP results on the web interface to answer the below. Download your results in TXT format so you can count the number of results.
- How many of these variants have been observed in the population?
- What is the most common coding consequence?
- How many results are in the output text file, why are there more than the number you submitted?
Ex 6. Use the Ensembl VEP filter options to filter your results for only those corresponding to the feature (transcript ID) to that identified in Exercise 2. Export the filtered TXT file.
Ex 7. Using the scores file from MaveDB, extract the variants with with scores lower than -0.09 only. Use this list to filter your Ensembl VEP result and explore the consequence types of these, are there any patterns that you notice? You may use the awk and grep bash commands for this step to help you, or you can try other ways of merging the input table with your Ensembl VEP results.
Exercise 1
Open the file “urn_mavedb_00000662-k-1_scores.csv” in any text editor or by using a command such as less -S urn_mavedb_00000662-k-1_scores.csv
If you have a csv viewer like libreoffice, googledocs, excel, you can open up this comma delimited file to view. Hint, you can use a command like awk -F',' '{print $2}' urn_mavedb_00000662-k-1_scores.csv > bap1_mdb_hgvs.txt to extract the IDs from column 2 to use as input for Ensembl VEP. Mind to remove the header when pasting/uploading file. In case of server issues, the file used for this practical is available here.
Exercise 2
Most scores are close to 0, in bins ranging between -0.02 and 0.02. The Clinical view shows there are a few pathogenic variants around the score range of 0.10. This assay targeted the Coding Sequence (CDS), splice-site containing intron and 3’UTR of the transcript: ENST00000460680.6.
Exercise 3
It is a protein coding transcript. MANE select transcripts are those where Ensembl/Gencode (ENST) and RefSeq (NM) transcripts that are 100% identical (5’UTR, CDS and 3’UTR) and 3) are highly conserved, expressed and well-supported.
Exercise 4
Your Ensembl VEP job should take a few mins to run. If you find a blank result, check that you have only included the hgvs and no header or other info as input. In case of server issues, a pre-run output file is available here
Exercise 5
The summary table on the top of the Ensembl VEP results show Novel / existing variants 656 (57.1) / 493 (42.9). Missense variant is the most common coding consequence. There are many more result lines than input variants, as Ensembl VEP will report consequences for all transcripts and genes overlapped by the variant coordinate.
19794 lines in vep_bap1_result.txt
Exercise 6
In the filter options, use the drop down menu to select “Feature” “is” and paste ENST00000460680.6 then click “Add”. This limits the result to only those in the same transcript used in the assay. You may export the file, and in case of server issues, the transcript filter file is here
Exercise 7
It’s possible to programming languages and tools to merge the tables, but for this practical, we will use awk and grep to extract to simplify for this exercise. When working at high volumes of variants, we recommend automating steps.
First use awk to filter the table for scores lower than -0.09:
awk -F',' 'NR==1; $3<-0.09' urn_mavedb_00000662-k-1_scores.csv> bap1_exon10_low_scores.txt
That command keeps the header line intact. Run another awk command to extract only the variant names “hgvs_nt”:
awk -F',' '{print $2}' bap1_exon10_low_scores.txt > bap1_exon10_hgvs_low_scores.txt
You can use that file as input into grep to filter the Ensembl VEP results from Exercise 6:
grep -f bap1_exon10_hgvs_low_scores.txt vep_bap1_ex10_maveDB.Feature_is_ENST00000460680.6.txt > vepbap1_low_scores.txt
The consequence type is primarly made up of stop gained and frameshift variants. Do these align with the scores measured by the assay given the assay type? Are any variant types here unexpected?
NB - if you have access to a command line version of Ensembl VEP, this practical can use a command such as: ./vep --af --af_gnomadg --appris --biotype --buffer_size 500 --check_existing --distance 5000 --domains --mane --polyphen b --pubmed --regulatory --show_ref_allele --sift b --species homo_sapiens --symbol --transcript_version --tsl --uploaded_allele --cache --input_file [input_data] --output_file [output_file]
Splice AI and AlphaMissense are plugins so you would need to download and install these to retrieve scores.