Command-line VEP analysis of variants from a 1000 Genomes Project dataset
Ensembl’s Variant Effect Pedirctor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command line options.
In this practical session, we will practice using VEP via command line to annotate a variant call file (VCF).
The VCF you will use contains variant calls for Homo Sapiens chromosome 13 from the IGSR: The International Genome Sample Resource. This file was extracted using Ensembl’s Data Slicer for human variation aligned to GRCh38, with the IGSR British in England and Scotland GBR population samples subsetted. The file is available here
Installation of Docker and VEP:
For this tutorial, we will be using VEP as installed by Docker.
To run the analysis on your computer, you will need to have installed Docker. Docker requires you to have admistration controls of the computer you install it on. It is available as a command line tool and via desktop interfaces, please see the Docker guides for more info. You may use Docker for Windows and the Windows subsystem for Linux, Docker for Mac, or Docker for Linux. Please install whichever version suits your system.
Once docker is installed, you will need to install VEP via using Dockerusing the linked instructions.
If you get stuck at these steps, ask your trainer for help, or share a computer with another participant who has it working.
Directories and files required:
You must create a volume space for VEP to access as a linked volume to the docker:
mkdir $HOME/vep_data
Once you have created this directory, you can run the following command to start an interactive job where we will run the following commands in this tutorial.
docker run -t -i -v $HOME/vep_data:/data ensemblorg/ensembl-vep
Note - In the example above, data in $HOME/vep_data will be accessible by both the local machine and VEP Docker. The Ensembl VEP API, plugins and their dependencies (e.g. Perl APIs, Bio::DB::HTS, htslib, …) are already installed in the image.
The subsetted IGSR VCF file name is: 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf
You can download it into the docker volume with
curl https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf
Tutorials for using VEP:
A General Tutorial for command line VEP is available to try out or compare to if you needsome guidance.
You can use the options guide at Ensembl VEP Options to help find the commands you need.
Note - the commands below do not make use of VEP caches, this is for the sake of easier install for this tutorial, but in practical applications, VEP caches will greatly improve run speed.
Exercises:
- Explore the VCF file with any text viewer to explore its contents. What do lines denoted by “##” represent? What are the key headers in the file?
You can use VCF file specifications as a reminder.
- Use the command-line VEP tool to annotate the variants in the IGSR VCF file and output to VCF format. Here is an example command:
vep -i [input_file] -o [output_file] --database
We recommend using the default VEP output format, as it’s easier to read on screen, but you can preserve the output as VCF by using --vcf
.
-
Explore the output of VEP in any text viewer to explore the contents. What genes are affected by the variants in the file? Why are there multiple results for each variant?
-
The VEP cache contains Allele Frequency (AF) data available for known variants from multiple sources. As this data is from the 1000 genomes it already contains allele frequency data from that resource. Use the appropriate option to add AF data from gnomAD genomes, then write the output to a new file (something like
-o chr13_with_AF.txt
). -
CHECK CHECK Use the filter_vep tool on the output of step 6 to find variants that are:
a. in BRCA2
and
b. are a MANE_SELECT
transcript and
c. which have an AF of 0.01 or less in the 1000 Genomes Phase 3 data AF
.
- Why do you think we have used an AF filter here?
- From the original output of exercise 2, what has been the overall effect of adding annotations and then filtering using that information?
- Are there any missense variants left?
Bonus!
If you finish all that, try pasting the variant below into Web VEP and select the alphamissense, revel, CADD, mavedb and mutfunc options.
What do you see in the results?
13 48040982 . A T
Exercise 1
You can open the VCF file with any text editor, or by using a command such as: vim 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf
. Then use the arrow keys to navigate. When done, use the ESC key to switch to normal mode, then write :q! (quit without saving). Lines starting with ## indicate information headers, which contain information about the file content and how ia=t was processed.
Exercise 2
vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR.vcf --database
Your own script may not look exactly like this and you may employ different flags, here are some common examples:
--input_file
or-i
Allows you to specify the location of the input file.--output_file
or-o
Allows you to specify the name of the output file.--force_overwrite
Allows VEP to overwrite a pre-existing output file with the same name.--genomes
Points VEP to the Ensembl Genomes (non-vertebrates) server. (not used for this homo sapiens example)--database
Enable VEP to use local or remote databases.--cache
Enables the use of the cache (this can speed up VEP significantly).--cache_verson
Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.--check_existing
Checks for the existence of known variants that are co-located with your input variants.--offline
Enables offline mode (no database connections are made).- ’–af’ Add the global allele frequency (AF) from 1000 Genomes Phase 3 data for any known co-located variant to the output.
Exercise 3
You may use a similar less command or method to open the annotated VCF file:
less -S VEP_annot_chr13_GBR.vcf
There are two genes listed in the annotations, ZAR1L ENSG00000189167 and BRCA2 ENSG00000139618. There are multiple rows for each variant representing the variant consequence in different transcripts.
Exercise 4
Rerun VEP with the following command:
vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can.txt --canonical --mane --symbol --database
Exercise 5
Rerun VEP with the following command:
vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt --canonical --mane --symbol --af --database
Exercise 6
Filter for BRCA2 and MANE with the following command:
filter_vep -i VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt -o VEP_annot_FILT.txt \--filter \"((SYMBOL is BRCA2) and (MANE_SELECT exists) and (AF <= 0.01))"
By adding transcript set, gene symbol and allele frequency information, then using filters based on these, we can reduce the annotated variant set down.
To filter to include only the missense results you could add and (Consequence is missense_variant)
to the command above.
There is a missense variant - rs80358762 - you can explore phenotype information related to that variant here. Note that although it’s a rare missense variant in BRCA2, the evidence from multiple ClinVar submitters indicate that it is benign.
Bonus!
This variant is novel and does not have an accessioned ID and is not present in the allele frequency resources. Consider just the missense variant in NUDT15, you can see that prediction tools we enabled(alphamissense, revel and CADD) are all relatively low scores, below those that would normally be thought of as pathogenic. However, some ofthe IntAct and MaveDB scores - which are based on experimental data - indicate that the protein structure is significantly impacted, suggesting this variant may merit further investigation.