Command-line VEP analysis of variants from a 1,000 Genomes Project dataset

General information

Ensembl’s Variant Effect Predictor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command-line. In this practical session, we will practice using VEP via command-line to annotate a variant call file (VCF).

A General Tutorial for command-line VEP is available to try out or compare to if you need some guidance. You can refer to the options guide at Ensembl VEP Options to help find the commands and flags you need.

Getting started

Open a terminal window on your computer. VEP is installed already and should work directly from the terminal command-line. Navigate to your home directory to carry out the exercises. You can navigate to your home directory by running the following command:

cd

The VCF you will use contains variant calls for Homo sapiens chromosome 13 from the International Genome Sample Resource (IGSR). This file was extracted using the Ensembl Data Slicer tool for variants aligned to GRCh38 reference, with the IGSR British in England and Scotland GBR population samples as a subset. The file is available from the Ensembl FTP site.

This IGSR VCF file name is: 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf. You can find the file in the following directory:

/opt/vep/ensembl-vep/exercise

The file was downloaded using the following command:

wget https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

For optimum performance, Ensembl VEP uses “cache files” or a remote database to read genomic data. Using cache files gives the best performance. The cache for the human GRCh38 genome is located in the following directory:

/opt/vep/ensembl-vep/homo_sapiens

Exercises

Explore the VCF file with any text viewer to explore its contents. What do lines denoted by “##” represent? What are the key headers in the file? You can use VCF specifications as a reminder.
Use the command-line VEP tool to annotate the variants in the IGSR VCF file and output to VCF format. Here is an example command to run VEP:

vep -i [input_file] -o [output_file] --cache --offline

We recommend using the default VEP output format, as it’s easier to read on screen, but you can preserve the output as VCF by using the flag --vcf.

Note that by default, VEP assumes the cache is located in your home directory. If the cache is not located in your home directory, you will need to specify this in your command.
Explore your VEP output in any text viewer to explore the contents. What genes are affected by the variants in the file? Why are there multiple results for each variant?
Let’s annotate our VCF further so that our output includes additional identifier annotations.

Re-run your VEP command from exercise 1 to annotate whether the corresponding transcript is the Canonical transcript of the gene, the MANE Select transcript of the gene and print the corresponding gene symbol. Save the output to a new file. In which column of your output file can you find the additional annotation?
The VEP cache contains allele frequency (AF) data available for known variants from multiple sources. The AF describes how often an allele is present in a population. As this data is from the 1,000 Genomes Project, it already contains AF data from that resource. Use the appropriate option to add AF data from Genome Aggregation Database (gnomAD) genomes and save the output to a new file.
Run filter_vep on the output of exercise 5 to find variants that overlap with the BRCA2 gene and are a MANE_SELECT transcript and have an AF of 0.01 or less in the overall gnomAD genomes population gnomADg_AF.
- Why do you think we have used an AF filter here?
- From the original output of exercise 2, what has been the overall effect of adding annotations and then filtering using that information?
- Are there any missense variants left?

Bonus Exercise!

If you finish all that, try pasting the variant below into web VEP and turn on the pathogenicity predictions for SIFT, PolyPhen, AlphaMissense, REVEL and CADD. What do you see in the results?

13 48040982 . A T

You can open the VCF file with any text editor, or by using a command:

less -S 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

Use the arrow keys to navigate and press Q to exit. Lines starting with ## indicate information headers, which contain information about the file content and how it was processed.
Here is an example command to annotate the variants in the IGSR VCF file using VEP:

vep -i /opt/vep/ensembl-vep/exercise/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR.vcf --cache --dir_cache /opt/vep/ensembl-vep --offline

Your own script may not look exactly like this and you may employ different flags, here are some common examples:

--input_file or -i Allows you to specify the location of the input file.
--output_file or -o Allows you to specify the name of the output file.
--force_overwrite Allows VEP to overwrite a pre-existing output file with the same name.
--genomes Points VEP to the Ensembl Genomes (non-vertebrates) server (not used for this H. sapiens example).
--cache Enables the use of the cache (this can speed up VEP significantly).
--cache_verson Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.
dir_cache Allows you to specify the cache directory to use. By default, VEP assumes the cache is in your home directory. In our case, the cache is located in /opt/vep/ensembl-vep/.
--check_existing Checks for the existence of known variants that are co-located with your input variants.
--offline Enables offline mode (no database connections are made).

Once you’ve run our command, you should find 2 output files: the VEP_annot_chr13_GBR.vcf file contains your VEP results and the VEP_annot_chr13_GBR.vcf_summary.html file contains statistics pertaining to the results of your VEP job.
You may use a similar less command or method to open the annotated VCF file:

less -S VEP_annot_chr13_GBR.vcf

There are two genes listed in the annotations, ZAR1L ENSG00000189167 and BRCA2 ENSG00000139618.

A variant may have different effect depending on which gene transcript it falls. There are multiple rows for each variant representing the variant consequence in the different transcripts.
Re-run VEP with the following command:

vep -i /opt/vep/ensembl-vep/exercise/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can.txt --cache --dir_cache /opt/vep/ensembl-vep --offline --canonical --mane --symbol

In your output file, you will find the additional annotation under the Extra column.
Re-run VEP with the following command:

vep -i /opt/vep/ensembl-vep/exercise/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt --cache --dir_cache /opt/vep/ensembl-vep --offline --canonical --mane --symbol --af_gnomadg
You can filter for BRCA2 and MANE with the following command:

filter_vep -i VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt -o VEP_annot_FILT.txt -filter "((SYMBOL is BRCA2) and (MANE_SELECT exists) and (gnomADg_AF <= 0.01))"

By adding transcript set, gene symbol and allele frequency information, then using filters based on these, we can reduce the annotated variant set down from 1770 to 78.

To filter to include only the missense results you could add and (Consequence is missense_variant) to the command above.

There is a missense variant - rs80358762 - you can explore phenotype information related to that variant here.

Note that although it’s a rare missense variant in BRCA2, the evidence from multiple ClinVar submitters indicate that it is benign.

Bonus Exercise!

This variant is novel and does not have an accessioned ID and is not present in the allele frequency resources. Consider just the missense variant in NUDT15, you can see that prediction tools we enabled (PolyPhen, AlphaMissense, REVEL and CADD) are all relatively low scores, below those that would normally be considered pathogenic. However, the SIFT score indicate that the protein structure is significantly impacted, suggesting this variant may merit further investigation.