Command-line VEP analysis of variants from a 1000 Genomes Project dataset

Ensembl’s Variant Effect Pedirctor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command line options.

In this practical session, we will practice using VEP via command line to annotate a variant call file (VCF).

The VCF you will use contains variant calls for Homo Sapiens chromosome 13 from the IGSR: The International Genome Sample Resource. This file was extracted using Ensembl’s Data Slicer for human variation aligned to GRCh38, with the IGSR British in England and Scotland GBR population samples subsetted. The file is available here

Installation of Docker and VEP:

For this tutorial, we will be using VEP as installed by Docker.

To run the analysis on your computer, you will need to have installed Docker. Docker requires you to have admistration controls of the computer you install it on. It is available as a command line tool and via desktop interfaces, please see the Docker guides for more info. You may use Docker for Windows and the Windows subsystem for Linux, Docker for Mac, or Docker for Linux. Please install whichever version suits your system.

Once docker is installed, you will need to install VEP via using Dockerusing the linked instructions.

If you get stuck at these steps, ask your trainer for help, or share a computer with another participant who has it working.

Directories and files required:

You must create a volume space for VEP to access as a linked volume to the docker:

mkdir $HOME/vep_data

Once you have created this directory, you can run the following command to start an interactive job where we will run the following commands in this tutorial.

docker run -t -i -v $HOME/vep_data:/data ensemblorg/ensembl-vep

Note - In the example above, data in $HOME/vep_data will be accessible by both the local machine and VEP Docker. The Ensembl VEP API, plugins and their dependencies (e.g. Perl APIs, Bio::DB::HTS, htslib, …) are already installed in the image.

The subsetted IGSR VCF file name is: 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

You can download it into the docker volume with

curl https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

Tutorials for using VEP:

A General Tutorial for command line VEP is available to try out or compare to if you needsome guidance.

You can use the options guide at Ensembl VEP Options to help find the commands you need.

Note - the commands below do not make use of VEP caches, this is for the sake of easier install for this tutorial, but in practical applications, VEP caches will greatly improve run speed.

Exercises:

Explore the VCF file with any text viewer to explore its contents. What do lines denoted by “##” represent? What are the key headers in the file?

You can use VCF file specifications as a reminder.

Use the command-line VEP tool to annotate the variants in the IGSR VCF file. Here is an example command:

vep -i [input_file] -o [output_file] --database --symbol

We recommend using the default VEP output format, as it’s easier to read on screen, but you can preserve the output as VCF by using --vcf.

Explore the output of VEP in any text viewer to explore the contents. What genes are affected by the variants in the file? Why are there multiple results for each variant?
Re-run the command-line VEP tool to annotate the variants in your VCF including if they occur in a MANE and Ensembl Canonical transcript. Save the output of this query into a separate output file in the default text format (omit –vcf).
Use the filter_VEP tool to find variants that are located within the BRCA2 gene in a MANE transcript. Are there any missense variants present in this filter?

From the original output of exercise 2, what has been the overall effect of adding annotations and then filtering using that information?
Are there any missense variants left?

Bonus!

If you finish all that, try pasting the variant below into Web VEP and select the alphamissense, revel, CADD, mavedb and mutfunc options.

What do you see in the results?

13 48040982 . A T

Exercise 1

You can open the VCF file with any text editor, or by using a command such as: vim 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf. Then use the arrow keys to navigate. When done, use the ESC key to switch to normal mode, then write :q! (quit without saving). Lines starting with ## indicate information headers, which contain information about the file content and how it was processed.

Exercise 2

vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR.vcf --database

Your own script may not look exactly like this and you may employ different flags, here are some common examples:

--input_file or -i Allows you to specify the location of the input file.
--output_file or -o Allows you to specify the name of the output file.
--force_overwrite Allows VEP to overwrite a pre-existing output file with the same name.
--genomes Points VEP to the Ensembl Genomes (non-vertebrates) server. (not used for this homo sapiens example)
--database Enable VEP to use local or remote databases.
--cache Enables the use of the cache (this can speed up VEP significantly - not used here for install simplicity).
--cache_verson Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.
--check_existing Checks for the existence of known variants that are co-located with your input variants.
--offline Enables offline mode (no database connections are made).

Exercise 3

You may use a similar less command or method to open the annotated VCF file:

less -S VEP_annot_chr13_GBR.vcf

There are two genes listed in the annotations, ZAR1L ENSG00000189167 and BRCA2 ENSG00000139618. There are multiple rows for each variant representing the variant consequence in different transcripts.

Exercise 4

Rerun VEP with the following command:

vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can.txt --canonical --mane --symbol --database

Exercise 5

Filter for BRCA2 and MANE with the following command:

filter_vep -i VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt -o VEP_annot_FILT.txt --filter "((SYMBOL is BRCA2) and (MANE_SELECT exists))"

By adding transcript set and gene symbol then using filters based on these, we can reduce the annotated variant set down.

To filter to include only the missense results you could add and (Consequence is missense_variant) to the command above.

There is a missense variant - rs80358762 - you can explore phenotype information related to that variant here. Note that although it’s a rare missense variant in BRCA2, the evidence from multiple ClinVar submitters indicate that it is benign.

Bonus!

This variant is novel and does not have an accessioned ID and is not present in the allele frequency resources. Consider just the missense variant in NUDT15, you can see that prediction tools we enabled(alphamissense, revel and CADD) are all relatively low scores, below those that would normally be thought of as pathogenic. However, some ofthe IntAct and MaveDB scores - which are based on experimental data - indicate that the protein structure is significantly impacted, suggesting this variant may merit further investigation.