Command-line VEP analysis of variants from a 1000 Genomes Project dataset

Ensembl’s Variant Effect Pedirctor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command line options.

In this practical session, we will practice using VEP via command line to annotate a variant call file (VCF).

The VCF you will use contains variant calls for Homo Sapiens chromosome 13 from the IGSR: The International Genome Sample Resource. This file was extracted using Ensembl’s Data Slicer for human variation aligned to GRCh38, with the IGSR British in England and Scotland GBR population samples subsetted. The file is available here

Directories and starting tutorials:

A General Tutorial for command line VEP is available to try out or compare to if you needsome guidance.

You can use the options guide at Ensembl VEP Options to help find the commands you need.

Launch a command prompt. VEP is installed already and should work directly from the terminal command line. Navigate to /home/ubuntu/Course_Materials/VEP_CLI/ to carry out the exercises.

The subsetted IGSR VCF file name is: 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

It should already be available in the course materials area, but if not you can download it:

wget https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf

Exercises:

Explore the VCF file with any text viewer to explore its contents. What do lines denoted by “##” represent? What are the key headers in the file?

You can use VCF file specifications as a reminder.

Use the command-line VEP tool to annotate the variants in the IGSR VCF file and output to VCF format. Here is an example command:

vep -i [input_file] -o [output_file] --cache --offline

We recommend using the default VEP output format, as it’s easier to read on screen, but you can preserve the output as VCF by using --vcf.

Explore the output of VEP in any text viewer to explore the contents. What genes are affected by the variants in the file? Why are there multiple results for each variant?
The VEP cache contains Allele Frequency (AF) data available for known variants from multiple sources. As this data is from the 1000 genomes it already contains allele frequency data from that resource. Use the appropriate option to add AF data from gnomAD genomes, then write the output to a new file (something like -o chr13_with_AF.txt).
Use the filter_vep tool on the output of step 6 to find variants that are:

a. in BRCA2 and

b. are a MANE_SELECT transcript and

c. which have an AF of 0.01 or less in the overall gnomAD genomes population gnomADg_AF.

Why do you think we have used an AF filter here?
From the original output of exercise 2, what has been the overall effect of adding annotations and then filtering using that information?
Are there any missense variants left?

Bonus!

If you finish all that, try pasting the variant below into Web VEP and select the alphamissense, revel, CADD, mavedb and mutfunc options.

What do you see in the results?

13 48040982 . A T

Exercise 1

You can open the VCF file with any text editor, or by using a command such as: less -S 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf. Then use the arrow keys to navigate, and press Q when ready to quit. Lines starting with ## indicate information headers, which contain information about the file content and how ia=t was processed.

Exercise 2

vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR.vcf --cache --offline

Your own script may not look exactly like this and you may employ different flags, here are some common examples:

--input_file or -i Allows you to specify the location of the input file.
--output_file or -o Allows you to specify the name of the output file.
--force_overwrite Allows VEP to overwrite a pre-existing output file with the same name.
--genomes Points VEP to the Ensembl Genomes (non-vertebrates) server. (not used for this homo sapiens example)
--cache Enables the use of the cache (this can speed up VEP significantly).
--cache_verson Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.
--check_existing Checks for the existence of known variants that are co-located with your input variants.
--offline Enables offline mode (no database connections are made).

Exercise 3

You may use a similar less command or method to open the annotated VCF file:

less -S VEP_annot_chr13_GBR.vcf

There are two genes listed in the annotations, ZAR1L ENSG00000189167 and BRCA2 ENSG00000139618. There are multiple rows for each variant representing the variant consequence in different transcripts.

Exercise 4

Rerun VEP with the following command:

vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can.txt --canonical --mane --cache --symbol --cache --offline

Exercise 5

Rerun VEP with the following command:

vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt --canonical --mane --cache --symbol --af_gnomadg --offline

Exercise 6

Filter for BRCA2 and MANE with the following command:

filter_vep -i VEP_annot_chr13_GBR_MANE_Can_gnomadAF.txt -o VEP_annot_FILT.txt \--filter \"((SYMBOL is BRCA2) and (MANE_SELECT exists) and (gnomADg_AF <= 0.01))"

By adding transcript set, gene symbol and allele frequency information, then using filters based on these, we can reduce the annotated variant set down from 1770 to 78.

To filter to include only the missense results you could add and (Consequence is missense_variant) to the command above.

There is a missense variant - rs80358762 - you can explore phenotype information related to that variant here. Note that although it’s a rare missense variant in BRCA2, the evidence from multiple ClinVar submitters indicate that it is benign.

Bonus!

This variant is novel and does not have an accessioned ID and is not present in the allele frequency resources. Consider just the missense variant in NUDT15, you can see that prediction tools we enabled(alphamissense, revel and CADD) are all relatively low scores, below those that would normally be thought of as pathogenic. However, some ofthe IntAct and MaveDB scores - which are based on experimental data - indicate that the protein structure is significantly impacted, suggesting this variant may merit further investigation.