Command-line VEP analysis of variants from a 1000 Genomes Project dataset
Ensembl’s Variant Effect Predictor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via a web interface, a REST API and the command line.
In this practical session, we will practise using VEP on the command line to annotate a variant call format (VCF) file.
The VCF you will use contains variant calls for Homo sapiens chromosome 13 from IGSR: The International Genome Sample Resource. This file was extracted using Ensembl’s Data Slicer for human variation aligned to GRCh38, subsetted to the IGSR British in England and Scotland (GBR) population samples.
Directories and starting tutorials:
A general tutorial for command-line VEP is available if you need some guidance or want something to compare your approach against.
VEP is installed in the /home/training/ensembl-vep directory. Change to this directory to complete the exercises below.
The subsetted IGSR VCF file name is 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf
Exercises:
-
Open the VCF file in any text viewer and explore its contents. What do lines denoted by “##” represent? What are the key headers in the file? You can use the VCF file specification as a reminder.
-
VEP is installed in the /home/training/ensembl-vep directory. Use the command-line VEP tool to annotate the variants in the IGSR VCF file and output to VCF format. Here is an example command:
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --vcf -o example_homo_sapiens_GRCh38output.vcf
-
Open the VEP output in any text viewer and explore its contents. What genes are affected by the variants in the file?
-
VEP also produces an HTML output file - try exploring this file with a browser tool such as Firefox or Chrome. How many variants were annotated? What is the proportion of variants that are of the consequence “missense_variant”?
-
Re-run the command-line VEP tool to annotate the variants in your VCF, this time including whether they occur in MANE and Ensembl Canonical transcripts. Save the output of this query into a separate output file in the default text format (omit --vcf).
-
Use the filter_vep tool to find variants that are located within the BRCA2 gene in a MANE transcript. Are there any missense variants among the filtered results?
-
If you are done with the above, you may try the Web version of VEP on your file by uploading the VCF. Try the different options available and see how these can add different layers of information. Also note the command used in the job completion page as this can help you when adapting to command line!
- You can open the VCF file with gedit, or by using a command such as:
less -S 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf
Then use the arrow keys to navigate, and press Q when you are ready to quit. Lines starting with ## are meta-information header lines that describe the file format and how the file was processed.
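If you would rather summarise the header than scroll through it, the same information can be pulled out with standard Unix tools. The following is a minimal sketch: the function name vcf_summary is made up for illustration, and it assumes an uncompressed, tab-delimited VCF.

```shell
# vcf_summary is a hypothetical helper, not part of VEP.
# Usage: vcf_summary FILE.vcf
vcf_summary() {
  vcf="$1"
  # Meta-information lines start with "##"
  echo "meta_lines=$(grep -c '^##' "$vcf")"
  # The single "#CHROM" line holds the column headers; show the eight fixed ones
  echo "columns=$(grep -m1 '^#CHROM' "$vcf" | cut -f1-8 | tr '\t' ',')"
  # Every line that does not start with "#" is a variant record
  echo "records=$(grep -vc '^#' "$vcf")"
}
```

Calling vcf_summary with the name of the exercise VCF prints three lines: the number of ## lines, the fixed column headers, and the number of variant records.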
- You can run VEP on the VCF input file using the following command:
./vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR.vcf --vcf --cache
Your own command may not look exactly like this, and you may use different flags:
--input_file
or -i
Allows you to specify the location of the input file.
--output_file
or -o
Allows you to specify the name of the output file.
--force_overwrite
Allows VEP to overwrite a pre-existing output file with the same name.
--genomes
Points VEP to the Ensembl Genomes (non-vertebrates) server. (Not used for this Homo sapiens example.)
--cache
Enables the use of the cache (this can speed up VEP significantly).
--cache_version
Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.
--check_existing
Checks for the existence of known variants that are co-located with your input variants.
--offline
Enables offline mode (no database connections are made).
View the output for exercise 1 here.
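To show how several of the flags above fit together, here is a small wrapper written with the long option names. It is only a sketch: run_vep and the VEP_DRY_RUN variable are names invented for this example, and the choice of --offline and --force_overwrite is an assumption about what you might want, not part of the exercise answer. It expects to be run from the ensembl-vep directory.

```shell
# run_vep is a hypothetical wrapper, not part of VEP.
# Usage: run_vep INPUT.vcf OUTPUT.vcf
# Set VEP_DRY_RUN=1 to print the command instead of running it.
run_vep() {
  # Replace the function's arguments with the full VEP command line
  set -- ./vep --input_file "$1" --output_file "$2" \
         --vcf --cache --offline --force_overwrite
  if [ "${VEP_DRY_RUN:-0}" = 1 ]; then
    echo "$@"   # show what would be run
  else
    "$@"        # run VEP
  fi
}
```

A dry run is a cheap way to check file names before starting a long annotation job, for example: VEP_DRY_RUN=1 run_vep input.vcf output.vcf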
- You may use a similar less command or method to open the annotated VCF file:
less -S VEP_annot_chr13_GBR.vcf
There are two genes listed in the annotations: ZAR1L (ENSG00000189167) and BRCA2 (ENSG00000139618).
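Reading gene names off the screen gets tedious for larger files, so the list can also be extracted from the CSQ annotations. The sketch below assumes that the annotated VCF has a ##INFO=<ID=CSQ line whose description lists the pipe-separated field names after "Format: ", and that a Gene field is among them (SYMBOL is used if present). The function name list_csq_genes is invented for this example.

```shell
# list_csq_genes is a hypothetical helper, not part of VEP.
# Usage: list_csq_genes ANNOTATED.vcf
list_csq_genes() {
  awk -F'\t' '
    /^##INFO=<ID=CSQ/ {
      # Read the field names from the header so no column position is hard-coded
      fmt = $0
      sub(/.*Format: /, "", fmt)
      sub(/".*/, "", fmt)
      n = split(fmt, names, "|")
      for (i = 1; i <= n; i++) col[names[i]] = i
      next
    }
    /^#/ { next }
    {
      # INFO is column 8; find the CSQ entry among its ";"-separated keys
      m = split($8, info, ";")
      for (i = 1; i <= m; i++) {
        if (info[i] !~ /^CSQ=/) continue
        # One comma-separated entry per allele/transcript combination
        k = split(substr(info[i], 5), entries, ",")
        for (j = 1; j <= k; j++) {
          split(entries[j], f, "|")
          if (f[col["Gene"]] != "") print f[col["SYMBOL"]] "\t" f[col["Gene"]]
        }
      }
    }
  ' "$1" | sort -u
}
```

Running list_csq_genes VEP_annot_chr13_GBR.vcf should print one line per distinct gene, with the symbol (if annotated) and the Ensembl gene ID separated by a tab.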
- Open the HTML summary file in a browser, either from the file manager or with a command such as:
firefox VEP_annot_chr13_GBR.vcf_summary.html
Only 1.1% of variants are missense variants. View the output for exercise 4 here.
- Rerun VEP with the following command:
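As a rough command-line cross-check of the summary page, you can count how many records in the annotated VCF mention missense_variant anywhere in their INFO column. This is a sketch under a stated simplification: it counts VCF records, whereas the HTML summary tallies consequences, so the percentage it prints need not match the summary exactly. The function name missense_fraction is invented for this example.

```shell
# missense_fraction is a hypothetical helper, not part of VEP.
# Usage: missense_fraction ANNOTATED.vcf
missense_fraction() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    { total++; if ($8 ~ /missense_variant/) hits++ }
    END {
      if (total) printf "%d of %d records (%.1f%%)\n", hits, total, 100 * hits / total
    }
  ' "$1"
}
```

For example, missense_fraction VEP_annot_chr13_GBR.vcf prints a single line with the count and percentage of records carrying at least one missense annotation.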
./vep -i 13.32315086-32400268.ALL.chr13_GRCh38.genotypes.20170504.GBR.vcf -o VEP_annot_chr13_GBR_MANE_Can.txt --canonical --mane --cache --symbol
View the output for exercise 5 here.
- Filter for BRCA2 and MANE with the following command:
filter_vep -i VEP_annot_chr13_GBR_MANE_Can.txt -o VEP_annot_chr13_GBR_MANE_Can_filt_BRCA_MANE.txt --filter "SYMBOL is BRCA2 and MANE_SELECT exists"
View the output for exercise 6 here. There is a missense variant, rs80358762; you can explore phenotype information related to that variant here.
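To pick the missense rows out of the tab-delimited VEP output without reading every line, a short awk filter can be used as a cross-check. This sketch assumes the default 14-column text layout, with Consequence in column 7 and the key=value Extra column in column 14, and that --symbol and --mane were used so that SYMBOL and MANE_SELECT keys are present. The function name brca2_mane_missense is invented for this example.

```shell
# brca2_mane_missense is a hypothetical helper, not part of VEP.
# Usage: brca2_mane_missense VEP_OUTPUT.txt
brca2_mane_missense() {
  awk -F'\t' '
    /^#/ { next }   # skip the "##" metadata and the "#Uploaded_variation" header
    $7  ~ /missense_variant/ &&
    $14 ~ /(^|;)SYMBOL=BRCA2(;|$)/ &&
    $14 ~ /(^|;)MANE_SELECT=/ {
      # Print variant ID, location and consequence
      print $1 "\t" $2 "\t" $7
    }
  ' "$1"
}
```

It can be run on either the unfiltered file (VEP_annot_chr13_GBR_MANE_Can.txt) or the filter_vep output; in both cases only BRCA2 MANE Select rows with a missense consequence are printed.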