Command-line VEP analysis of variants against the CHM13 Telomere to Telomere Genome
Please note - it is best to attempt this practical AFTER attempting the practical “VEP CMD Human” for 1000 Genomes Variants on the GRCh38 human reference genome.
Ensembl’s Variant Effect Pedirctor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command line options.
In this practical session, we will practice using VEP via command line to annotate a variant call file (VCF) containing a short list of variants which were mapped to the CHM13 Telomere to Telomere Genome. This example file is available here
For more info, there is a blog about running Ensembl VEP for alternative human assemblies and a guide for running VEP on the HPRC assemblies.
Installation of Docker and VEP:
For this tutorial, we will be using VEP as installed by Docker.
To run the analysis on your computer, you will need to have installed Docker. Docker requires you to have admistration controls of the computer you install it on. It is available as a command line tool and via desktop interfaces, please see the Docker guides for more info. You may use Docker for Windows and the Windows subsystem for Linux, Docker for Mac, or Docker for Linux. Please install whichever version suits your system.
Once docker is installed, you will need to install VEP via using Dockerusing the linked instructions.
If you get stuck at these steps, ask your trainer for help, or share a computer with another participant who has it working.
Directories and files required:
You must create a volume space for VEP to access as a linked volume to the docker:
mkdir $HOME/vep_data
Once you have created this directory, you can run the following command to start an interactive job where we will run the following commands in this tutorial.
docker run -t -i -v $HOME/vep_data:/data ensemblorg/ensembl-vep
Note - In the example above, data in $HOME/vep_data will be accessible by both the local machine and VEP Docker. The Ensembl VEP API, plugins and their dependencies (e.g. Perl APIs, Bio::DB::HTS, htslib, …) are already installed in the image.
You will require the following files from Ensembl FTP directory for VEP for the CHM13 Assembly
The GFF file (contains gene annotations) and its index:
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/ensembl/geneset/2022_07/genes.gff3.bgz
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/ensembl/geneset/2022_07/genes.gff3.bgz.csi
The fasta sequence file (bgzipped) for CHM13 and its index :
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/genome/softmasked.fa.bgz
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/genome/softmasked.fa.bgz.gzi
There is a VEP cache avaiable for CHM13 available to add more VEP annotation info:
curl -O https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/indexed_vep_cache/Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
Place this into the vep_data folder and uncompress:
tar xzf Homo_sapiens-GCA_009914755.4-2022_10.tar.gz
If you havent as yet, get the example file into the practical: curl -O https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/example_vars_CHM13T2T.txt
Tutorials for using VEP:
A General Tutorial for command line VEP is available to try out or compare to if you needsome guidance.
You can use the options guide at Ensembl VEP Options to help find the commands you need.
Note - the commands below do not make use of VEP caches, this is for the sake of easier install for this tutorial, but in practical applications, VEP caches will greatly improve run speed.
Exercises:
-
Explore the example file, how many variants are listed in the file? Are any multi-allelic?
-
Use the command-line VEP tool to annotate the variants in the file and output format with the GFF and sequence fasta. Include the gene symbol. Here is an example command:
vep -i [input_file] -o [output_file] --gff [gene annotation file] --fasta [genomic sequence fasta file]
-
Explore the output of VEP in any text viewer to explore the contents. Are the identifiers of the genes the same as for what you found in GRCh38? What are the consequences of variants are present in the file?
-
Now include the use of the cache, and include the option to label canonical transcripts. You will need to include the options to direct to the cache
--species homo_sapiens_gca009914755v4 --cache_version 107
-
Now run a filter VEP command to filter for variants in BRCA and canonical transcripts. How many variants remain?
Exercise 1
You can open the example file with any text editor, or by using a command such as: less example_vars_CHM13T2T.txt
. Then use the arrow keys to navigate. When done, use the ESC key to switch to normal mode, then write :q! (quit without saving). There are 5 variants in this example file. rs1230223847 lists two possible allele changes, A > C and A > T.
Exercise 2
vep -i example_vars_CHM13T2T.txt --gff genes.gff3.bgz --fasta softmasked.fa.bgz -o chm13_example_variants.txt --symbol
Your own script may not look exactly like this and you may employ different flags, here are some common examples:
--input_file
or-i
Allows you to specify the location of the input file.--output_file
or-o
Allows you to specify the name of the output file.--force_overwrite
Allows VEP to overwrite a pre-existing output file with the same name.--genomes
Points VEP to the Ensembl Genomes (non-vertebrates) server. (not used for this homo sapiens example)--database
Enable VEP to use local or remote databases.--cache
Enables the use of the cache (this can speed up VEP significantly).--cache_verson
Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.--check_existing
Checks for the existence of known variants that are co-located with your input variants.--offline
Enables offline mode (no database connections are made).
Exercise 3
You may use a similar less command or method to open the annotated VCF file:
less -S chm13_example_variants.txt
There are two genes listed in the annotations, ZAR1L ENSG05220054524 and BRCA2 ENSG05220054575. These are different IDs to those in GRCh38. The consequences include: 5_prime_UTR_variant downstream_gene_variant frameshift_variant intron_variant missense_variant splice_acceptor_variant synonymous_variant upstream_gene_variant
Exercise 4
Rerun VEP with the following command:
vep -i example_vars_CHM13T2T.txt --offline --species homo_sapiens_gca009914755v4 --cache_version 107 --fasta chm13_softmasked.fa.bgz --symbol --canonical -o example_vars_CHM13T2T_cache_CAN.txt
Exercise 5
filter_vep -i example_vars_CHM13T2T_cache_CAN.txt --filter "(SYMBOL is BRCA2) and (CANONICAL = YES)" -o example_vars_CHM13T2T_FILT_CAN_BRCA.txt
There are six variants remaining in the file. The multi-allelic rs1230223847 is listed for each of its alleles. Congratulations - you can now annotate variants on the Telomere to Telomere human genome!