Command-line VEP analysis of variants against the CHM13 Telomere to Telomere Genome
Please note - it is best to attempt this practical AFTER attempting the practical “VEP CMD Human” for 1000 Genomes Variants on the GRCh38 human reference genome.
Ensembl’s Variant Effect Pedirctor (VEP) is a powerful tool for annotating genomic variants. VEP is accessible via web, REST API and command line options.
In this practical session, we will practice using VEP via command line to annotate a variant call file (VCF) containing a short list of variants which were mapped to the CHM13 Telomere to Telomere Genome. This example file is available here
For more info, there is a blog about running Ensembl VEP for alternative human assemblies and a guide for running VEP on the HPRC assemblies.
You will require the following files from Ensembl FTP directory for VEP for the CHM13 Assembly
The GFF file (contains gene annotations) and its index:
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/ensembl/geneset/2022_07/genes.gff3.bgz
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/ensembl/geneset/2022_07/genes.gff3.bgz.csi
The fasta sequence file (bgzipped) for CHM13 and its index :
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/genome/softmasked.fa.bgz
curl -O https://ftp.ebi.ac.uk/pub/ensemblorganisms/Homo_sapiens/GCA_009914755.4/vep/genome/softmasked.fa.bgz.gzi
If you havent as yet, get the example file into the practical: curl -O https://ftp.ebi.ac.uk/pub/databases/ensembl/training/output_files/example_vars_CHM13T2T.txt
Tutorials for using VEP:
A General Tutorial for command line VEP is available to try out or compare to if you needsome guidance.
You can use the options guide at Ensembl VEP Options to help find the commands you need.
Note - the commands below do not make use of VEP caches, this is for the sake of easier install for this tutorial, but in practical applications, VEP caches will greatly improve run speed.
Exercises:
Ex 1. Explore the example file, how many variants are listed in the file? Are any multi-allelic?
Ex 2. Use the command-line VEP tool to annotate the variants in the file and output format with the GFF and sequence fasta. Include the gene symbol. Here is an example command:
vep -i [input_file] -o [output_file] --gff [gene annotation file] --fasta [genomic sequence fasta file]
Ex 3. Explore the output of VEP in any text viewer to explore the contents. Are the identifiers of the genes the same as for what you found in GRCh38? What are the consequences of variants are present in the file?
Ex 4. Now run a filter VEP command to filter for variants in BRCA and canonical transcripts. How many variants remain?
Exercise 1
You can open the example file with any text editor, or by using a command such as: less example_vars_CHM13T2T.txt. Then use the arrow keys to navigate. When done, use the Q key to quit without saving. There are 5 variants in this example file. rs1230223847 lists two possible allele changes, A > C and A > T.
Exercise 2
vep -i example_vars_CHM13T2T.txt --gff genes.gff3.bgz --fasta softmasked.fa.bgz -o chm13_example_variants_annotated.txt --symbol
Your own script may not look exactly like this and you may employ different flags, here are some common examples:
--input_fileor-iAllows you to specify the location of the input file.--output_fileor-oAllows you to specify the name of the output file.--force_overwriteAllows VEP to overwrite a pre-existing output file with the same name.--genomesPoints VEP to the Ensembl Genomes (non-vertebrates) server. (not used for this homo sapiens example)--databaseEnable VEP to use local or remote databases.--cacheEnables the use of the cache (this can speed up VEP significantly).--cache_versonAllows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.--check_existingChecks for the existence of known variants that are co-located with your input variants.--offlineEnables offline mode (no database connections are made).
Exercise 3
You may use a similar less command or method to open the annotated VCF file:
less -S chm13_example_variants.txt
There are two genes listed in the annotations, ZAR1L ENSG05220054524 and BRCA2 ENSG05220054575. These are different IDs to those in GRCh38. The consequences include: 5_prime_UTR_variant downstream_gene_variant frameshift_variant intron_variant missense_variant splice_acceptor_variant synonymous_variant upstream_gene_variant
Exercise 4
filter_vep -i chm13_example_variants_annotated.txt --filter "(SYMBOL is BRCA2) and (CANONICAL = YES)" -o chm13_example_variants_annotated_filtered.txt
There are six variants remaining in the file. The multi-allelic rs1230223847 is listed for each of its alleles. Congratulations - you can now annotate variants on the Telomere to Telomere human genome!
For further reading on how aligning reads to the the T2T-CHM13 genome, rather than GRCh38, before carrying out variant calling is potentially more powerful, please see: https://doi.org/10.1126/science.abl3533