Command-line VEP analysis of variants in Oryza sativa (rice) via Docker
In the Mapping and variant calling practical session, you produced a VCF file with variant calls for Oryza sativa chromosome 10 using sequencing reads from the 3,000 Rice Genomes Project.
- Use the command-line VEP tool via Docker to annotate the variants in your VCF file. You can use your own file from the previous module, but it can also be found within the Docker image at /home/vep/variant_data/SAMEA2569438.chr10.filt.vcf.gz.
- Re-run the command-line VEP tool via Docker to annotate the variants in your VCF file, including SIFT scores and affected protein domains. Save the output of this query in a separate output file.
- Use the filter_vep tool to find variants that are located within genes in the Disease Resistance Protein (RP) Panther family.
- Use the filter_vep tool to find missense variants that have a deleterious SIFT score and are located within genes in the RP Panther family.
- You can run VEP via Docker using the following script:
docker run -t -i -v $HOME/Desktop/VEP/Plants/vep_data:/data csicunam/bioinformatics_iamz:latest \
  vep -i variant_data/SAMEA2569438.chr10.filt.vcf.gz -o /data/output_1.txt --dir /data \
  --cache --cache_version 57 --genomes --species oryza_sativa --force_overwrite --check_existing
Your own script may not look exactly like this and you may employ different flags:
--input_file or -i: Allows you to specify the location of the input file. If your input file is located within your local working directory, don’t forget to specify this by preceding the file name with /data/ (the path at which your local directory is mounted inside the container).
--output_file or -o: Allows you to specify the name of the output file.
--force_overwrite: Allows VEP to overwrite a pre-existing output file with the same name.
--genomes: Points VEP to the Ensembl Genomes (non-vertebrates) server.
--cache: Enables the use of the cache (this can speed up VEP significantly).
--cache_version: Allows you to specify the cache version. This should be used with Ensembl Genomes caches, since their version numbers do not match Ensembl versions. For example, the VEP/Ensembl version may be 110 and the Ensembl Genomes version 57.
--check_existing: Checks for the existence of known variants that are co-located with your input variants.
--offline: Enables offline mode (no database connections are made).
View the output for exercise 1 here.
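Once the run has finished, a quick tally of consequence types can help confirm that the annotation worked. The following is a minimal sketch, assuming the default tab-delimited VEP output, in which header lines start with `#` and the Consequence column is the 7th field. The function name `count_consequences` is made up for illustration and is not part of VEP.

```shell
# Tally the values of the Consequence column (7th field) in a
# default tab-delimited VEP output file, most frequent first.
# Assumption: header lines start with '#' and are skipped.
count_consequences() {
    grep -v '^#' "$1" | cut -f7 | sort | uniq -c | sort -rn
}
```

With the volume mapping used above, the output file should be on the host at `$HOME/Desktop/VEP/Plants/vep_data/output_1.txt`, so the call would be `count_consequences "$HOME/Desktop/VEP/Plants/vep_data/output_1.txt"`. Note that the counts are per output line (one per variant and transcript pair), not per variant, and that lines with several comma-separated consequence terms are counted as their own category.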
- Use the same query as in the previous exercise with two additional flags (--sift b and --domains):
docker run -t -i -v $HOME/Desktop/VEP/Plants/vep_data:/data csicunam/bioinformatics_iamz:latest \
  vep -i variant_data/SAMEA2569438.chr10.filt.vcf.gz -o /data/output_SIFT_and_domains.txt --dir /data \
  --cache --cache_version 57 --genomes --species oryza_sativa --check_existing --sift b --domains
The options are as follows:
--sift b: Returns both the prediction and the score from the SIFT algorithm, which predicts whether the amino acid substitution caused by a missense variant is likely to affect protein function.
--domains: Adds the names of the overlapping protein domains to the VEP output.
View the output for exercise 2 here.
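In the default tab-delimited output, the SIFT and domain annotations are stored as `key=value` pairs in the final Extra column rather than in columns of their own. The sketch below pulls out the values for one key. It assumes that Extra is the 14th field and that pairs are separated by semicolons. The helper name `extra_value` is hypothetical, and the exact format of the DOMAINS values may differ between VEP versions.

```shell
# Print the values of one key (for example SIFT or DOMAINS) from the
# semicolon-separated key=value pairs in the Extra column (14th field)
# of a default tab-delimited VEP output file.
# Usage: extra_value KEY FILE
extra_value() {
    key="$1"
    file="$2"
    grep -v '^#' "$file" | cut -f14 | tr ';' '\n' | grep "^${key}=" | cut -d= -f2-
}
```

For example, `extra_value SIFT "$HOME/Desktop/VEP/Plants/vep_data/output_SIFT_and_domains.txt" | sort | uniq -c` would summarise the SIFT values in the file written by the command above.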
- The following script uses the filter_vep tool to find variants that are located within genes in the RP Panther family:
docker run -t -i -v $HOME/Desktop/VEP/Plants/vep_data:/data csicunam/bioinformatics_iamz:latest \
  filter_vep -i /data/output_SIFT_and_domains.txt -o /data/output_filtered_e3.txt \
  --filter "domains matches PTHR23155"
View the output for exercise 3 here.
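To see which genes the filtered variants fall in, the gene identifiers can be listed from the filtered file. This is a small sketch, assuming the filtered output keeps the default tab-delimited layout with the Gene column as the 4th field. The function name `list_genes` is made up for illustration.

```shell
# List the distinct gene identifiers (4th field) found in a
# tab-delimited VEP or filter_vep output file, skipping '#' header lines.
list_genes() {
    grep -v '^#' "$1" | cut -f4 | sort -u
}
```

For example, `list_genes "$HOME/Desktop/VEP/Plants/vep_data/output_filtered_e3.txt" | wc -l` would give the number of distinct genes carrying at least one variant in the filtered set.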
- The following script uses the filter_vep tool to find missense variants that have a deleterious SIFT score and are located within genes in the RP Panther family:
docker run -t -i -v $HOME/Desktop/VEP/Plants/vep_data:/data csicunam/bioinformatics_iamz:latest \
  filter_vep -i /data/output_SIFT_and_domains.txt -o /data/output_filtered_e4.txt \
  --filter "domains matches PTHR23155" --filter "SIFT is deleterious"
View the output for exercise 4 here.
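When several --filter flags are given, filter_vep combines them with a logical AND, so only lines that satisfy both conditions are kept. As a rough sanity check on the result, the same selection can be approximated with plain text matching on the unfiltered file. This is only a sketch, not a replacement for filter_vep: it assumes the Extra column contains entries of the form `SIFT=deleterious(score)`, and the function name `count_rp_deleterious` is hypothetical.

```shell
# Count data lines that mention the Panther family PTHR23155 and carry a
# SIFT prediction of exactly "deleterious" (entries such as
# "deleterious_low_confidence(...)" are not counted).
# Assumption: SIFT values are written as SIFT=deleterious(score).
count_rp_deleterious() {
    grep -v '^#' "$1" | grep -F 'PTHR23155' | grep -cF 'SIFT=deleterious('
}
```

Running `count_rp_deleterious "$HOME/Desktop/VEP/Plants/vep_data/output_SIFT_and_domains.txt"` and comparing the number with the count of data lines in output_filtered_e4.txt gives a quick cross-check. Small differences are possible, because the text match looks at the whole line rather than at specific fields.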