CABANA Workshop: Plant bioinformatics - analysis of crop genomics data, 13-17 December 2021

Course Details

Lead Trainer
Ben Moore
Associate Trainers
Event Dates
2021-12-13 until 2021-12-17
Location
  Virtual
Description
Work with the Ensembl Outreach team to get to grips with the Ensembl Plants browser, accessing gene, variation and comparative genomics data.

Demos and exercises

Ensembl Plants genes and transcripts

The front page of Ensembl Plants is found at plants.ensembl.org. It contains lots of information and links to help you navigate Ensembl Plants:

At the top left you can see the current release number and what has come out in this release.

Click on View full list of all species.

Click on the common name of your species of interest to go to the species homepage. We’ll click on Triticum aestivum.

Here you can see links to example pages and to download flatfiles. To find out more about the genome assembly and genebuild, click on More information and statistics.

Here you’ll find a detailed description of how the genome was produced and links to the original source. You will also see details of how the genes were annotated.

We’re going to look at the wheat TraesCS3D02G273600 gene. From plants.ensembl.org or the wheat species homepage, type TraesCS3D02G273600 into the search bar and click the Go button.

The gene tab

Click on TraesCS3D02G273600 from the search hits. The Gene tab should open:

This page summarises the gene, including its location, name and equivalents in other databases. At the bottom of the page, a graphic shows a region view with the transcripts. We can see exons shown as blocks with introns as lines linking them together. Coding exons are filled, whereas non-coding exons are empty. We can also see the overlapping and neighbouring genes and other genomic features.

There are different tabs for different types of features, such as genes, transcripts or variants. These appear side-by-side across the blue bar, allowing you to jump back and forth between features of interest. Each tab has its own navigation column down the left hand side of the page, listing all the things you can see for this feature.

Let’s walk through this menu for the gene tab. How can we view the genomic sequence? Click Sequence at the left of the page.

The sequence is shown in FASTA format. The FASTA header contains the genome assembly, chromosome, coordinates and strand (1 or -1) – this gene is on the positive strand.

Exons are highlighted within the genomic sequence, both exons of our gene of interest and any neighbouring or overlapping gene. By default, 600 bases are shown up and downstream of the gene. We can make changes to how this sequence appears with the blue Configure this page button found at the left. This allows us to change the flanking regions, add variants, add line numbering and more. Click on it now.

Once you have selected changes (in this example, Show variants and Line numbering), click the tick at the top right to save and close the menu.

You can download this sequence by clicking on the Download sequence button above the sequence:

This will open a dialogue box that allows you to pick between plain FASTA sequence, or sequence in RTF, which includes all the coloured annotations and can be opened in a word processor. If you want to run a sequence analysis tool, download as FASTA sequence, whereas if you want to analyse the sequence visually, RTF is best. This button is available for all sequence views.
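For readers who prefer to script this step, here is a minimal sketch of a comparable FASTA download. It is an illustration under stated assumptions, not part of the workshop: it targets the public Ensembl REST API `sequence/id` endpoint at https://rest.ensembl.org and uses that endpoint's `expand_5prime` and `expand_3prime` parameters to mimic the 600-base flanks shown in the browser. The helper names `build_sequence_url` and `fetch_fasta` are hypothetical, and it has not been checked here that the result is identical to the browser view.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server (assumed reachable)


def build_sequence_url(stable_id, flank=600):
    """Build a sequence/id URL for a gene with flanking bases on both sides."""
    query = urlencode({
        "type": "genomic",
        "expand_5prime": flank,
        "expand_3prime": flank,
    })
    return f"{REST_SERVER}/sequence/id/{stable_id}?{query}"


def fetch_fasta(stable_id, flank=600):
    """Download the sequence as FASTA text (needs network access)."""
    request = Request(build_sequence_url(stable_id, flank),
                      headers={"Content-Type": "text/x-fasta"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta(...) yourself to download.
url = build_sequence_url("TraesCS3D02G273600")
print(url)
```

Building the URL separately from the download makes it easy to inspect or paste the request into a browser before running it.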

To find out what the protein does, have a look at GO terms from the Gene Ontology consortium. There are three pages of GO terms, representing the three divisions in GO: Biological process (what the protein does), Cellular component (where the protein is) and Molecular function (how it does it). Click on GO: Biological process to see an example of the GO pages.

Here you can see the functions that have been associated with the gene. There are three-letter codes that indicate how the association was made, as well as links to the specific transcript they are linked to.

We also have links out to other databases which have information about our genes and may focus on other topics that we don’t cover, like Gene Expression Atlas or OMIM. Go up the left-hand menu to External references:

Demo: The transcript tab

We’re now going to explore the different transcripts of TraesCS3D02G273600. Click on Show transcript table at the top.

Here we can see a list of all the transcripts of TraesCS3D02G273600 with their identifiers, lengths and biotypes. Click on the ID of the largest transcript, TraesCS3D02G273600.1.

You are now in the Transcript tab for TraesCS3D02G273600.1. We can still see the gene tab so we can easily jump back. The left hand navigation column provides several options for the transcript TraesCS3D02G273600.1 - many of these are similar to the options you see in the gene tab, but not all of them. If you can’t find the thing you’re looking for, often the solution is to switch tabs.

Click on the Exons link. This page is useful for designing RT-PCR primers because you can see the sequences of the different exons and their lengths.

You may want to change the display (for example, to show more flanking sequence, or to show full introns). In order to do so click on Configure this page and change the display options accordingly.

Now click on the cDNA link to see the spliced transcript sequence with the amino acid sequence. This page is useful for mapping between the RNA and protein sequences, particularly genetic variants.

UnTranslated Regions (UTRs) are highlighted in dark yellow, codons are highlighted in light yellow, and exon sequence is shown in black or blue letters to show exon divides. Sequence variants are represented by highlighted nucleotides and clickable IUPAC codes are above the sequence.

Next, follow the General identifiers link at the left. Just like the External References page in the gene tab, this page shows links out to other databases such as RefSeq, UniProtKB, PDBe and others, this time linked to the transcript or protein product, rather than the gene.

If you’re interested in protein domains, you could click on Protein summary to view domains from Pfam, PROSITE, Superfamily, InterPro, and more. These are all plotted against the transcript sequence, with the exons shown in alternating shades of purple at the top of the page. Alternatively, you can go to Domains & features to see a table of the same information.

Demo: The location tab

Click on the Location tab to view the Region in Detail page, which displays the genes and other related data aligned against the reference genome.

Click on the button to view page-specific help. The help pages provide text, labelled images and, in some cases, help videos to describe what you can see on the page and how to interact with it.

The Region in detail page is made up of three images. Let’s look at each one in detail.

1) The first image shows the chromosome:

The region we’re looking at is highlighted on the chromosome.

2) The second image shows a 500kb region around our selected region. This view allows you to scroll back and forth along the chromosome.

Click on the Drag/Select button to change the action of your mouse click. Now you can scroll along the chromosome by clicking and dragging within the image. As you do this you’ll see the image below grey out and two blue buttons appear. Clicking on Update this image would jump the lower image to the region central to the scrollable image. We want to go back to where we started, so we’ll click on Reset scrollable image.

3) The third image is a detailed, configurable view of the region.

Here you can see various tracks, which is what we call a data type that you can plot against the genome. Some tracks, such as the transcripts, can be on the forward or reverse strand. Forward stranded features are shown above the blue contig track that runs across the middle of the image, with reverse stranded features below the contig. Other tracks, such as variants, regulatory features or conserved regions, refer to both strands of the genome, and these are shown by default at the very top or very bottom of the view.

You can use click and drag to either navigate around the region or highlight regions of interest. Click on the Drag/Select option at the top or bottom right to switch mouse action. On Drag, you can click and drag left or right to move along the genome; the page will reload when you release the mouse button. On Select you can drag out a box to highlight or zoom in on a region of interest.

With the tool set to Select, drag out a box around an exon and choose Mark region.

The highlight will remain in place if you zoom in and out or move around the region. This allows you to keep track of regions or features of interest.

We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image.

When you launch the view, you will see the tracks that are currently turned on with their names on the left and an info icon on the right, which you can click on to expand the description of the track. Turn them on or off, or change the track style by clicking on the box next to the name. More details about the different track styles are in this FAQ: http://www.ensembl.org/Help/Faq?id=335.

You can find more tracks to add by either exploring the categories on the left, or using the Find a track option at the top left. Type in a word or phrase to find tracks with it in the track name or description.

Let’s add some tracks to this image. Add:

  • EMS-induced mutation variants
  • Type I Transposons/LINE (Repeats: Repbase)

Now click on the tick in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image.

If a track is not giving you the information you need, you can easily change the way it appears by hovering over the track name and then the cog wheel to open a menu. To make it easier to compare information between tracks, such as spotting overlaps, you can move tracks around by clicking and dragging on the bar to the left of the track name.

Due to hybridisations in wheat’s evolutionary history, it has a hexaploid genome with related homoeologous regions. We can compare these with the Polyploid view. Click on the Polyploid view link in the left-hand menu.

This view also allows us to configure the page, as we could with the main region view, so that we can compare other features between the homoeologous chromosomes.

Exploring the Arabidopsis thaliana CCD7 gene

(a) Find the Arabidopsis thaliana CCD7 gene on Ensembl Plants. On which chromosome and which strand of the genome is this gene located?

(b) Where in the cell is the CCD7 protein located?

(c) What is the source of the assigned gene name?

(d) How many transcripts does it have? How long is its longest transcript? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

(a) Go to the Ensembl Plants homepage (http://plants.ensembl.org/). Select Arabidopsis thaliana from the species list and type CCD7 in the search box. Click Go. Click on CCD7.

The Arabidopsis CCD7 gene is located on chromosome 2 on the forward strand.

(b) Click on GO: cellular component in the side menu.

The protein is located in the chloroplast.

(c) Click on Summary in the side menu.

The gene name is assigned and imported from TAIR (The Arabidopsis Information Resource).

(d) Click on Show transcript table.

There are three transcripts. The longest one is 2005 base pairs and the length of the encoded protein is 622 amino acids.

Click on the Ensembl Transcript ID AT2G44990 in the transcript table.

It has six exons.

Click on Sequence - Exons in the side menu.

The first and last exons are partially untranslated (sequence shown in orange). This can also be seen from the fact that, in the transcript diagrams on the Gene summary and Transcript summary pages, the boxes representing the first and last exon are partially unfilled.
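The transcript and protein lengths from (d) give a quick cross-check on the untranslated sequence. The sketch below assumes that the reported transcript length is the spliced length and that the coding sequence includes a three-base stop codon. Both are assumptions, and `utr_length` is a hypothetical helper.

```python
def utr_length(transcript_bp, protein_aa, stop_codon_in_cds=True):
    """Estimate total UTR length (5' plus 3') of a spliced transcript."""
    # Each amino acid is encoded by three bases; optionally add the stop codon.
    cds_bp = protein_aa * 3 + (3 if stop_codon_in_cds else 0)
    return transcript_bp - cds_bp


total_utr = utr_length(transcript_bp=2005, protein_aa=622)
print(total_utr)  # 2005 - (622 * 3 + 3) = 136
```

A non-zero result is consistent with the first and last exons being only partially coding.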

Exploring a plant gene (Vitis vinifera, grape)

Start in http://plants.ensembl.org/index.html and select the Vitis vinifera genome.

(a) What GO: biological process terms are associated with the MADS4 gene?

(b) Go to the transcript tab for the only transcript, Vv01s0010g03900.t01. How many exons does it have? Which one is the longest? How much of that is coding?

(c) What domains can be found in the protein product of this transcript? How many different domain prediction methods agree with each of these domains?

(a) Go to http://plants.ensembl.org/index.html.

Select Vitis vinifera from the drop down menu All genomes – select a species or click on View full list of all Ensembl Plants species and then choose V. vinifera.

Type MADS4 and click on the gene link VIT_01s0010g03900. Click on GO: Biological process in the side menu.

There are seven terms listed including GO:0006351, transcription, DNA-templated, and GO:0006355, regulation of transcription, DNA-templated.

(b) Click on the transcript named Vv01s0010g03900.t01 (or on the Transcript tab). Click on Exons in the left hand menu.

There are eight exons. Exon 8 is longest with 303 bp, of which 13 are coding.

(c) Click on either Protein Summary or Domains & features in the left hand menu to see graphically or as a table respectively.

A MADS-box domain near the N-terminus is identified by eight domain prediction methods. A K-box domain near the C-terminus is identified by two. Two coiled-coils are identified by one.

Finding a Triticum aestivum gene

(a) Search for Oxygen evolving enhancer protein from the Ensembl Plants homepage and narrow down your search to Triticum aestivum. How many genes are there with this name in wheat? Why do you think this is? What chromosomes are they on?

(b) Go to the gene on chr2B. How many protein coding transcripts does this gene have?

(a) Start at the Ensembl Plants homepage.

Type Oxygen evolving enhancer protein into the search box then click Go. Choose Triticum aestivum from the species drop-down.

There are two genes named TraesCS2D02G248400 and TraesCS2B02G270300. This is because of the hybridisations in wheat’s evolutionary history. You can see that the two genes occur on chromosomes 2B and 2D.
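The two identifiers above appear to encode the chromosome directly after the `TraesCS` prefix. As a rough sketch, assuming that pattern holds for the gene IDs used in this workshop (the regular expression is inferred from these examples, not taken from an official specification), the chromosome can be read off like this:

```python
import re

# Assumed layout: TraesCS + homoeologous group (1-7) + subgenome (A, B or D)
# + two digits + "G" + a gene number.
WHEAT_GENE_ID = re.compile(r"^TraesCS(?P<group>[1-7])(?P<subgenome>[ABD])\d{2}G\d+$")


def chromosome_of(gene_id):
    """Return the chromosome name embedded in a wheat gene ID, or None."""
    match = WHEAT_GENE_ID.match(gene_id)
    if match is None:
        return None
    return match.group("group") + match.group("subgenome")


hits = ["TraesCS2D02G248400", "TraesCS2B02G270300"]
chromosomes = sorted(chromosome_of(gene_id) for gene_id in hits)
print(chromosomes)  # ['2B', '2D']
```

IDs that do not follow the assumed layout return `None` rather than a guess.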

(b) Click on the gene on chromosome 2B to go to the gene tab. If the transcript table is hidden, click on Show transcript table to see it.

There are two protein coding transcripts.

Exploring a defence-related gene in Tomato, Solanum lycopersicum

(a) Search for the tomato gene NCED2 and go to the gene tab.

  • What is the amino acid length of the only transcript of this gene?
  • On which chromosome and which strand of the genome is this gene located?

(b) Look at the gene Description field. What does this tell you about the cellular localisation of the protein product of this gene? Does this match the Gene Ontology (GO): Cellular component terms? Click on GO: Cellular component to check.

(c) Click on Gene expression. Which tissue has the highest expression of this gene according to the Tomato Genome Consortium?

(d) The summary at the top of the page (just above the Show transcript table button) shows us that there are nine paralogues of this gene. Click on the Gene gain/loss tree to look at the expansion of this gene family across all plants.

  • Which species has the largest number of members of this gene family?
  • Do any plants lack any genes in this family?

(e) Go to the transcript tab for this gene by clicking on the transcript ID Solyc08g016720.1.1 from the transcript table. Are there any Oligo probes that would be useful in targeting this gene experimentally?

(a) Go to plants.ensembl.org and type NCED2 into the search box, selecting Solanum lycopersicum from the drop down menu. Click on the first result to go to the gene tab.

Click on the Show transcript table button if the transcript table is hidden. The fourth column lists the protein length: 581 amino acids.

The location is listed at the top of the page: the gene is on chromosome 8, between base pairs 8,729,953 and 8,731,698, on the forward strand.

(b) The gene description for this gene is ‘9-cis-epoxycarotenoid dioxygenase NCED2, chloroplastic’ which suggests the enzyme is localised to the chloroplast.

In the left-hand navigation panel, find the link to GO: Cellular component. We can see three results: chloroplast, plastid and chloroplast stroma, so this matches the gene description.

(c) Click on Gene expression in the left-hand navigation panel.

Darker shades of blue indicate higher expression. Hover your mouse over the heat-map to show a pop-up with the TPM (transcripts per million).

The 2cm fruit in the Tomato Genome Consortium has the highest expression at 103 TPM. You can also click on Filters at the top right and filter to high or medium expression.
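TPM is a length-normalised expression unit. As a minimal sketch of how it is usually computed (the read counts and transcript lengths below are hypothetical and are not taken from the Tomato Genome Consortium data), counts are first divided by transcript length in kilobases and then scaled so that the values in a sample sum to one million:

```python
def tpm(read_counts, lengths_bp):
    """Convert raw read counts to TPM given transcript lengths in base pairs."""
    # Step 1: reads per kilobase of transcript.
    rpk = [count / (length / 1000) for count, length in zip(read_counts, lengths_bp)]
    # Step 2: scale so that all values in the sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [value / scale for value in rpk]


counts = [100, 300, 600]      # hypothetical read counts
lengths = [1000, 1500, 3000]  # hypothetical transcript lengths in bp
values = tpm(counts, lengths)
print([round(v) for v in values])
```

Because every sample sums to the same total, TPM values for one gene can be compared across tissues more directly than raw counts.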

(d) Click on the Gene gain/loss tree. You might find it easier to compare in the radial tree; click the two arrows icon at the top left of the image to toggle to the radial view.

Look for the red lines, indicating the larger number of members and significant expansion. The number of members is listed just before the species name.

Brassica napus (oilseed rape) has the highest number of members in this gene family, nearly double compared to other species in the same genus.

Look for grey lines in the diagram. We can see that Triticum turgidum has no members of this gene family.

(e) Go to the transcript tab for this gene by clicking on the transcript ID Solyc08g016720.1.1 from the transcript table.

Find the Oligo probes link in the left-hand navigation panel. There is a single probe from Affymetrix, the AFFY TomGene, 20363698.

Grape MADS4 region

Go to the Location view for MADS4 in Vitis vinifera.

(a) What is the closest gene to MADS4 in the grape genome? Find the location (base pair coordinates) of the closest gene, and the source of this annotation.

(b) Look at the EST alignments. Where do we see ESTs aligned?

(c) Look at regions of conserved gene order between grapevine and Arabidopsis thaliana by clicking Synteny at the left of the page. How many chromosomes in Arabidopsis show synteny to grapevine chromosome 1 in Ensembl?

(a) Go to http://plants.ensembl.org/index.html

Select Vitis vinifera from the drop down menu All genomes – select a species or click on View full list of all Ensembl Plants species and then choose V. vinifera.

Type MADS4 and hit Go, then click on the location link 1:21368964-21386383.

You should be in the Region in Detail page. Look at the middle image.

The closest gene is VIT_01s0010g03910. Click on it to find the location (Chromosome 1: 21,412,823-21,416,284) and that it is a novel transcript annotated by IGGP.
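To put a number on "closest", the gap between the two sets of coordinates can be computed. This is a sketch only: it assumes the location link from the search results approximates the MADS4 gene boundaries, which may not be exact, and `gap_between` is a hypothetical helper.

```python
def gap_between(region_a, region_b):
    """Return the number of bases separating two (start, end) intervals.

    Overlapping or directly adjacent intervals give 0.
    """
    # Sort so that the interval starting first is always on the left.
    (_start_a, end_a), (start_b, _end_b) = sorted([region_a, region_b])
    return max(0, start_b - end_a - 1)


mads4_region = (21368964, 21386383)  # location link from the search results
neighbour = (21412823, 21416284)     # VIT_01s0010g03910
gap = gap_between(mads4_region, neighbour)
print(gap)  # 26439
```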

(b) Stay in the same view. Click on Configure this page, then select EST alignments on the left and choose EST (grape). Save and close the menu.

Most ESTs align to the exons of the gene.

(c) Click Synteny. The default species aligned is rice; choose Arabidopsis thaliana from the drop-down on the right.

There are 4 Arabidopsis chromosomes that show synteny to Grapevine chromosome 1. These are Arabidopsis chromosomes 1, 2, 3 and 5.

Exploring a wheat region

(a) Go to 2D:378720500-378780600 in wheat.

(b) How many genes are in this region? What strand are the genes on?

(c) What tracks can you see that show gene structure? Where did the different tracks come from?

(d) Export the genomic sequence for this region.

(e) Can you view the genomic alignments of the homoeologous regions?

(a) Go to the Ensembl Plants homepage. Select Search: Triticum aestivum and type 2D:378720500-378780600 in the text box. Click Go.

(b) There are two genes displayed in the Genes track. They are both on the reverse strand.

(c) There are two tracks which have mapping to this gene: Genes and Alternative gene models. Hover over the track names for more information on their source.

(d) Click Export data in the side menu. Leave the default parameters as they are. Click Next>. Click on Text.

Note that the sequence has a header that provides information about the genome assembly, the chromosome, the start and end coordinates and the strand. For example:

>2D dna:chromosome chromosome:IWGSC:2D:378720500:378780600:1

(e) Click on Polyploid view in the left hand menu to view the homoeologous regions.
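The FASTA header from step (d) can be unpacked in a script. The sketch below assumes the last whitespace-separated field always has the colon-separated layout seen in that example (coordinate system, assembly, chromosome, start, end, strand). `Location` and `parse_header` are hypothetical names.

```python
from typing import NamedTuple


class Location(NamedTuple):
    assembly: str
    chromosome: str
    start: int
    end: int
    strand: int


def parse_header(header):
    """Parse the location field of an exported FASTA header line."""
    # The last whitespace-separated field holds the colon-separated location.
    location_field = header.lstrip(">").split()[-1]
    _coord_system, assembly, chromosome, start, end, strand = location_field.split(":")
    return Location(assembly, chromosome, int(start), int(end), int(strand))


loc = parse_header(">2D dna:chromosome chromosome:IWGSC:2D:378720500:378780600:1")
print(loc)
print(loc.end - loc.start + 1)  # length of the exported region: 60101
```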

Exploring a genomic region in rice

(a) Go to the region 1:405000-453000 in Oryza sativa Japonica.

(b) Turn on the AGILENT:G2519F-015241 microarray track. Are there any oligo probes that map to this region?

(c) Highlight the region around any reverse strand probes you can see. Do they map to any transcripts?

(a) Go to the Ensembl Plants homepage.

Select Search: Oryza sativa Japonica and type 1:405000-453000. Click Go.

(b) Click on Configure this page to open the menu. You can find the AGILENT:G2519F-015241 track under Oligo probes in the left hand menu, or by using the Find a track box at the top right. Turn on the track then save and close the menu.

As the AGILENT:G2519F-015241 track is stranded, it appears at the top and bottom of the view, in green.

There are five probes mapped to this region on the positive strand and one probe on the reverse strand.

(c) Drag a box around the reverse strand probe then click on Mark region to highlight.

The highlighted region maps to two transcripts: Os01t0107900-02 and Os01t0107900-01.

Variation

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for OS01G0775500 in rice. Search for OS01G0775500 and go to the Sequence view.

If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.

To view all the sequence variants in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in. For display purposes, the table above has already been filtered to only show missense variants.

You can also filter by the different pathogenicity scores and MAF, or click on Filter other columns for filtering by other columns such as Evidence or Class.

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar.

Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, presence of a phenotype or by individual genome they were isolated from. You can also turn on genotyping chips.

Let’s have a look at a specific variant. If we zoomed in we could see the variant rs18335701 in this region, however it’s easier to find if we put rs18335701 into the search box. Click through to open the Variation tab for Oryza sativa Japonica.

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link on the left.

This page illustrates the genes the variant falls within and the consequences on those genes, including pathogenicity predictors. It also shows data from GTEx on genes that have increased/decreased expression in individuals with this variant, in different tissues. Finally, regulatory features and motifs that the variant falls within are shown.

Let’s look at population genetics. Click on Population genetics in the left-hand menu.

The population allele frequencies are shown by study. Where genotype frequencies are available, these are shown in the tables.

We can see which strains these genotypes were observed in by going to Sample Genotypes. Click on Show for the Duitama et al. 2015 population.
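Allele frequencies like those on the Population genetics page can be derived from a list of sample genotypes. The following is a minimal sketch with hypothetical genotype strings. It assumes each genotype is written as alleles separated by `|` or `/`, and it does not reproduce how Ensembl itself calculates the values.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Count alleles across genotype strings such as 'G|G' or 'G/A'."""
    counts = Counter()
    for genotype in genotypes:
        # Treat phased ("|") and unphased ("/") separators the same way.
        counts.update(genotype.replace("/", "|").split("|"))
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


samples = ["G|G", "G|A", "A|A", "G|G"]  # hypothetical genotypes
freqs = allele_frequencies(samples)
print(freqs)  # {'G': 0.625, 'A': 0.375}
```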

Exploring a SNP in Arabidopsis

The Arabidopsis ATCDSP32 protein is a chloroplastic drought-induced stress protein proposed to participate in a process called cell redox homeostasis.

(a) How many variants have been identified in the gene that can cause a change in the protein sequence?

(b) What is the ID of the variant that changes residue 60 from Alanine to Threonine? What is the location of this SNP in the Arabidopsis genome? What are its possible alleles?

(c) Download the flanking sequence of this SNP in RTF (Rich Text Format). Can you change how much flanking sequence is displayed on the browser?

(d) Does this SNP cause a change at the amino acid level for other genes or transcripts?

(a) Search for ATCDSP32 on the Arabidopsis page in Ensembl Plants. On the left hand side menu of the Gene tab, click on Variant table.

Click on Consequences: All then select only missense variant. This button also indicates that there are 18 of these.

(b) Follow down the AA coord column to find a variant at residue 60. The ID of this variant is ENSVATH05153232, located at position 28549171 on chromosome 1. The two possible alleles at this locus are C and T.
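The relationship between a residue number, its codon and the reported alleles can be worked through in a few lines. This sketch relies on the standard genetic code, in which alanine codons start with G (GCN) and threonine codons start with A (ACN). The inference about strand in the closing note is an assumption drawn from that, not something stated in the exercise, and the helper names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def codon_cds_positions(residue):
    """Return the three 1-based CDS positions that encode a given residue."""
    first = (residue - 1) * 3 + 1
    return (first, first + 1, first + 2)


def complement(base):
    """Return the complementary base on the opposite strand."""
    return COMPLEMENT[base]


print(codon_cds_positions(60))  # (178, 179, 180)
# An Ala -> Thr change is a G -> A change at the first codon position
# on the coding strand; its complement on the other strand is C -> T.
print(complement("G"), complement("A"))  # C T
```

Since the alleles are listed as C and T, this would be consistent with the gene lying on the reverse strand, with the alleles reported relative to the forward strand.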

(c) Click on the link ENSVATH05153232. Then click on Flanking sequence in the left hand side menu. Now click on Download sequence and select Rich Text Format (RTF). If you want to change how much flanking sequence is displayed on the browser, go back to the Flanking sequence page, click on the Configuration button and change the length of the sequence. The default setting is 400 bp.

(d) Click on ‘Genes and regulation’ to find out that this SNP does not cause a change at the amino acid level for any other genes or transcripts in that genome.

Variation data in the tomato (S. lycopersicum) genome

(a) Find the Solyc02g084570.3 gene in tomato and go to its Location tab. Can you see the variation track?

(b) Zoom in around the last exon of this gene. What are the different types of variants seen in that region? What are the locations of any splice region variants mapped in the region?

(a) Search for Solyc02g084570.3 and click on the Location link in the results page. The variation track is shown at the bottom of the region.

(b) Zoom in around the last exon of this gene by drawing a box in the respective region. Please note the gene is on the reverse strand, so the last exon will be on the left hand side of that image.

The variation legend is shown at the bottom of the page, telling you what the colours mean.

The types of variants seen in that region are 3’ UTR variants, missense variants, synonymous variants and splice region variants.

Splice region variants are shown in orange. Click on the variants to get additional information on that variant including location.

The variants are found at 2:48285642 and 2:48285640-48285641.

Investigating a variant in wheat

(a) Search for the variant BA00369602 on plants.ensembl.org. Is this variant known by any other names?

(b) What gene is affected by this variant? What is the amino acid change?

(c) Which cultivars have the alternative base at this locus?

(a) Start at plants.ensembl.org and put BA00369602 into the search box. Click on BA00369602 in the search results to get to the variation homepage.

Under synonyms, you can see that the variant is also known as AX-94448191 in CerealsDB.

(b) Click on Genes and regulation.

The variant is a missense variant on TraesCS2D02G303800, where it gives a G/D change at position 406.

(c) Click on Sample genotypes. Scroll down the table to see any cultivars with the A allele in the genotype column.

All of the cultivars listed have the genotype G|G.

VEP

We have identified three variants on wheat chromosome 4B: C -> T at 240206468, C -> G at 240199078 and C -> T at 240212229.

We will use the Ensembl VEP to determine:

  • Have my variants already been annotated in Ensembl?
  • What genes are affected by my variants?
  • Do any of my variants affect gene regulation?

Click on Tools in the top green bar from any Ensembl Plants page, then Variant Effect Predictor to open the input form:

Click on Add/remove species and search for Triticum aestivum to choose it.

The data is in VCF format:
chromosome coordinate id reference alternative

Put the following into the Paste data box:
4B 240206468 var1 C T
4B 240199078 var2 C G
4B 240212229 var3 C T

The VEP will automatically detect that the data is in VCF.
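If you would rather upload a file than paste the lines, the three variants can be written out as a minimal VCF. This is a sketch under assumptions: it adds a `##fileformat` line and the standard eight-column header, and fills the unknown QUAL, FILTER and INFO columns with the VCF missing value `.`. The function name `to_vcf` is hypothetical.

```python
PASTED = """\
4B 240206468 var1 C T
4B 240199078 var2 C G
4B 240212229 var3 C T
"""

VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def to_vcf(text):
    """Turn whitespace-separated 'chrom pos id ref alt' rows into VCF text."""
    lines = ["##fileformat=VCFv4.2", "\t".join(VCF_COLUMNS)]
    for row in text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        chrom, pos, var_id, ref, alt = row.split()
        # QUAL, FILTER and INFO are unknown here, so use the missing value ".".
        lines.append("\t".join([chrom, pos, var_id, ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


vcf_text = to_vcf(PASTED)
print(vcf_text)
```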

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotations, Predictions, Filtering options and Advanced options. Let’s open all the menus and take a look.

Hover over the options to see definitions.

When you’ve selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done; you do not need to refresh the page. You can edit or discard your job at this time. If you have submitted multiple jobs, they will all appear here.

Click View results once your job is done.

In your results you will see a graphical summary of your data, as well as a table of your results.

The results table is enormous and detailed, so we’re going to go through it section by section. The first column is Uploaded variant. If your input data contains IDs, like ours does, the ID is listed here. If your input data is only loci, this column will contain the locus and alleles of the variant. You’ll notice that the variants are not necessarily in the order they were in your input. You’ll also see that there are multiple lines in the table for each variant, with each line representing one transcript or other feature the variant affects.

You can mouse over any column name to get a definition of what is shown.

The next few columns give the information about the feature the variant affects, including the consequence. Where the feature is a transcript, you will see the gene symbol and stable ID and the transcript stable ID and biotype. The IDs are links to take you to the gene or transcript homepage.

This is followed by details on the effects on transcripts, including the position of the variant in terms of the exon number, cDNA, CDS and protein, the amino acid and codon change and pathogenicity scores. Where the variant is known, the ID of the existing variant is listed, with a link out to the variant homepage. The pathogenicity scores are shown as numbers with coloured highlights to indicate the prediction, and you can mouse-over the scores to get the prediction in words.

Above the table is the Filter option, which allows you to filter by any column in the table. You can select a column from the drop-down, then a logic option from the next drop-down, then type in your filter to the following box. We’ll try a filter of Consequence, followed by is then missense_variant, which will give us only variants that change the amino acid sequence of the protein. You’ll notice that as you type missense_variant, the VEP will make suggestions for an autocomplete.
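The same kind of filter can be applied offline to a downloaded results table. The sketch below is illustrative only: the rows, gene names and transcript names are invented, and the column names are assumed to follow a VEP-style tab-delimited layout. A row may carry several comma-separated consequence terms, so this version matches on membership, which may differ slightly from the web form's "is" option.

```python
import csv
import io

# Hypothetical rows in a VEP-style tab-delimited layout; gene and
# transcript IDs are invented for illustration.
RESULTS = (
    "#Uploaded_variation\tLocation\tGene\tFeature\tConsequence\n"
    "var1\t4B:240206468\tgeneA\tgeneA.1\tmissense_variant\n"
    "var1\t4B:240206468\tgeneA\tgeneA.2\tintron_variant\n"
    "var2\t4B:240199078\tgeneB\tgeneB.1\tmissense_variant,splice_region_variant\n"
    "var3\t4B:240212229\tgeneB\tgeneB.1\tsynonymous_variant\n"
)


def filter_by_consequence(text, term):
    """Keep rows whose Consequence column contains the given term."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Split on commas so that combined consequences are matched too.
    return [row for row in reader if term in row["Consequence"].split(",")]


missense = filter_by_consequence(RESULTS, "missense_variant")
print([(row["#Uploaded_variation"], row["Feature"]) for row in missense])
```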

You can export your VEP results in various formats, including VCF. When you export as VCF, you’ll get all the VEP annotation listed under CSQ in the INFO column. After filtering your data, you’ll see that you have the option to export only the filtered data. You can also drop all the genes you’ve found into the Gene BioMart, or all the known variants into the Variation BioMart to export further information about them.
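To give a feel for the CSQ annotation in an exported VCF, here is a minimal parsing sketch. The format string and INFO value are hypothetical. In a real export the field order is declared in the `##INFO=<ID=CSQ,...>` header line, so the format should be read from there rather than hard-coded. `parse_csq` is a hypothetical helper.

```python
def parse_csq(info, csq_format):
    """Split the CSQ entry of a VCF INFO field into one dict per annotated feature."""
    fields = csq_format.split("|")
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            # Features are comma-separated; fields within a feature use "|".
            return [dict(zip(fields, item.split("|")))
                    for item in entry[len("CSQ="):].split(",")]
    return []


# Hypothetical format string and INFO value, invented for illustration.
CSQ_FORMAT = "Allele|Consequence|IMPACT|Gene|Feature"
INFO = "CSQ=T|missense_variant|MODERATE|geneA|geneA.1,T|intron_variant|MODIFIER|geneA|geneA.2"

annotations = parse_csq(INFO, CSQ_FORMAT)
print([a["Consequence"] for a in annotations])  # ['missense_variant', 'intron_variant']
```

Each dictionary corresponds to one line of the web results table, that is, one transcript or other feature affected by the variant.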

VEP analysis of rice variants

You’ll find a VCF file here. This is a small subset of the outcome of an Oryza sativa Japonica Group whole genome sequencing and variant calling experiment.

Analyse the variants in this file with the VEP and determine:

(a) How many genes and transcripts are affected by variants in this file?

(b) Do these variants result in a change in the proteins encoded by any of the Ensembl genes? What genes? What is the amino acid change?

Go to plants.ensembl.org and click on the link Tools at the top of the page. Click on Variant Effect Predictor and select Species: Oryza sativa Japonica Group.

Either click on Choose file and select the file to upload it, or paste the URL for the file into the Or provide file URL: box.

Click Run at the bottom of the page.

When your job is listed as Done, click View Results.

(a) The number of affected genes and transcripts is shown in the summary table at the top.

Eight genes and eight transcripts are affected by these variants.

(b) Use the filters to view only missense variants. The filters are found above the detailed results table in the middle.

Select Consequence and is from the drop-down menus. Then type missense_variant into the box; this will autocomplete. Click Add.

One variant is a missense variant. It causes an L/R change in the gene OS09G0103500.
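
The counts in answer (a) correspond to the number of distinct gene and transcript IDs among the result lines. A minimal Python sketch of that idea follows, assuming the results have been read into rows with Gene, Feature and Feature_type keys and that a missing value is written as "-". The rows are hypothetical and are not the data from this exercise.

```python
def count_affected(rows):
    """Count distinct genes and transcripts across result rows."""
    genes = set()
    transcripts = set()
    for row in rows:
        if row["Gene"] not in ("", "-"):
            genes.add(row["Gene"])
        # Only count features that are transcripts, not other feature types.
        if row["Feature_type"] == "Transcript":
            transcripts.add(row["Feature"])
    return len(genes), len(transcripts)


# Invented rows: one variant can appear on several lines, one per feature.
rows = [
    {"Gene": "geneA", "Feature": "geneA.1", "Feature_type": "Transcript"},
    {"Gene": "geneA", "Feature": "geneA.2", "Feature_type": "Transcript"},
    {"Gene": "geneB", "Feature": "geneB.1", "Feature_type": "Transcript"},
    {"Gene": "geneA", "Feature": "geneA.1", "Feature_type": "Transcript"},
    {"Gene": "-", "Feature": "-", "Feature_type": "-"},
]

n_genes, n_transcripts = count_affected(rows)
print(n_genes, "genes,", n_transcripts, "transcripts")
```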

VEP analysis of wheat variants

You have done a whole genome sequencing and variant calling experiment for Triticum aestivum. You have a VCF file with a small subset of variants from this experiment. Analyse the variants in this file with the VEP and determine:

(a) How many variants were analysed? How many are novel?

(b) How many genes and transcripts are affected by variants in this file?

(c) Do any of the variants have different consequences for different transcripts?

(d) Filter the table to find variants with HIGH impact. How many variants have high impact? Why do you think missense variants are not classified as high impact?

(e) Can you export all the results to a VCF file? Compare it to the input VCF file to see what information the VEP adds.

Go to plants.ensembl.org and click on the link Tools at the top of the page. Click on Variant Effect Predictor and select Species: Triticum aestivum. Either click on Choose file and select the file to upload it, or paste the URL for the file into the Or provide file URL: box.

Click Run at the bottom of the page. When your job is listed as Done, click View Results.

(a) Twenty variants were analysed, of which one is novel.

(b) Only one gene is affected by variants in this file. It has two transcripts which are both affected.

(c) Yes. The novel variant results in a stop_lost in TraesCS3A02G301400.1 and is a downstream_gene_variant for TraesCS3A02G301400.2.

(d) Use the filters to view only variants with HIGH impact. The filters are found above the detailed results table in the middle. Select Impact and is from the drop-down menus. Then type HIGH into the box; this will autocomplete. Click Add.

There are two variants with high impact and both are stop-altering. Missense variants are not classified as high impact because they do not always have a significant effect on protein function; usually the protein is still produced. In contrast, stop-altering variants change the protein length, and are therefore likely to affect protein function.

(e) At the top right of the table there is an option to download data. Click on VCF for the All option. Open the VCF file you have downloaded in a text editor. You can see that the VEP adds annotation in the INFO column of the VCF file.
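
The reasoning in answer (d) can be pictured as a lookup from consequence term to impact rating. The Python sketch below uses a small, hand-written table covering only a few consequence terms; it is an illustration of the idea rather than the VEP's own code, and the helper name highest_impact is hypothetical.

```python
# Partial, hand-written table: a few consequence terms and their impact.
IMPACT = {
    "stop_gained": "HIGH",
    "stop_lost": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "upstream_gene_variant": "MODIFIER",
    "downstream_gene_variant": "MODIFIER",
}

# Impact ratings ordered from least to most severe.
RANK = ["MODIFIER", "LOW", "MODERATE", "HIGH"]


def highest_impact(consequence):
    """Return the most severe impact among comma-separated terms."""
    terms = consequence.split(",")
    return max((IMPACT[term] for term in terms), key=RANK.index)


for example in ("stop_lost", "missense_variant", "downstream_gene_variant"):
    print(example, highest_impact(example))
```

A term that is not in the table raises a KeyError, which makes gaps in this partial table obvious instead of silently assigning a rating.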

Comparative Genomics

Let’s look at the homologues of wheat TraesCS3D02G007500. Search for the gene and go to the Gene tab.

Click on Plant compara: Gene tree, which will display the current gene in the context of a phylogenetic tree used to determine orthologues and paralogues.

Funnels indicate collapsed nodes. We can expand them by clicking on the node and selecting Expand this sub-tree from the pop-up menu.

We can also see the protein alignment of the sub-tree by clicking on Wasabi viewer, which will open a pop-up:

You can download the tree in a variety of formats. Click on the download icon in the bar at the top of the image to get a pop-up where you can choose your format.

We can look at homologues in the Orthologues and Paralogues pages, which can be accessed from the left-hand menu. If there are no orthologues or paralogues, then the name will be greyed out. Click on Plant compara: Orthologues to see the orthologues available in plants.

Choose to see only Eudicotyledons orthologues by selecting the box. The table below will now only show details of Eudicotyledons orthologues. Let’s look at Brassica oleracea.

Here we can see there is a many-to-many relationship between the wheat and Brassica oleracea orthologues. Links from the orthologue allow you to go to alignments of the orthologous proteins and cDNAs. Click on View Sequence Alignments then View Protein Alignment for the first Brassica oleracea orthologue.

The paralogue page and homoeologue (for wheat) pages are structured in the same way as the orthologue page.

Let’s look at some of the comparative genomics views in the Location tab. Go to the region 1:8000-18000 in Oryza sativa Japonica.

We can look at individual species comparative genomics tracks in this view by clicking on Configure this page.

Select BLASTz/LASTz alignments from the left-hand menu to choose alignments between closely related species. Turn on the alignments for all Oryza species:

The alignment is greatest between closely related species. We can see that many rice species (such as Oryza barthii) are fully aligned across the region, but other species have a region around 1:10700-12500 where a different chunk is aligned (such as Oryza punctata).

We can also look at the alignment between species or groups of species as text. Click on Alignments (text) in the left hand menu.

Click on Select an alignment to open the alignment menu.

Select Oryza punctata from the alignments list then click Go.

In this case there are eight aligned blocks of different lengths, some of which correspond to the region we saw unaligned in the image. Click on Block 1.

You will see a list of the regions aligned, followed by the sequence alignment. Click on Display full alignment. Exons are shown in red.

To compare with both contigs visually, go to Region comparison.

To add species to this view, click on the blue Select species or regions button. Choose Oryza punctata again then close the menu.

We can view large scale syntenic regions from our chromosome of interest. Click on Synteny in the left hand menu.

Whole genome alignments

(a) Find the HORVU5Hr1G033980 gene for barley and go to the Region in detail page.

(b) Turn on the LASTZ-net alignment tracks for Aegilops tauschii, Brachypodium distachyon, Oryza sativa Japonica and Triticum turgidum. Are there any regions where you can see gaps in some of the species alignments?

(c) Go to the Region comparison view and compare to Brachypodium distachyon. What occurs at this gap in the alignment?

(d) Is there evidence of any duplicated regions in Brachypodium distachyon compared to barley?

(a) Go to the Ensembl Plants homepage (plants.ensembl.org). Select Search: Hordeum vulgare and type HORVU5Hr1G033980 in the search box. Click Go. Click on chr5H:229682099-229689352.

You may want to turn off all the tracks that you added to the display in the previous exercises, as follows: click Configure this page in the side menu, then click the green Reset configuration button. SAVE and close.

(b) Click Configure this page in the side menu.

Click on Comparative genomics in the menu on the left. Select Aegilops tauschii - LASTz net, Brachypodium distachyon - LASTz net, Oryza sativa Japonica - LASTz net and Triticum turgidum - LASTz net. SAVE and close.

There is alignment across most of the region, with some gaps occurring in all four species.

(c) Click on Region comparison in the left-hand menu. Go to the Select species or regions button and add Brachypodium distachyon. Save and close the menu.

The gap in the alignment in barley translates to an even larger gap in Brachypodium distachyon.

(d) On the right-hand side of the alignment there is a region with three pink bars in Hordeum vulgare, indicating that this region maps to three different locations in Brachypodium distachyon, which might indicate a duplication event. This region is around chr5H:229686000-229688000.

To confirm this, go to the Alignments (text) view and select Brachypodium distachyon again.

The blocks show that the region around chr5H:229686000-229688000 matches to neighbouring regions in Brachypodium distachyon, suggesting a duplication.
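
The region strings used in this exercise follow the name:start-end pattern, with both ends included in the region. As a small aid for checking how the candidate duplicated region sits within the gene region, the Python sketch below parses two such strings and counts the bases they share. The function names are hypothetical helpers, not part of any Ensembl tool.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end) with integer ends."""
    name, span = region.rsplit(":", 1)
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    if start > end:
        raise ValueError("start must not exceed end: " + region)
    return name, start, end


def length(region):
    """Length in bases of a 1-based, inclusive region."""
    _, start, end = region
    return end - start + 1


def overlap(a, b):
    """Number of bases shared by two 1-based, inclusive regions."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


gene = parse_region("chr5H:229682099-229689352")
candidate = parse_region("chr5H:229686000-229688000")

print(length(gene))              # 7254
print(overlap(gene, candidate))  # 2001
```

Because the whole candidate region is counted as shared, it lies entirely within the region displayed for the gene.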

FUM1 orthologues and gene trees

The fumarase gene (FUM1) in Arabidopsis encodes a protein with mitochondrial targeting information: http://www.uniprot.org/uniprot/P93033.

(a) How many orthologues have been identified for this gene in Ensembl Plants?

(b) Which orthologue has the highest sequence similarity? Look at the Query %ID and Target %ID.

(a) Go to plants.ensembl.org, choose Arabidopsis and search for FUM1. Click on the Gene ID link. Now click on Orthologues under Plant Compara at the left side of the page to see all 144 orthologous genes.

(b) Click on the triangles in the column headers to sort by identity. The orthologue with the highest sequence similarity is from Arabidopsis halleri.
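
The two identity columns are measured relative to each of the two sequences in turn, so they can differ when one protein is much longer than the other. One possible way to rank orthologues using both columns, sketched below in Python, is to sort by the lower of the two values so that a partial match does not rank first. This ranking rule is a design choice for illustration, and the rows, species names and percentages are invented rather than taken from the FUM1 results.

```python
# Hypothetical orthologue rows; the percentages are invented.
orthologues = [
    {"species": "species_a", "target_pct_id": 98.0, "query_pct_id": 97.5},
    {"species": "species_b", "target_pct_id": 99.0, "query_pct_id": 60.0},
    {"species": "species_c", "target_pct_id": 85.0, "query_pct_id": 86.0},
]


def rank_by_identity(rows):
    """Sort rows from most to least similar, using the lower %ID of each."""
    return sorted(
        rows,
        key=lambda row: min(row["target_pct_id"], row["query_pct_id"]),
        reverse=True,
    )


ranked = rank_by_identity(orthologues)
best = ranked[0]
print(best["species"], best["target_pct_id"], best["query_pct_id"])
```

Sorting on a single column, as the triangles in the table header do, would put species_b first for Target %ID even though its Query %ID is low.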

Finding orthologous genes for a root transporter in rice

Search Ensembl Plants for the gene LOW SILICON RICE 1 (Lsi1) in Rice (Oryza sativa Japonica). This gene is known to code for an aquaporin transporter that facilitates the uptake of silicon and arsenic through the roots. Silicon concentration is highest in grass species, and is associated with defence.

(a) From the gene tab, click on the Plant Compara > Orthologues page. Which plant group has the highest number of 1-to-1 orthologues? Is it the same group that has the highest number of 1-to-many orthologues?

(b) Reduce the orthologues table to look only at wheat (Triticum aestivum) orthologues. Why are there three results for a 1-to-1 orthologue?

(c) Click on the Compare regions link for chromosome 6B region in wheat to go to the Location tab, Region comparison page.

Scroll to the bottom image. How do the gene models compare between the species? Do they have the same number of exons?

(d) Click back to the Gene tab and click on the Gene gain/loss tree page. Which species has the highest number of members of this gene family? Is it a grass? Can you change the view to see a radial tree?

Go to plants.ensembl.org. Look for the main search box highlighted in green. Select Oryza sativa Japonica Group from the drop down box and type in LOW SILICON RICE 1. Click Go and click the first link to go to the gene page.

(a) Find and click the link for the Plant compara > Orthologues page.

The Liliopsida group has 24 1-to-1 orthologues and is the only group with 1-to-1 orthologues. Liliopsida is synonymous with the monocotyledons, so this is the group that contains the grasses. The Eudicotyledons group has the highest number of 1-to-many orthologues, indicating that this gene has been duplicated in the eudicots.

(b) Use the search box at the top right of the Selected orthologues table and start to type in Triticum aestivum. The table should filter automatically.

There are three results, one for each component (A, B and D). Note that these are considered 1-to-1 orthologues, rather than 1-to-many. This is because these genes arose in wheat by hybridisation (allopolyploidy), rather than duplication (autopolyploidy).

(c) Click on Compare regions (found in the 3rd column below the gene identifier) from the 2nd result for component 6B. This takes us to the Location tab. Scroll down to the bottom of the page.

Both genes have five exons and the same structure. This looks unusual because the gene in rice is on the forward strand, while the gene in wheat is on the reverse strand. This is reflected in the crossing green links between the pink alignment blocks.

(d) Click on the Gene: LOW SILICON RICE 1 tab at the top of the page and click on the Gene gain/loss tree link.

Significant expansions are shown with red branches, and the number of genes in the family is shown in the count next to the image and species name. We can see that Brassica napus has 22 members in this group.

We can change the tree to radial view by clicking on the icon with two arrows at the top left of the image.