
Ensembl Plants Genome Browser Workshop – International Rice Research Institute (IRRI)

Course Details

Lead Trainer
Louisse Paola Mirabueno
Event Date
2025-12-09
Location
  Laguna, Philippines
Description
Work with the Ensembl Outreach team to get to grips with the Ensembl Plants browser, accessing gene, regulation and comparative genomics data.
Survey
 Ensembl Plants Genome Browser Workshop – International Rice Research Institute (IRRI) Feedback Survey

Demos and exercises

Species and Genomes

The front page of Ensembl Plants is found at plants.ensembl.org. It contains lots of information and links to help you navigate Ensembl Plants:

At the top left you can see the current release number and what has come out in this release.

Click on View full list of all species.

Click on the common name of your species of interest to go to the species homepage. We’ll click on Arabidopsis thaliana.

Here you can see links to example pages and to download flatfiles. To find out more about the genome assembly and genebuild, click on More information and statistics.

Here you’ll find a detailed description of how the genome was produced and links to the original source. You will also see details of how the genes were annotated.

Oryza sativa Japonica (rice) gene counts

Find the species Oryza sativa Japonica in Ensembl Plants. How many coding and non-coding genes does it have?

Select Oryza sativa Japonica from the homepage to go to its species information page. Click on More information and statistics.

Oryza sativa Japonica has 37,960 coding and 1,011 non-coding genes.

Oryza sativa (Rice) cultivars

  1. How many Oryza sativa genomes are available in Ensembl Plants?

  2. How many cultivars are available for the Oryza sativa Japonica group?

  3. What is the GCA ID of Oryza sativa (Geng/Japonica-trop2 var. Ketan Nangka)?

  1. Go to Ensembl Plants and click on View full list of all species on homepage. Enter Oryza sativa in the filter in the top right-hand corner of the table.

    17 O. sativa genomes are available in Ensembl Plants.

  2. Go back to the species list and search for Oryza sativa Japonica using the table’s filter. Click on Oryza sativa Japonica Group to open the species information page. Under the Genome assembly section, look for the number of cultivars.

    15 additional cultivars are available.

  3. Click on View list of cultivars. Look for the Ketan Nangka cultivar.

    The GCA ID is GCA_009831275.1.

Exploring genomic regions

Demo: Exploring genomic regions in Ensembl Plants

Start at the Ensembl Plants front page. You can search for a region by typing it into a search box, but you have to specify the species.

To bypass the text search, enter your region coordinates in the correct format: chromosome, colon, start coordinate, dash, end coordinate, with no spaces (for example, 1D:41289600-41345600). Choose Triticum aestivum from the species drop-down, then type (or copy and paste) these coordinates into the search box.
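If you prepare coordinates in a script before pasting them into the search box, a quick format check can catch stray spaces or swapped coordinates. The following is a minimal sketch, not part of Ensembl: the `Region` type and `parse_region` function are hypothetical names, and the check is deliberately strict about the format described above.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


# chromosome, colon, start coordinate, dash, end coordinate, with no spaces
_REGION_RE = re.compile(r"^([^\s:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Parse 'chromosome:start-end' into its parts, or raise ValueError."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not in chromosome:start-end format: {text!r}")
    chromosome = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start coordinate is after end coordinate")
    return Region(chromosome, start, end)


region = parse_region("1D:41289600-41345600")
# Coordinates are inclusive, so the length is end - start + 1.
region_length = region.end - region.start + 1
print(region, region_length)
```

A string containing a space, such as `1D: 41289600-41345600`, is rejected by this sketch with a `ValueError`.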

Press Enter or click Go to jump directly to the Region in detail page.

Click on the button to view page-specific help. The help pages provide text, labelled images and, in some cases, help videos to describe what you can see on the page and how to interact with it.

The Region in detail page is made up of three images. Let’s look at each one in detail.

  1. The first image shows the chromosome:

The region we’re looking at is highlighted on the chromosome. You can jump to a different region by dragging out a box in this image. Drag out a box on the chromosome and a pop-up menu will appear.

If you wanted to move to the region, you could click on Jump to region (### bp). If you wanted to highlight it, click on Mark region (###bp). For now, we’ll close the pop-up by clicking on the X in the corner.

  2. The second image shows a 1Mb region around our selected region. This is always 1Mb in human, but the fixed size of this view varies between species. This view allows you to scroll back and forth along the chromosome.

You can also drag out and jump to or mark a region.

Click on the X to close the pop-up menu.

Click on the Drag/Select button to change the action of your mouse click. Now you can scroll along the chromosome by clicking and dragging within the image. As you do this you’ll see the image below grey out and two blue buttons appear. Clicking on Update this image would jump the lower image to the region central to the scrollable image. We want to go back to where we started, so we’ll click on Reset scrollable image.

  3. The third image is a detailed, configurable view of the region.

Here you can see various tracks, which is what we call a data type that you can plot against the genome. Some tracks, such as the transcripts, can be on the forward or reverse strand. Forward stranded features are shown above the blue contig track that runs across the middle of the image, with reverse stranded features below the contig. Other tracks, such as variants, regulatory features or conserved regions, refer to both strands of the genome, and these are shown by default at the very top or very bottom of the view.

You can use click and drag to either navigate around the region or highlight regions of interest. Click on the Drag/Select option at the top or bottom right to switch mouse action. On Drag, you can click and drag left or right to move along the genome; the page will reload when you release the mouse button. On Select, you can drag out a box to highlight or zoom in on a region of interest.

With the tool set to Select, drag out a box around an exon and choose Mark region.

The highlight will remain in place if you zoom in and out or move around the region. This allows you to keep track of regions or features of interest.

We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image.

There are thousands of possible tracks that you can add. When you launch the view, you will see all the tracks that are currently turned on with their names on the left and an info icon on the right, which you can click on to expand the description of the track. Turn them on or off, or change the track style by clicking on the box next to the name. More details about the different track styles are in this FAQ.

You can find more tracks to add by either exploring the categories on the left, or using the Find a track option at the top left. Type in a word or phrase to find tracks with it in the track name or description.

Let’s add some tracks to this image. Add:

  • EMS-induced mutation variants
  • Type I Transposons/LINE (Repeats: Repbase)

Now click on the tick in the top left hand to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image.

If a track is not giving you the information you need, you can easily change the way it appears by hovering over the track name then clicking the cog wheel to open a menu. To make it easier to compare information between tracks, such as spotting overlaps, you can move tracks around by clicking and dragging on the bar to the left of the track name.

Now that you’ve got the view how you want it, you might like to show something you’ve found to a colleague or collaborator. Click on the Share this page button to generate a link. Email the link to someone else, so that they can see the same view as you, including all the tracks you’ve added. These links contain the Ensembl release number, so if a new release or even assembly comes out, your link will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.

Due to hybridisations in wheat’s evolutionary history, it has a hexaploid genome with related homoeologous regions. We can compare these with the Polyploid view. First, let’s zoom in on the gene TraesCS1D02G061000 by dragging out a box around it and clicking on Jump to region. Now click on the Polyploid view link in the left-hand menu.

This view also allows us to configure the page, as we could with the main region view, so that we can compare other features between the homoeologous chromosomes.

Exploring a wheat region

  1. Go to 2D:378720500-378780600 in Triticum aestivum (wheat).

  2. How many genes are in this region? What strand are the genes on? What are the gene IDs for these genes?

  3. What tracks can you see that show gene structure? Where did the different tracks come from?

  4. Export the genomic sequence for this region.

  5. Can you view the genomic alignments of the homoeologous regions? What are the different formats you can export the image as?

  1. Go to the Ensembl Plants homepage. Select Search: Triticum aestivum and type 2D:378720500-378780600 in the text box. Click Go.

  2. There are two genes displayed in the Genes track. They are both located on the reverse strand. The IDs are

  3. Two tracks show gene structure in this region: Genes and Alternative gene models. Click the track names for more information on their source.

  4. Click Export data in the left-hand menu. Leave the default parameters as they are. Click Next>. Click on Text. Note that the sequence has a header that provides information about the genome assembly, the chromosome, the start and end coordinates and the strand. For example:
    >2D dna:chromosome chromosome:IWGSC:2D:378720500:378780600:1

  5. Click on Polyploid view in the left hand menu to view the homoeologous regions. Click on Export image. This will open a pop-up menu of the different image formats you can export, which are PNG and PDF.
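The FASTA header shown in step 4 packs several fields into one line. As a rough illustration, the sketch below splits that example header into its parts. The function name `parse_fasta_header` and the dictionary keys are labels chosen here for readability, not official Ensembl terms.

```python
def parse_fasta_header(header: str) -> dict:
    """Split an exported FASTA header into named fields."""
    # Example layout: >2D dna:chromosome chromosome:IWGSC:2D:378720500:378780600:1
    name, seq_type, location = header.lstrip(">").split()
    coord_system, assembly, chromosome, start, end, strand = location.split(":")
    return {
        "name": name,
        "type": seq_type,
        "coord_system": coord_system,
        "assembly": assembly,
        "chromosome": chromosome,
        "start": int(start),
        "end": int(end),
        "strand": int(strand),  # 1 = forward strand, -1 = reverse strand
    }


fields = parse_fasta_header(
    ">2D dna:chromosome chromosome:IWGSC:2D:378720500:378780600:1"
)
# Coordinates are inclusive, so the exported sequence length is end - start + 1.
sequence_length = fields["end"] - fields["start"] + 1
print(fields["assembly"], fields["chromosome"], fields["strand"], sequence_length)
```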

Exploring a genomic region in Oryza sativa Japonica (rice)

Go to the Ensembl Plants homepage and do the following:

  1. Go to the region between 405000 and 453000 on chromosome 1 in Oryza sativa Japonica.

  2. Turn on the AGILENT:G2519F-015241 microarray track. Are there any oligo probes that map to this region?

  3. Highlight the region around any reverse strand probes you can see. Do they map to any Ensembl transcripts?

  1. Go to the Ensembl Plants homepage. Select Oryza sativa Japonica from the Species drop-down list and type 1:405000-453000. Click Go.

  2. Click on Configure this page to open the menu. You can find the AGILENT:G2519F-015241 track under Oligo probes in the left-hand menu, or by using the Find a track box at the top right. Turn on the track as Normal then save and close the menu. As the AGILENT:G2519F-015241 track is stranded, it appears at the top and bottom of the view.

    There are 5 probes mapped to this region on the positive strand and one probe on the reverse strand.

  3. Drag a box around the reverse strand probe then click on Mark region to highlight.

    The highlighted region maps to two transcripts: Os01t0107900-02 and Os01t0107900-01

Genes and transcripts

Now let’s search for a rice gene. Enter OS01G0775500 into the search box on the rice species landing page (below) or in the top right-hand corner of the page.

The Search results page displays a single result corresponding to our gene of interest.

The Gene tab

Click on the link to open the Gene tab for OS01G0775500. On this page you’ll see a table presenting information on the transcript encoded by the OS01G0775500 gene, as well as a graphical model of the gene.

Look through the menu at left to see the different data and annotations available for this gene. You can access sequences; comparative genomics analyses such as alignments, phylogenetic trees and homologue predictions; GO annotations reflecting the function of the protein encoded by the gene; variation data; gene expression data; and links to related data in external repositories.

We’ll start by clicking Sequence to view the FASTA sequence of OS01G0775500.

The sequence is marked up such that you can see all exons in the region (orange highlight), as well as exons of OS01G0775500 (bold red font and orange highlighting).

The FASTA header indicates the genome assembly version (IRGSP-1.0), the chromosome (1), the genomic coordinates (32811649:32815394) and the strand (-1, which indicates that the gene is transcribed from the reverse strand).

To download or BLAST the sequence, click the buttons immediately above it.

The Download sequence button is also available below the left-hand menu.

Clicking it brings up a popup showing options to download sequences in FASTA or RTF format (RTF format preserves markup and can be edited in a word processor).

We can change the display of the FASTA sequence by clicking the Configure this page button, at the left:

This loads a popup with a variety of customisation options. For example, you can adjust the amount of flanking sequence displayed, and you can also opt to view variants annotated along the sequence. Click the tickmark in the upper right-hand corner, or anywhere outside the popup, to close the window.

In addition to gene-specific information, we can also access transcript- and genomic-location–specific details.

The Transcript tab

Let’s start by clicking the transcript ID OS01G0775500-01 in the transcript table. This opens the Transcript tab, which presents information specific to the transcript. A full list of available data is accessible from the menu at left.

To view the full sequence of the transcript, click Sequence > Exons:

The sequence is colour-coded to indicate whether it is coding or non-coding, and by default, gene variants are identified with coloured highlights. To view spliced sequence, click Sequence > cDNA:

Three tracks, representing the full cDNA sequence (top), the coding sequence (middle), and the translated coding sequence (bottom), are shown. Variant markup is on by default.

Exon and cDNA sequences can be exported by clicking the Download sequence buttons above the sequences.

Exploring the CCD7 gene in Arabidopsis thaliana

  1. Find the Arabidopsis thaliana CCD7 gene on Ensembl Plants. On which chromosome and which strand of the genome is this gene located?

  2. Where in the cell is the CCD7 protein located?

  3. What is the source of the assigned gene name?

  4. How many transcripts does it have? How long is its longest transcript (in bp)? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

  1. Go to the Ensembl Plants homepage (http://plants.ensembl.org/). Select A. thaliana from the species list and type CCD7 in the search box. Click Go and click on the gene ID AT2G44990. You can find the strand orientation and the location under Summary in the Gene tab.

    The A. thaliana CCD7 gene is located on chromosome 2 on the forward strand.

  2. Click on GO: Cellular component in the left-hand panel.

    The protein is located in the chloroplast and plastid.

  3. Click on Summary in the side menu.

    The gene name is assigned and imported from NCBI gene (formerly Entrezgene).

  4. Click on Show transcript table.

    There are 3 transcripts. The longest one is 2005 bp and the length of the encoded protein is 622 amino acids.

    Click on the transcript ID AT2G44990.3 in the transcript table. You can find the number of exons in the summary information at the top of the page.

    It has 6 exons.

    Click on Sequence: Exons in the left-hand panel.

    The first and last exons are partially untranslated (sequence shown in orange). This can also be seen in the transcript diagrams on the Gene Summary and Transcript Summary pages, where the boxes representing the first and last exons are partially unfilled.

Finding a Triticum aestivum gene

  1. Search for Oxygen evolving enhancer protein from the Ensembl Plants homepage and narrow down your search to Triticum aestivum. How many genes are there with this name in wheat? Why do you think this is? What chromosomes are they on?

  2. Go to the gene on chromosome 2B. How many protein-coding transcripts does this gene have? What is a “canonical transcript”?

  3. Click on the canonical transcript. How many exons does this transcript have? Export the protein sequence of this transcript in the FASTA format.

  1. Start at the Ensembl Plants homepage. Choose Triticum aestivum from the species drop-down, type Oxygen evolving enhancer protein into the search box then click Go.

    There are two genes named TraesCS2D02G248400 and TraesCS2B02G270300. This is because of the hybridisations in wheat’s evolutionary history. You can see that the two genes occur on chromosomes 2B and 2D.

  2. Click on the gene on chromosome 2B to go to the Gene tab. If the transcript table is hidden, click on Show transcript table to see it.

    There are 2 protein coding transcripts.

    Mouse over the Ensembl Canonical flag in the transcripts table to find a description.

    The Ensembl canonical transcript is a single transcript chosen for each gene in each species. It is the most highly conserved, most highly expressed, has the longest coding sequence and is represented in other key resources (e.g. NCBI, UniProt).

  3. Click on TraesCS2B02G270300.2 in the transcript table. You can find the number of exons in the summary description at the top of the Summary page, or you can count the number of boxes (boxes represent exons, lines represent introns) in the Summary diagram.

    TraesCS2B02G270300.2 has 2 exons.

    Go to Sequence: Protein in the left-hand panel.

    Click on the green Download sequence button above the protein sequence. Select FASTA from the drop-down in the pop-up menu and download the sequence to your local machine.

Variation and Ensembl VEP

Exploring variants in rice

Visualising variants in the Sequence view

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for OS01G0775500 in rice. Search for OS01G0775500 and go to the Sequence view.

If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.


Viewing variants within a gene in the tabular form

To view all the sequence variants in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in. For display purposes, the table above has already been filtered to only show missense variants.

You can also filter by the different pathogenicity scores and MAF, or click on Filter other columns for filtering by other columns such as Evidence or Class.

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.


Visualising variants in the Region in Detail view

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar.

Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, presence of a phenotype or by individual genome they were isolated from. You can also turn on genotyping chips.


Exploring a specific variant

Let’s have a look at a specific variant. If we zoomed in we could see the variant rs18335701 in this region, however it’s easier to find if we put rs18335701 into the search box. Click through to open the Variation tab for Oryza sativa Japonica.

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link on the left.

This page illustrates the genes the variant falls within and the consequences on those genes, including pathogenicity predictors. It also shows data from GTEx on genes that have increased/decreased expression in individuals with this variant, in different tissues. Finally, regulatory features and motifs that the variant falls within are shown.

Let’s look at population genetics. Click on Population genetics in the left-hand menu.

The population allele frequencies are shown by study. Where genotype frequencies are available, these are shown in the tables.

We can see which strains these genotypes were observed in by going to Sample Genotypes. Click on Show for the Duitama et al. 2015 population.

Demonstration of the VEP web interface

Input

We have identified three variants on wheat chromosome 4B:
C -> T at 240206468
C -> G at 240199078
C -> T at 240212229

We will use the Ensembl VEP to determine:

  • Have my variants already been annotated in Ensembl?
  • What genes are affected by my variants?
  • Do any of my variants affect gene regulation?

Click on Tools in the top green bar from any Ensembl Plants page, then Variant Effect Predictor to open the input form:

Click on Add/remove species and search for Triticum aestivum to choose it.

The data is in VCF:
chromosome coordinate id reference alternative

Put the following into the Paste data box:

4B 240206468 var1 C T  
4B 240199078 var2 C G  
4B 240212229 var3 C T  

The VEP will automatically detect that the data is in VCF.
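If your variants start out in a spreadsheet or script rather than as typed text, you can generate the lines to paste. This is a small sketch under that assumption: `to_vep_lines` is a hypothetical helper, and it writes space-separated fields in the column order shown above (a strict VCF file is tab-delimited, so use `"\t"` as the separator if you are writing a file instead).

```python
# (chromosome, coordinate, id, reference, alternative), as in the demo above
variants = [
    ("4B", 240206468, "var1", "C", "T"),
    ("4B", 240199078, "var2", "C", "G"),
    ("4B", 240212229, "var3", "C", "T"),
]


def to_vep_lines(records, separator=" "):
    """Join each record's fields into one line of pasteable input."""
    return "\n".join(
        separator.join(str(field) for field in record) for record in records
    )


vep_input = to_vep_lines(variants)
print(vep_input)
```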


Additional configurations

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotations, Predictions, Filtering options and Advanced options. Let’s open all the menus and take a look.

Hover over the options to see definitions.

When you’ve selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done; you do not need to refresh the page. You can edit or discard your job at this time. If you have submitted multiple jobs, they will all appear here.

Click View results once your job is done.


Results

In your results you will see a graphical summary of your data, as well as a table of your results.

The results table is enormous and detailed, so we’re going to go through it section by section. The first column is Uploaded variant. If your input data contains IDs, like ours does, the ID is listed here. If your input data is only loci, this column will contain the locus and alleles of the variant. You’ll notice that the variants are not necessarily in the same order as in your input. You’ll also see that there are multiple lines in the table for each variant, with each line representing one transcript or other feature the variant affects.

You can mouse over any column name to get a definition of what is shown.

The next few columns give the information about the feature the variant affects, including the consequence. Where the feature is a transcript, you will see the gene symbol and stable ID and the transcript stable ID and biotype. The IDs are links to take you to the gene or transcript homepage.

This is followed by details on the effects on transcripts, including the position of the variant in terms of the exon number, cDNA, CDS and protein, the amino acid and codon change and pathogenicity scores. Where the variant is known, the ID of the existing variant is listed, with a link out to the variant homepage. The pathogenicity scores are shown as numbers with coloured highlights to indicate the prediction, and you can mouse-over the scores to get the prediction in words.

Above the table is the Filter option, which allows you to filter by any column in the table. You can select a column from the drop-down, then a logic option from the next drop-down, then type in your filter to the following box. We’ll try a filter of Consequence, followed by is then missense_variant, which will give us only variants that change the amino acid sequence of the protein. You’ll notice that as you type missense_variant, the VEP will make suggestions for an autocomplete.

You can export your VEP results in various formats, including VCF. When you export as VCF, you’ll get all the VEP annotation listed under CSQ in the INFO column. After filtering your data, you’ll see that you have the option to export only the filtered data. You can also drop all the genes you’ve found into the Gene BioMart, or all the known variants into the Variation BioMart to export further information about them.
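To work with the VCF export in a script, the CSQ annotation needs unpacking. In a VEP VCF, the header line for CSQ lists the field names after "Format:", each CSQ entry is pipe-separated, and entries for different transcripts or features are comma-separated. The sketch below assumes that layout. The header and INFO strings are invented, shortened examples (real output has many more fields), and `csq_fields` and `parse_csq` are hypothetical helper names.

```python
def csq_fields(header_line):
    """Extract the field names listed after 'Format: ' in the CSQ header line."""
    description = header_line.split("Format: ", 1)[1]
    return description.strip().rstrip('">').split("|")


def parse_csq(info, fields):
    """Return one dict per CSQ entry found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            return [dict(zip(fields, entry.split("|"))) for entry in entries]
    return []


# Invented, shortened example data for illustration only.
header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|Gene|Feature">'
)
info = "CSQ=T|missense_variant|GENE1|TRANSCRIPT1,T|upstream_gene_variant|GENE2|TRANSCRIPT2"

annotations = parse_csq(info, csq_fields(header))
# A single entry can carry several consequence terms joined by '&'.
missense = [
    entry for entry in annotations
    if "missense_variant" in entry["Consequence"].split("&")
]
print(missense)
```

This mirrors the Consequence is missense_variant filter used in the web interface, applied to an exported file instead.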






Exploring a SNP in Arabidopsis

The Arabidopsis thaliana ATCDSP32 protein is a chloroplastic drought-induced stress protein proposed to participate in a process called cell redox homeostasis. Go to Ensembl Plants and answer the following questions:

  1. How many variants have been identified in the gene that can cause a change in the protein sequence (i.e. missense variant)?

  2. What is the ID of the variant that changes the amino acid residue 60 from Alanine to Threonine (hint: refer to an amino acid codon table)? What is the location of this SNP in the A. thaliana genome? What are its possible alleles?

  3. Download the flanking sequence of this SNP in RTF (Rich Text Format). Can you change how much flanking sequence is displayed on the browser?

  4. Does this SNP cause a change at the amino acid level for other genes or transcripts?

  1. Click on Arabidopsis thaliana on the Ensembl Plants homepage. Search for ATCDSP32 on the species page and in the search results, click on the Gene ID AT1G76080. In the left-hand side menu of the Gene tab, click on Variant table. Click on Consequences: All then select only missense variant.

    The missense variant button indicates that there are 18 of these. Alternatively, you can count the number of variants in your filtered list.

  2. An amino acid codon table can be found on Wikipedia. Sort the AA coord column by clicking on the header and scroll down to find a variant at residue 60. The ID of this variant is ENSVATH05153232.

    The variant is located at position 28549171 on chromosome 1. The two possible alleles at this locus are C (reference) and T (alternative).

  3. Click on the link ENSVATH05153232, then click on Flanking sequence in the left-hand side menu. Now click on Download sequence and select File format > Rich Text Format (RTF).

    If you want to change how much flanking sequence is displayed on the browser, go back to the Flanking sequence page, click on the Configuration button and change the length of the sequence. The default setting is 400 bp.

  4. Click on Genes and regulation in the left-hand side menu.

    This SNP does not cause a change at the amino acid level for any other genes or transcripts in A. thaliana.
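The codon table hint in question 2 can be worked through in a few lines. The sketch below lists the standard codons for alanine and threonine and finds which single-base substitutions convert one into the other. The function name is hypothetical. The final complement step is only relevant if the transcript lies on the reverse strand, which is not stated above, so treat that part as an assumption to check on the gene Summary page.

```python
# Standard genetic code, written as DNA codons on the coding strand.
ALA_CODONS = ["GCT", "GCC", "GCA", "GCG"]
THR_CODONS = ["ACT", "ACC", "ACA", "ACG"]
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def single_base_changes(from_codons, to_codons):
    """Return the set of (position, old_base, new_base) for codon pairs differing at one base."""
    changes = set()
    for before in from_codons:
        for after in to_codons:
            diffs = [
                (i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a
            ]
            if len(diffs) == 1:
                changes.add(diffs[0])
    return changes


changes = single_base_changes(ALA_CODONS, THR_CODONS)
print(changes)

# If the transcript were on the reverse strand (an assumption), the same change
# would be reported on the forward strand as the complementary bases.
forward_strand_view = {(COMPLEMENT[old], COMPLEMENT[new]) for _, old, new in changes}
print(forward_strand_view)
```

Every alanine to threonine change by a single base is G to A at the first codon position, and its complement is C to T, which would match the C/T alleles reported for this variant under the reverse-strand assumption.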

Web VEP analysis of variants in Oryza sativa Japonica (rice)

You’ll find a VCF file here. This is a small subset of the output of an Oryza sativa Japonica whole-genome sequencing and variant-calling experiment. Analyse the variants in this file with the VEP tool in Ensembl Plants and determine the following:

  1. How many genes and transcripts are affected by variants in this file?

  2. Do these variants result in a change in the proteins encoded by any of the Ensembl genes? Which genes are affected? What is the amino acid change? What is the pathogenicity prediction score for this change?

Go to Ensembl Plants and click on Tools at the top of the page. Click on Variant Effect Predictor and select Oryza sativa Japonica Group from the Species menu.

Either click on Choose file and select the file to upload it, or directly paste the URL into the Or provide file URL: box. Click Run at the bottom of the page. When your job is done, click View results.

  1. The number of affected genes and transcripts is shown in the Summary statistics table at the top.

    8 genes and 8 transcripts are affected by these variants.

  2. Use the filters to view only missense variants. The filters are found above the detailed results table in the middle. Select Consequence and is from the drop-down menus. Then type missense_variant into the box. Click Add to apply your filter.

    1 variant is a missense variant. It causes a leucine to arginine (L/R) change at position 16 in the gene OS09G0103500. The SIFT score is 0.01 (Deleterious low confidence). Refer to this link for more information on SIFT (https://sift.bii.a-star.edu.sg/).

BioMart

Follow these instructions to guide you through BioMart to answer the following query:

  1. What genes are found on chromosome 9, between 15274000 and 15300000 in Oryza sativa Japonica?
  2. What are the NCBI Gene IDs for these genes?
  3. Are there associated functions from the GO (gene ontology) project that might help describe their function?
  4. What are their cDNA sequences?

Click on BioMart in the top header of a plants.ensembl.org page to go to: plants.ensembl.org/biomart/martview

You cannot choose any filters or attributes until you’ve chosen your dataset. Your dataset is the data type you’re working with. In this case we’re going to choose rice genes, so pick Ensembl Plants Genes then Oryza sativa Japonica Group genes from the drop-downs.

Now that you’ve chosen your dataset, the filters and attributes will appear in the column on the left. You can pick these in any order and the options you pick will appear.

Click on Filters on the left to see the available filters appear on the main page. You’ll see that there are loads of categories of Filters to choose from. You can expand these by clicking on them. For our query, we’re going to expand REGION.

Our input data is a locus, so we’re going to use the chromosome and coordinates filters. Choose chromosome 9 from the drop-down menu and paste in the start and the end coordinates (15274000 and 15300000). The filters will be autoselected when you add values to them and will appear in the left-hand column.

To check if the filters have worked, you can use the Count button at the top left, which will show you how many genes have passed the filter. If you get 0 or another number you don’t expect, this can help you to see if your query was effective.

To choose the attributes, expand this in the menu. There are five categories for rice gene attributes. These categories are mutually exclusive: you cannot pick attributes from multiple categories. This means that we need to do two separate queries, one to get our GO terms and NCBI IDs and one to get our cDNA sequences.

The Ensembl gene and transcript IDs are selected by default. The selected attributes are also listed on the left.

We can choose the attributes we want by clicking on them. For our query, we’re going to select:

  • GENE
    • Gene Name
  • EXTERNAL
    • NCBI gene ID
    • GO term accession
    • GO term name
    • GO term definition

We need to select the Gene Name in order to get back our original input, as this is not returned by default in BioMart. The order in which you select the attributes defines the order of the columns in your output table.

You can get your results by clicking on Results at the top left.

The results table just gives you a preview of the first ten lines of your query. This allows the results to load quickly, so that if you need to make any changes to your query, you don’t waste any time. To see the full table you can click on View ## rows. You can also export the data to an xls, tsv, csv or html file. For large queries, it is recommended that you export your data as Compressed web file (notify by email), to ensure your download is not disrupted by connection issues.

You can see multiple rows per gene in your input list, because there are multiple transcripts per gene and multiple GO terms per transcript.
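If you export the table as TSV, the repeated rows can be collapsed to one entry per gene afterwards. The sketch below groups GO term accessions by gene using only the Python standard library. The column headers are assumed to match the attribute names selected above, the rows are invented placeholders rather than real query results, and `go_terms_per_gene` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows; a real export will have different IDs and terms.
tsv_text = (
    "Gene stable ID\tGene name\tGO term accession\n"
    "GENE_A\tnameA\tGO:0000001\n"
    "GENE_A\tnameA\tGO:0000002\n"
    "GENE_A\tnameA\tGO:0000001\n"
    "GENE_B\tnameB\t\n"
)


def go_terms_per_gene(text):
    """Map each gene ID to a sorted list of its unique GO term accessions."""
    terms = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        gene = row["Gene stable ID"]
        terms[gene]  # create the entry so genes without GO terms are still listed
        if row["GO term accession"]:
            terms[gene].add(row["GO term accession"])
    return {gene: sorted(accessions) for gene, accessions in terms.items()}


collapsed = go_terms_per_gene(tsv_text)
print(collapsed)
```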

To get the cDNA sequences, go back to the Attributes then select the category Sequences and expand SEQUENCES.

When you select the sequence type, the part of the transcript model you’ve chosen will be highlighted in the graphic.

Choose cDNA sequences, then expand HEADER INFORMATION to add Gene Name to the header. Then hit Results again.

For more details on BioMart, have a look at this publication:

Kinsella, R.J. et al
Ensembl BioMarts: a hub for data retrieval across taxonomic space.
http://europepmc.org/articles/PMC3170168

Ensembl Plants: finding genes by protein domain

One class of disease resistance (R) genes in plants is the TIR-NBS-LRR genes, which code for proteins that contain an N-terminal Toll/Interleukin receptor homology region (TIR), a nucleotide binding site (NBS) and a C-terminal leucine rich repeat (LRR). TIR-NBS-LRR genes are common in dicots but seem to be rare in monocots (Tarr and Alexander. TIR-NBS-LRR genes are rare in monocots: evidence from diverse monocot orders. BMC Res Notes 2009 Sep 8;2:197).

The ID for the TIR domain in the Pfam (protein family) database is PF01582.

Use BioMart in Ensembl Plants to generate a list of all Solanum tuberosum (potato; a dicot) genes that are annotated to contain a TIR domain. Include the Ensembl stable ID and gene description. Do the same for Zea mays (maize; a monocot).

Do your results confirm the findings of Tarr and Alexander?

  1. Go to Ensembl Plants and click on the link Tools at the top of the page. Click on BioMart. Choose the Ensembl Plants Genes database. Choose the Solanum tuberosum genes dataset.

  2. Now, filter for the genes containing a TIR domain: Click on Filters in the left panel. Expand the PROTEIN DOMAINS AND FAMILIES section. Select Limit to genes with these family or domain IDs and enter PF01582 in the box. Select Pfam ID(s) (e.g. PF00004) from the drop-down menu. Click on Count in the toolbar.

    This should give you 78 / 40336 genes.

  3. Specify the attributes to be included in the output (note that a number of attributes will already be selected by default). Click on Attributes in the left panel. Expand the GENE section. Deselect Transcript stable ID. Select Gene description.

  4. Now click on Results. The first 10 results are displayed by default. You can display all results by selecting View: All rows from the drop-down menu. If you prefer, you can also export as a CSV, TSV or XLS file by using the Export all results to option.

    Repeat the above for the Z. mays genes (B73 RefGen_v4) dataset.

    Your results should show 3 / 44303 genes containing a TIR domain for maize. The results confirm the findings of Tarr and Alexander.
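
Because the two genomes have different total gene counts, it can help to compare the proportions rather than the raw numbers. A short Python sketch using only the counts reported above:

```python
# Counts reported by the Count button in the exercise above:
# (genes with a TIR domain, total genes in the dataset).
counts = {
    "Solanum tuberosum (dicot)": (78, 40336),
    "Zea mays (monocot)": (3, 44303),
}


def percent_with_domain(hits, total):
    """Percentage of genes in the dataset that passed the domain filter."""
    return 100.0 * hits / total


for species, (hits, total) in counts.items():
    share = percent_with_domain(hits, total)
    print(f"{species}: {hits} / {total} genes = {share:.4f}%")
```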

Mapping Uniprot IDs in Ensembl Plants

BioMart is a very handy tool when you want to map between different databases. The following is a list of IDs from the UniProtKB/Swiss-Prot database of Arabidopsis thaliana proteins that are supposedly involved in flavonoid metabolism: P42813, Q9LS08, Q9ZST4, Q9SYM2, P51102, Q9LPV9, Q9FE25, Q96323, Q9FKW3, P13114, P41088, Q9S818, Q96330, O22203, Q39224, O22264, Q9SD85, Q9LYT3, Q9FJA2, Q43128, P43254, O04153, Q43125, Q9S9P6, Q94C57, Q9LNE6, Q9FK25, Q9SYM5, Q9ZQ95

Using BioMart in Ensembl Plants, generate a list that shows to which Ensembl Gene IDs these UniProtKB/Swiss-Prot IDs map. Also include the gene name and description.

  1. Go to BioMart in Ensembl Plants. Click the New button on the toolbar in the top left-hand corner to start a new query. Choose the Ensembl Plants Genes database. Choose the Arabidopsis thaliana genes dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Select ID list limit – UniProt/Swissprot ID(s). Enter the list of IDs in the text box (either comma separated or as a list).

  3. Click on Attributes in the left panel. Expand the GENE section. Deselect Transcript Stable ID. Select Gene name and Gene description. Expand the EXTERNAL section. Select UniProtKB/SwissProt ID(s).

  4. Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file. Tick the box Unique results only.

    Your results should show 28 / 32833 genes.
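
The same kind of query can also be written down as a BioMart XML document, which is handy if you want to keep a record of it or re-run it later. The sketch below only builds the XML text and does not contact any server. The internal names in it (the `plants_mart` schema, the `athaliana_eg_gene` dataset, the `uniprotswissprot` filter and the attribute names) are assumptions, not taken from this exercise; the XML button on the BioMart toolbar, where available, shows the exact names for your own query.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes, virtual_schema="plants_mart"):
    """Build a BioMart XML query string.

    dataset, filter and attribute names are passed in by the caller; the
    defaults and the example values below are assumptions to be verified
    against the XML that BioMart shows for your own query.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": virtual_schema,
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",  # corresponds to ticking 'Unique results only'
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # ID-list filters take a single comma-separated value.
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


# First three IDs from the exercise list, to keep the example short.
uniprot_ids = ["P42813", "Q9LS08", "Q9ZST4"]
query_xml = build_biomart_query(
    dataset="athaliana_eg_gene",                 # assumed dataset name
    filters={"uniprotswissprot": uniprot_ids},   # assumed filter name
    attributes=[                                 # assumed attribute names
        "ensembl_gene_id",
        "external_gene_name",
        "description",
        "uniprotswissprot",
    ],
)
print(query_xml)
```

As in the web interface, the order of the attributes in the list determines the order of the columns in the output.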

Mapping microarray probes to genes

The following is a list of 101 probes that were upregulated after short-term phosphate deprivation of Arabidopsis thaliana. The microarray used was the Arabidopsis thaliana whole genome Affymetrix gene chip (ATH1) (Misson et al. A genome-wide transcriptional analysis using Arabidopsis thaliana Affymetrix gene chips determined plant responses to phosphate deprivation. Proc Natl Acad Sci U S A. 2005 August 16; 102(33): 11934–11939).

259842_at, 251193_at, 259303_at, 252534_at, 266957_at, 257891_at, 263593_at, 266372_at, 265342_at, 254011_at, 260623_at, 262238_at, 264118_at, 256910_at, 263846_at, 249996_at, 248094_at, 267361_at, 246275_at, 258034_at, 248622_at, 263483_at, 254250_at, 257964_at, 248566_s_at, 245263_at, 264636_at, 264342_at, 254125_at, 262369_at, 259399_at, 251770_at, 266132_at, 246001_at, 246075_at, 258887_at, 258856_at, 263391_at, 256376_s_at, 266766_at, 258277_at, 266142_at, 246071_at, 261021_at, 251143_at, 252730_at, 249337_at, 258158_at, 245882_at, 250054_at, 263539_at, 263851_at, 247949_at, 262229_at, 246777_at, 258975_at, 247026_at, 252265_at, 256100_at, 246099_at, 246302_at, 254111_at, 256017_at, 259750_at, 254215_at, 253271_s_at, 247314_at, 267567_at, 250435_at, 255543_at, 259479_at, 264783_at, 245193_at, 260561_at, 263948_at, 258682_at, 253386_at, 263847_at, 266017_at, 252414_at, 255360_at, 251176_at, 266743_at, 253829_at, 267497_at, 258613_at, 253163_at, 261648_at, 258100_at, 249983_at, 266413_at, 264261_at, 256627_at, 249640_at, 248164_at, 266184_s_at, 247047_at, 263083_at, 251961_at, 252011_at, 260101_at

(a) Generate a list of the genes to which these probesets map. Include the Ensembl Gene ID, name, description and probe ID attributes.

(b) As a first step towards analysing these genes for regulatory features they may have in common, retrieve the 250 bp upstream of their transcripts. Include the Ensembl Gene ID, name and description attributes in the sequence header.

(a) Click the New button on the toolbar. Choose the Ensembl Plants Genes database. Choose the Arabidopsis thaliana genes dataset.

Click on Filters in the left panel. Expand the GENE section. Select Input microarray probes/probesets ID list – Affymetrix array Arabidopsis ATH1 121501 ID(s). Enter the list of probeset IDs in the text box (either comma separated or as a list). Click the Count button on the toolbar.

Click on Attributes in the left panel. Expand the GENE section. Deselect Transcript Stable ID. Select Gene name and Gene description. Expand the EXTERNAL section. Select Affymetrix array Arabidopsis ATH1 121501.

Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file. Tick the box Unique results only.

Your results should show 104 / 32833 genes. The count exceeds the 101 probes in the input list, apparently because a few probes have been mapped to more than one gene.
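
To see which probes are responsible, you can group the exported rows by probe ID. The following Python sketch shows the idea on made-up data; the gene and probe IDs are placeholders, and in practice the pairs would come from the gene ID and probe ID columns of your exported table.

```python
from collections import defaultdict

# Hypothetical (gene ID, probe ID) pairs standing in for the exported table.
rows = [
    ("GENE_1", "probe_a_at"),
    ("GENE_2", "probe_b_s_at"),
    ("GENE_3", "probe_b_s_at"),
    ("GENE_4", "probe_c_at"),
]


def multi_gene_probes(pairs):
    """Return {probe ID: sorted gene IDs} for probes mapped to more than one gene."""
    genes_by_probe = defaultdict(set)
    for gene, probe in pairs:
        genes_by_probe[probe].add(gene)
    return {
        probe: sorted(genes)
        for probe, genes in genes_by_probe.items()
        if len(genes) > 1
    }


for probe, genes in multi_gene_probes(rows).items():
    print(probe, "->", ", ".join(genes))
```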

(b) You can leave the dataset and filters the same, so you can directly specify the attributes:

Click on Attributes in the left panel. Select the Sequences attributes page. Expand the SEQUENCES section. Select Flank (Transcript). Enter 250 in the Upstream flank text box. Expand the HEADER INFORMATION section. Select, in addition to the default selected attributes, Gene name and Gene description.

Note: Flank (Transcript) will give the flanks for all the transcripts of a gene with multiple transcripts. Flank (Gene) will give the flank for the transcript with the outermost 5’ (or 3’) end.

Click the Results button on the toolbar. Select View: All rows as FASTA or export all results to a file.
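
Once you have exported the FASTA file, a quick sanity check is to confirm that no sequence is longer than the 250 bp you asked for. Below is a minimal Python sketch of such a check. The sample records are invented, and the pipe-separated header layout is an assumption about how the selected header attributes are joined; adjust it to match your own file.

```python
# Hypothetical FASTA excerpt: headers and sequences are placeholders, and
# the '|' separator between header fields is an assumption.
SAMPLE_FASTA = """>GENE_1|nameA|description A
ACGTACGTAC
GGTTAACC
>GENE_2|nameB|description B
TTGACA
"""


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    A list is used rather than a dict because several transcripts of one
    gene can share the same header.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def too_long(records, max_length=250):
    """Return the headers of sequences longer than max_length."""
    return [header for header, seq in records if len(seq) > max_length]


records = read_fasta(SAMPLE_FASTA)
for header, seq in records:
    print(header.split("|")[0], len(seq))
print("Sequences over 250 bp:", too_long(records))
```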