Ensembl TrainingEnsembl Home

Ensembl Browser Workshop - University of Nigeria

Course Details

Lead Trainer
Louisse Paola Mirabueno
Event Date
2023-03-29
Location
  Virtual
Description
Work with the Ensembl Outreach team to get to grips with the Ensembl and Ensembl Genomes browser. Discover how you can access gene, variation and comparative genomics data, and how to retrieve these data with BioMart.
Survey
 Ensembl Browser Workshop - University of Nigeria Feedback Survey

Demos and exercises

Species and genome assemblies

Demo: Introduction to Ensembl

Ensembl

Homepage

The front page of Ensembl is found at ensembl.org. It contains lots of information and links to help you navigate Ensembl:

On the right-hand panel you can see the current release number and what has come out in this release. To access old releases, scroll to the bottom of the page and click on View in archive site in the right-hand corner.

Click on the links to go to the archives. Alternatively, you can jump quickly to the correct release by adding e plus the release number in the URL. For example e98.ensembl.org jumps to Ensembl release 98.  
 
 

Available species

Scroll back up to the top of the homepage. You can view all available species by clicking the View full list of all species link underneath the coloured search block.

You can search for your species of interest (either the common or scientific name) using the search bar at the top right-hand corner of the table. Click on the common name of your species of interest to go to the species information page. We’ll click on Human.

 
 
 

Species information

Here you can see links to example features and to download flatfiles. To find out more about the genome assembly and genebuild, click on More information and statistics under the Genome assembly section.

Here you’ll find a detailed description of how to the genome was produced and links to the original source. You will also see details of how the genes were annotated.

The current genome assembly for human is GRCh38. If you want to see the previous assembly, GRCh37, visit our dedicated site grch37.ensembl.org.

 
 
 

Ensembl Genomes

Homepage

Let’s take a look at the Ensembl Genomes homepage at ensemblgenomes.org.

Click on the different taxa to see their homepages. Each one has a different colour-coding, but they are all structured in a similar format to the Ensembl main site.

You can navigate most of the taxa in the same way as you would with Ensembl.  
 
 

Ensembl Bacteria

Ensembl Bacteria has a large number of genomes and has a slightly different method to the other Ensembl sites. Let’s look at it in more detail.

There’s no drop-down species list for bacteria as it would be hard to navigate with the number of species. You can click the View full list of all Ensembl Bacteria species link underneath the coloured search block. Search for your species of interest using the filter in the top right-hand corner of the table.

Alternatively, you can find a species by typing the species name into the Search for a genome search box at the top of the page. A drop-down list will appear with any species matching the name you entered.

For example, to find a sub-strain of Clostridioides difficile start typing in the species name. Due to the auto-complete, you’ll see useful results as soon as you get to Clostridio.

The drop down contains various strains of C. difficile. Let’s choose C. difficile 630. This will take us to another species information page, where we can explore various features.

Unlike the Homo sapiens species information page, there is no prose description of the genome or gene annotation, as these pages were generated automatically.  
 
 

Ensembl Rapid Release

Our newest genomes, such as those coming from the Darwin Tree of Life, are available rapid.ensembl.org with limited annotation.

Finding a genome in Ensembl Bacteria

Mycobacterium tuberculosis H37Ra str. ATCC25177 is a clinical strain.

Go to Ensembl Bacteria and find the species M. tuberculosis H37Ra str. ATCC25177. How many coding genes does it have?

In the Ensesmbl Bacteria homepage, start to type H37Ra into the Search for a genome search box (you can find this in the coloured block at the top of the homepage). It will auto-complete, allowing you to select M. tuberculosis H37Ra str. ATCC25177 from the drop-down list. Click on More information and statistics.

M. tuberculosis H37Ra str. ATCC25177 has 4,080 coding and 47 non-coding genes.

Fusarium genus

Go to Ensembl Fungi and answer the following questions:

  1. How many genomes of the genus Fusarium are in Ensembl Fungi?

  2. What is the accession ID of the Fusarium graminearum str. CS3005 INSDC assembly?

  3. How many genes have been annotated for F. graminearum str. CS3005?

  1. On the Ensembl Fungi homepage, click on View full list of all Ensembl Fungi species (you can find this under the coloured search block). Type Fusarium into the filter box in the top right-hand corner of the table.

    There are 75 Fusarium genomes in Ensembl Fungi. Many of them are multiple strains of the same species, or multiple assemblies of the same strain/species.

  2. There are several ways to do this. You can: search for F. graminearum str. CS3005 in the species list and find the accession ID under the Accession column; or click on the species name to go to the species information page and click on More information and statistics. The accession ID can be found in the Summary statistics panel on the right-hand side.

    The accession ID of the INSDC assembly is GCA_000599445.1.

  3. Go to the species information page of F. graminearum str. CS3005. and click on More information and statistics. You can find the Gene counts statistics on the right-hand panel.

    There are 13,355 coding genes in F. graminearum str. CS3005.

Oryza sativa Japonic (rice) gene counts

Find the species Oryza sativa Japonica in Ensembl Plants. How many coding and non-coding genes does it have?

Select Oryza sativa Japonica from the homepage to go to its species information page. Click on More information and statistics.

Oryza sativa Japonica has 37,960 coding and 1,011 non-coding genes.

Solanum genus

Go to Ensembl Plants and answer the following questions:

  1. How many genomes of the genus Solanum are there in Ensembl Plants?

  2. When was the current Solanum lycopersicum genome assembly last revised?

  1. On the homepage, click on View full list of all Ensembl Plants species underneath the coloured search block. Type Solanum into the filter box in the top left-hand corner of the table.

    There are three Solanum genomes: Solanum lycopersicum (tomato), and Solanum tuberosum RH89-039-16 and Solanum tuberosum (both potato).

  2. Click on S. lycopersicum, then on More information and statistics.

    The genome was revised in April 2018.

Panda species

Go to Ensembl and find the following information:

  1. What is the name of the genome assembly for Panda?

  2. How long is the Panda genome (in bp)? How many coding genes have been annotated?

  1. Select Giant panda from the drop down species list, or click on View full list of all Ensembl species, then choose Giant panda from the list.
    The assembly is ASM200744v2 or GCA_002007445.2.

  2. Click on More information and statistics. Statistics are shown in the tables on the left.
    The length of the genome is 2,444,060,653 bp.
    There are 20,857 coding genes.

Exploring genomic regions

Region in Detail view

Start at the Ensembl Bacteria homepage, bacteria.ensembl.org. Search for your species of interest either by using the search box, or opening the full list of species by clicking View full list of all Ensembl Bacteria species underneath the search box.

Enter Escherichia coli str. K-12 substr. MG1655 (GCA_000005845) in the search box. Enter Chromosome:3144663-3157453 into the species-specific search box:

Press Enter or click Go to jump directly to the Region in detail page.

Click on the button to view page-specific help. The help pages provide text, labelled images and, in some cases, help videos to describe what you can see on the page and how to interact with it.

The Region in detail page is made up of three images, let’s look at each one on detail.

The first image shows the chromosome:

You can jump to a different region by clicking and dragging the yellow and blue handles.

If you want to move to your highlighted region, you click on the region shaded in red.

The second image shows a 50 kb region (the size varies per genome and depends on the gene size and density; you can find a scale at the top of the view) around our selected region. This view allows you to scroll back and forth along the chromosome.

Click and drag your mouse to highlight a region. A pop-up window will appear with options to jump to or centre on the highlighted region.

Click on the X to close the pop-up menu.

Click on the Drag/Select button to change the action of your mouse click. Now you can scroll along the chromosome by clicking and dragging within the image. As you do this you’ll see the image below grey out and update to your scrolled region. To go back to go back to where you started, you can click the Back button of your browser.

The third image is a detailed, configurable view of the region.

Click on the Drag/Select option at the top or bottom right to switch mouse action. On Drag, you can click and drag left or right to move along the genome, the page will reload when you drop the mouse button. On Select you can drag out a box to highlight or zoom in on a region of interest.

We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image.

You can enable tracks in different styles; more details are in the FAQs.

Let’s add the following tracks to our view:

  • Start/stop codons
  • All repeats

Now click on the check icon in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image.

We can also change the way the tracks appear by clicking on the track name then hovering over the cog wheel to open its menu. We can move tracks around by clicking and dragging on the bar to the left of the track name.

Now that you’ve got the view how you want it, you might like to show something you’ve found to a colleague or collaborator. Click on the Share this page button to generate a URL with your set configurations. Email the link to someone else, so that they can see the same view as you, including all the tracks you’ve added. These links contain the Ensembl release number, so if a new release or even assembly comes out, your link will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.

Exploring a genomic region in Staphylococcus aureus

Go to the Ensembl Bacteria homepage and do the following:

  1. Search for the Staphylococcus aureus subsp. aureus NCTC 8325 (GCA_000013425).

  2. Search for the gene gyrA.

  3. What are the genomic coordinates of this gene? Is gyrA located on the forward or reverse strand?

  4. Name two genes located upstream and downstream of gyrA.

  1. On the Ensembl bacteria homepage, type NCTC 8325 into the Search for a genome box. Click on the auto-completed genome name to navigate to the species information page.

  2. Type gyrA into the search box. Click Go.

  3. The coordinates of the gyrA are 7,005-9,668. The gene is located on the forward strand.

  4. SAOUHSC_00005 (DNA gyrase, B subunit) is located upstream gyrA, and SAOUHSC_00007 (a conserved hypothetical protein) is located downstream gyrA.

Exploring a genomic region in Salmonella enterica

Go to Ensembl Bacteria and do the following:

  1. Search for the Salmonella enterica subsp. enterica serovar Typhi str. Ty2 (GCA_000007545) (Hint: type Ty into the Search for a genome box).

  2. Go to the region Chromosome:2000605-2009742.

  3. How many genes are annotated in this region? How many are on the forward strand? How many are on the reverse strand?

  1. Go to the Ensembl Bacteria homepage. Type Ty2 into the Search for a genome box. Click on the auto-completed genome name to navigate to the species information page.

  2. Type Chromosome:2000605-2009742 into the search box. Click Go.

  3. There are 8 genes annotated in this region, all on the reverse strand.

Exploring a region in Coprinopsis cinerea okayama

Go to Ensembl Fungi. Let’s try to find some information about the region from 1,400,000 to 1,425,000 in chromosome 7 in Coprinopsis cinerea okayama:

  1. How many complete genes are found in this region? How many on the forward and how many on the reverse strand?

  2. Zoom in on the largest gene EFI27358. How many exons does this gene have?

  3. Export the genomic sequence in FASTA format for this region.

  1. In the Ensembl Fungi homepage, select Coprinopsis cinerea okayama from the Species search drop-down. Enter 7:1400000-1425000 in the Search bar and click Go. This will send you to the Location tab. Your region of interest is indicated by a red rectangle in the 50kb view. Look at the Genes track: each block represents a different gene. Count the number of complete genes within the rectangle.

    There are 7 complete genes in the region.

  2. Look at the Region in detail view (the most detailed view at the bottom of the page). You can zoom into a region by clicking and dragging your mouse (you can change your mouse action in the top right-hand corner of the view under **Drag/Select) and selecting Jump to region in the pop-up menu. Count the number of blocks you can see for EFI27358.

    The EFI27358 gene has 23 exons.

    Click on the transcript ID CZT99117 in the transcript table.

    It has 4 exons.

  3. We want to export the genomic sequence for our original region (not just the EFI27358 gene). You can reset the view by entering 7:1400000-1425000 in the Location bar above the Region in detail view or hitting the Back button on your internet browser. Click on Export data in the left-hand panel. In the pop-up menu, select FASTA from the drop-down and click Next >. You can export the sequence as is (text) or as a compressed file (.gz).

    If you choose to download the sequence as text, your browser might open the FASTA file in a new tab. In this case, just right-click on any white space and select Save As… from the menu.

Exploring a genomic region in Oryza sativa Japonica (rice)

Go to the Ensembl Plants homepage and do the following:

  1. Go to the region 1:405000-453000 in Oryza sativa Japonica.

  2. Turn on the AGILENT:G2519F-015241 microarray track. Are there any oligo probes that map to this region?

  3. Highlight the region around any reverse strand probes you can see. Do they map to any transcripts?

  1. Go to the Ensembl Plants homepage. Select Oryza sativa Japonica from the Species drop-down list and type 1:405000-453000. Click Go.

  2. Click on Configure this page to open the menu. You can find the AGILENT:G2519F-015241 track under Oligo probes in the left-hand menu, or by using the Find a track box at the top right. Turn on the track then save and close the menu. As the AGILENT:G2519F-015241 track is stranded, it appears at the top and bottom of the view.

    There are 5 probes mapped to this region on the positive strand and one probe on the reverse strand.

  3. Drag a box around the reverse strand probe then click on Mark region to highlight.

    The highlighted region maps to two transcripts: Os01t0107900-02 and Os01t0107900-01

Exploring a genomic region in human

Go to Ensembl.

  1. Go to the region from 32,264,000 to 32,492,000 bp on human chromosome 13. On which cytogenetic band is this region located? How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)?

  2. Zoom in on the BRCA2 gene.

  3. Configure this page to turn on the LTR (repeat) track in this view. What tool was used to annotate the LTRs according to the track information? How many LTRs can you see within the BRCA2 gene? Do any overlap exons?

  4. Create a Share link for this display. Email it to your neighbour. Open the link they sent you and compare. If there are differences, can you work out why?

  5. Export the genomic sequence of the region you are looking at in FASTA format.

  6. Turn off all tracks you added to the Region in detail page.

  1. Go to the Ensembl homepage, select Human from the Species drop-down list and type 13:32264000-32492000 in the text box (alternatively leave the Search drop-down list as it is and type 13:32264000-324920000 in the text box). Click Go.

    This genomic region is located on cytogenetic band q13.1. It is made up of three contigs, indicated by the alternating light and dark blue coloured bars in the Contigs track.

  2. Draw with your mouse a box encompassing the BRCA2 transcripts. Click on Jump to region in the pop-up menu.

  3. Click Configure this page in the side menu (or on the cog wheel icon in the top left hand side of the bottom image). Go into Repeats in the left-hand menu then select LTR. Click on the (i) button to find out more information.

    Repeat Masker was used to annotate LTRs onto the genome.
    Save and close the new configuration by clicking on ✓ (or anywhere outside the pop-up window). There are ten LTRs overlapping BRCA2, none of them overlap exons.

  4. Click Share this page in the side menu. Copy the URL. Get your neighbour’s email address and compose an email to them, paste the link in and send the message. When you receive the link from them, open the email and click on your link. You should be able to view the page with the new configuration and data tracks they have added to in the Location tab. You might see differences where they specified a slightly different region to you, or where they have added different tracks.

    Here is the Share link from the video answer: https://may2021.archive.ensembl.org/Homo_sapiens/Share/71a173bba78f0dbe03e48d3240424943?redirect=no;mobileredirect=no

  5. Click Export data in the side menu. Leave the default parameters as they are (FASTA sequence should already be selected). Click Next>. Click on Text. Note that the sequence has a header that provides information about the genome assembly (GRCh38), the chromosome, the start and end coordinates and the strand. For example:
    >13_dna:chromosome_chromosome:GRCh38:13:32311910:32405865:1

  6. Click Configure this page in the side menu. Click Reset configuration. Click ✓.

Exploring a genomic region in mouse

Go to the Ensembl homepage.

  1. Go to the region from 150,320,000 to 150,540,000 bp on mouse chromosome 5. How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)?

  2. Zoom in on the Brca2 gene.

  3. Configure this page to turn on the LTR (repeat) track in this view. What tool was used to annotate the LTRs according to the track information? How many LTRs can you see within the Brca2 gene? Do any overlap exons?

  4. Create a Share link for this display. Email it to your neighbour. Open the link they sent you and compare. If there are differences, can you work out why?

  5. Export the genomic sequence of the region you are looking at in FASTA format.

  6. Turn off all tracks you added to the Region in detail page.

  1. Select Mouse from the Species search list and type 5:150320000-150540000 in the text box (or alternatively leave the Search drop-down list like it is and type mouse 5:150320000-150540000 in the text box). Click Go.

    It is made up of five contigs, indicated by the alternating light and dark blue coloured bars in the Contigs track. Note the tiny contig, AEKQ02165236.1, which splits AC084217.7 in two.

  2. Draw with your mouse a box encompassing the Brca2 transcripts. Click on Jump to region in the pop-up menu.

  3. Click Configure this page in the side menu (or on the cog wheel icon in the top left hand side of the bottom image). Go to Repeats in the left-hand menu then select LTRs (Repeats (Mouse)). Click on the (i) button to find out more information.

    Repeat Masker was used to annotate LTRs onto the genome.

    Save and close the new configuration by clicking on ✓ (or anywhere outside the pop-up window).

    There are seven LTRs overlapping Brca2, none of them overlap exons.

  4. Click Share this page in the side menu. Select the link and copy. Get your neighbour’s email address and compose an email to them, paste the link in and send the message. When you receive the link from them, open the email and click on your link. You should be able to view the page with the new configuration and data tracks they have added to in the Location tab. You might see differences where they specified a slightly different region to you, or where they have added different tracks.

  5. Click Export data in the side menu. Leave the default parameters as they are. Click Next>. Click on Text.

  6. Click Configure this page in the side menu. Click Reset configuration. Click ✓.

Genes and transcripts

Demo: Viewing genes and transcripts

You can find out lots of information about Ensembl genes and transcripts using the browser. If you’re already looking at a Region in detail view, you can click on any transcript and a pop-up menu will appear, allowing you to jump directly to that gene or transcript.

Alternatively, you can find a gene by searching for it. You can search for gene names, identifiers, or functions that might be associated with the genes.

We’re going to look at the lacZ gene Escherichia coli str. K-12 substr. MG1655 (GCA_000005845). From bacteria.ensembl.org, search for the Escherichia coli_ str. K-12 substr. MG1655 (GCA_000005845) genome. Type lacZ into the species-specific search bar and click the Go button.  
 
 

The gene tab

Click on the gene ID b0344 from the search hits. The Gene tab should open:

This page summarises the gene, including its location, name and equivalents in other databases. At the bottom of the page, a graphic shows a Region in detail view with the transcripts. We can also see the overlapping and neighbouring genes.

There are different tabs for different types of features, such as genes and transcripts. These appear side-by-side underneath the species name at the top of the page, allowing you to jump back and forth between features of interest. Each tab has its own navigation column down the left hand-side of the page, listing all the things you can see for this feature.

Gene sequence

Let’s walk through the menu for the Gene tab. Click Sequence in the left-hand panel to view the genomic sequence of the gene.

The sequence is shown in FASTA format. The FASTA header contains the genome assembly, chromosome, coordinates and strand (1 or -1). This gene is on the positive strand.

Exons are highlighted within the genomic sequence: the exon of our gene of interest and any neighbouring or overlapping genes. By default, 600 bases are shown up and downstream of the gene. We can make changes to how this sequence appears with the Configure this page button found at the left. This allows us to change the flanking regions, add line numbering and more. Click on it now.

We have changed our Flanking sequences to 200 and added Line numbering relative to the coordinate system. Save your setting by clicking the check icon at the top right-hand corner.

You can download this sequence by clicking in the Download sequence button above the sequence. This will open a dialogue box that allows you to pick between plain FASTA sequence, or sequence in rich-text format (RTF), which includes all the coloured annotations and can be opened in a word processor. If you want run a sequence analysis tool, download as FASTA sequence, whereas if you want to analyse the sequence visually, RTF is best for this. This button is available for all sequence views.

 
 
 

Gene function

To find out the protein function, have a look at gene ontology (GO) terms from the Gene Ontology consortium. There are three pages of GO terms, representing the three divisions: GO: Biological process (what the protein does)
GO: Cellular component (where the protein is)
GO: Molecular function (how it does it)

Click on GO: Biological process to see an example of the GO pages.

Here you can see the functions that have been associated with the gene. There are three-letter codes that indicate how the association was made, as well as links to the specific transcript they are linked to.  
 
 

Gene information in external databases

We also have links out to other databases which have information about our genes and may focus on other topics that we don’t cover, like the European Nucleotide Archive ENA) or the UniProt knowledge base UniProtKB. Go up the left-hand menu to External references:

 
 
 

The transcript tab

We’re now going to explore the transcript of lacZ. Click on Show transcript table underneath the gene summary at the top of the page.

Here we can see a list of all the transcripts of lacZ with their identifiers, lengths and biotypes. The lacZ gene only has one transcript. Click on the transcript ID AAC73447.

You are now in the Transcript tab for AAC73447. We can still see the gene tab so we can easily jump back. The left hand navigation column provides several options for the transcript AAC73447 - many of these are similar to the options you see in the gene tab, but not all of them. If you can’t find the thing you’re looking for, often the solution is to switch tabs.  
 
 

Transcript sequences

Click on the Exons link in the left-hand panel. This page is useful as it will give you the length of the coding sequence.

You may want to change the display (for example, to show more flanking sequences). In order to do so, click on Configure this page and change the display options accordingly.  
 
 

Transcript information in external databases

Next, follow the General identifiers link at the left. Just like the External References page in the Gene tab, this page shows links out to other databases such as InterPro, PDB, UniProtKB, and others, this time linked to the transcript or protein product, rather than the gene.

 
 
 

Protein domain information

If you’re interested in protein domains, you could click on Protein summary to view domains from different sources, such as SMART and PROSITE. These are all plotted against the transcript sequence.

Alternatively, you can go to Domains & features to see a table of the same information in a tabular format.

Exploring a bacterial gene in Clostridium sporogenes

Start in Ensembl Bacteria and select the Clostridium sporogenes (GCA_001444695) genome.

  1. What GO: biological process terms are associated with the PolC gene?

  2. Go to the transcript tab for the only transcript, OQP95999. How long is the transcript?

  3. What domains can be found in the protein product of this transcript? How many different domain prediction methods agree with each of these domains?

  1. From the Ensembl Bacteria homepage, select Clostridium sporogenes by beginning to write the species name and selecting the species from the auto-complete list. Type PolC and click on the gene ID VT92_0235670. Click on GO: biological process in the left-hand panel.

    There are two terms listed: GO:0006260, DNA replication and GO:0006261, DNA-templated DNA replication.

  2. Click on the transcript named OQP95999 or on the Transcript tab.

    OQP95999 is 4299 bp in length.

  3. Click on either Protein Summary or Domains & features in the left hand menu to see graphically or as a table respectively.

Exploring a gene in Escherichia coli

Start in Ensembl Bacteria and search for the Escherichia coli str. K-12 substr. MG1655 (GCA_000005845) genome.

  1. What GO: biological process terms are associated with the Era gene?

  2. How many different InterPro domains are found in the protein product of this gene?

  3. What is the associated UniProt ID of the transcript?

Enter part of the name into the genome search box (e.g. MG1655) and then select the correct genome to go to the species information page.

  1. Enter Era into the search box and hit Go. Click the link in the first hit to go to the era gene page. From here, click GO: Biological process in the left-hand menu.

    There are three GO IDs: GO:0000028, GO:0006468, GO:0042274 and GO:0046777.

  2. Click on the transcript ID AAC75619 in the transcript table on the Gene tab. In the Transcript tab, go to Domains & features in the left-hand panel. Count the number of unique InterPro IDs in the table.

    11 different InterPro domains are found in the protein product of Era.

  3. You can find the UniProt ID in the transcript table or under General identifiers in the left-hand panel.

    The UniProt ID is P60785.

Exploring a gene in Magnaporthe oryzae

We’re going to look at the gene ATG8 in Magnaporthe oryzae in Ensembl Fungi. This gene is involved in autophagy, and targeted silencing of this gene inhibits infection (you can find further info in Wilson and Talbot, Nature Reviews Microbiology volume 7, pages 185–195 (2009)).

  1. Find the genomic sequence of the M. oryzae ATG8 gene. How many exons does the gene have?

  2. What is the GO accession ID for this gene?

  3. How many transcripts does the gene have? Download the cDNA sequence in FASTA format to your computer.

  4. Are there any entries of the gene in external databases? If so, which ones?

  1. From the Ensembl Fungi homepage, type ATG8 into the Search for a gene search bar, click the drop-down menu and select Magnaporthe oryzae and click the Go button. Click on the gene ID **M_BR32EuGene_00029551, which will open the **Gene tab. Go to Sequence in the left-hand panel. _ATG8 exons are in red font.

    The ATG8 gene has 3 exons.

  2. Go to Ontologies: GO: Biological process in the left-hand panel.

    The accession ID is GO:0006914.

  3. Click on Show transcript table underneath the gene summary information at the top of the page.

    The gene has 1 transcript.

    Click on Sequence: cDNA in the left-hand panel. You can export the sequence by clicking the Download sequence button, which will open a pop-up menu. Select FASTA from the drop-down list and select cDNA only. You can download the sequence as is or, if you have a large sequence, you can download the compressed file.

  4. Click on External References: General identifies in the left-hand panel. You will find hyperlinks to entries in external databases under the Database identifier column.

    Yes, there are entries in the European Nucleotide Archive (ENA), INSDC, UniParc and UniProtKB.

Exploring a gene in Plasmodium falciparum

  1. Find the Plasmodium falciparum PF3D7_1145400 gene on Ensembl Protists. On which strand is this gene located? What are the coordinates of the gene?

  2. How long is its transcript (in bp)? How long is the protein it encodes? How many exons does it have?

  3. What is the Uniprot ID that maps to the translation of this transcript?

  4. What are the GO:Biological process(es) associated with PF3D7_1145400?

  1. Go to the Ensembl Protists homepage. Select Plasmodium falciparum from the species list and type PF3D7_1145400 in the search box. Click Go. Click on PF3D7_1145400 in the search results. You can find the strand orientation and coordinates in the gene Summary page.

    PF3D7_1145400 is located on the reverse strand of chromosome 11 between 1,800,544 and 1,803,550.

  2. Click on Show transcript table.

    The transcript is 2,514 base pairs and the length of the encoded protein is 837 amino acids.

    Click on the transcript ID CZT99117 in the transcript table.

    It has four exons.

  3. You can find this information in a number of places: the transcript table, External references on the Gene tab or General identifiers on the Transcript tab.

    The UniProt ID that maps to protein encoded by the PF3D7_1145400 transcript is Q8IHR4.

  4. Click on GO: Biological process in the side menu of the Gene tab.

    The PF3D7_1145400 gene is involved in mitochondrial fission.

Exploring the CCD7 gene in Arabidopsis thaliana

  1. Find the Arabidopsis thaliana CCD7 gene on Ensembl Plants. On which chromosome and which strand of the genome is this gene located?

  2. Where in the cell is the CCD7 protein located?

  3. What is the source of the assigned gene name?

  4. How many transcripts does it have? How long is its longest transcript (in bp)? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

  1. Go to the Ensembl Plants homepage (http://plants.ensembl.org/). Select A. thaliana from the species list and type CCD7 in the search box. Click Go and click on the gene ID AT2G44990. You can find the strand orientation and the location under Summary in the Gene tab.

    The A. thaliana CCD7 gene is located on chromosome 2 on the forward strand.

  2. Click on GO: Cellular component in the left-hand panel.

    The protein is located in the chloroplast and plastid.

  3. Click on Summary in the side menu.

    The gene name is assigned and imported from NCBI gene (formerly Entrezgene).

  4. Click on Show transcript table.

    There are 3 transcripts. The longest one is 2005 bp and the length of the encoded protein is 622 amino acids.

    Click on the transcript ID AT2G44990.3 in the transcript table. You can find the number of exons in under in the summary information at the top of the page.

    It has 6 exons.

    Click on Sequence: Exons in the left-hand panel.

    The first and last exons are partially untranslated (sequence shown in orange). This can also been seen from the fact that in the transcript diagrams on the Gene Summary and Transcript Summary pages the boxes representing the first and last exon are partially unfilled.

Exploring the MYH9 gene in human

  1. In Ensembl, find the human MYH9 (myosin, heavy chain 9, non-muscle) gene and open the Gene tab.
    • On which chromosome and which strand of the genome is this gene located?
    • How many transcripts (splice variants) are there and how many are protein coding?
    • What is the longest protein-coding transcript, and how long is the protein it encodes?
    • Which transcript would you take forward for further study?
  2. Click on Phenotypes at the left side of the page. Are there any diseases associated with this gene, according to Mendelian Inheritance in Man (MIM)?

  3. What are some functions of MYH9 according to the Gene Ontology (GO) consortium? Have a look at the GO: Biological process pages for this gene.

  4. In the transcript table, click on the transcript ID for MYH9-201, and go to the Transcript tab.
    • How many exons does it have?
    • Are any of the exons completely or partially untranslated?
    • Is there an associated sequence in UniProtKB/Swiss-Prot? Have a look at the General identifiers for this transcript.
  5. Are there microarray (oligo) probes that can be used to monitor ENST00000216181 expression?
  1. Select Human from the Species drop-down list and type MYH9. Click Go. Click on MYH9 (Human Gene) in the search results which will send you to the Gene tab.
    • The gene is located on chromosome 22 on the reverse strand.
    • Ensembl has 23 transcripts annotated for this gene, of which 6 are protein-coding.
    • The longest protein-coding transcript is MYH9-215 and it codes for a protein that is 1,981 amino acids long.
    • MYH9-201 is the transcript I would take forward for further study, as it is the MANE Select transcript (for a description, mouse-over the MANE Select flag in the transcript table).
  2. Click on Phenotypes in the left-hand panel to see the associated phenotypes. There is a large table of phenotypes. To see only the ones from MIM, type MIM into the filter box at the top right-hand corner of the table.

    These are some of the phenotypes associated with MYH9 according to MIM: Deafness, Autosomal dominant 17 and Macrothrombocytopenia and granulocyte inclusions with or without nephritis or sensorineural hearing loss. You can click on the records for more information.

  3. The Gene Ontology project maps terms to a protein in three classes: biological process, cellular component, and molecular function. Click on GO: Biological process on the left-hand panel. Angiogenesis, cell adhesion, and protein transport are some of the roles associated with MYH9. All GO terms are associated with a single transcript: ENST00000216181.

  4. Click on ENST00000216181.11 in the transcript table. You should now be on the Transcript tab.
    • It has 41 exons, shown in the Transcript summary.

    Click on the Exons link in the left-hand panel.

    • Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in orange). You can also see this in the cDNA view if you click on the cDNA link in the left side menu.

    Click on General identifiers in the left-hand panel.

    • P35579.247 from UniProt/Swiss-Prot matches the translation of the Ensembl transcript. Click on P35579.247 to go to UniProtKB, or click align for the alignment.
  5. Click on Oligo probes in the left-hand panel.

    Probesets from Affymetrix, Agilent, Codelink, Illumina, and Phalanx OneArray match to this transcript sequence. Expression analysis with any of these probesets would reveal information about the transcript. Hint: this information can sometimes be found in the [ArrayExpress Atlas] (https://www.ebi.ac.uk/biostudies/arrayexpress).

Exploring the Dpp6 gene in mouse

Genetic variation in the dipeptidylpeptidase 6 Gene (DPP6) in humans has previously been strongly associated with amyotrophic lateral sclerosis (ALS), a lethal disorder caused by progressive degeneration of motor neurons in the brain.

  1. Go to the Ensembl homepage, search for the Dpp6 gene in mouse and click on the transcript ID ENSMUST00000071500 to open the transcript tab. How many exons make up this transcript?

  2. Click on Exons to display the exon sequences of the transcript. Which exon contains the translation start? What is the exon ID of the largest exon? What is the start and end phase of exon 2?

  3. Go to the Protein summary. How many protein domains or features fall within the second exon? What is the Pfam protein domain at the C-terminus of the protein and how many exons does it fall into? Which amino acid positions does the domain above cover?

  4. Go to Domains and features. Which domains are associated with Pfam? How many genes in the mouse genome have the IPR002469 domain? What chromosomes are these genes found on?

  1. Select Mouse from the Species search drop-down and type Dpp6 and click Go. Click on Dpp6-201 (Mouse Transcript, Strain: reference (CL57BL6)) in the results.

    ENSMUST00000071500.13 consists of 26 exons.

  2. Click on Exons in the left-hand panel. The translation start is found in the first exon (ENSMUSE00000725552), shown in dark blue text.

    The largest exon is the final exon (856 bp), which has the exon ID ENSMUSE00000773588. Exon 2 has a start and end phase of 0 and 1 respectively, which means that the codon at the start of the exon starts at the first nucleotide and the codon at the end of the exon ends at nucleotide 2. Notice that the end phase of each exon is the same as the start phase of the next exon.

  3. Click on Protein summary in the menu on the left hand side of the page. Alternating exons are shown on the protein as different shades of purple.

    There are two predicted protein domains that fall within the second exon: a transmembrane helix and low complexity peptide sequence (Seg). You can click on the track names to find a description.

    Click on a domain or feature to view further information.

    The C-terminal Pfam domain is Peptidase_S9 (PF00326), which spans or partially spans seven exons, covering amino acid positions 582-787.

  4. Click on Domains & features.

    Looking at the domains table you should notice that there are two domains associated with Pfam: PF00326 and PF00930.

    Click on Display all genes with this domain next to IPR002469. This should now display the genes that have the IPR002469 domain located on the karyotype and as a table.

    6 genes have this domain and they are found on chromosomes 1, 2, 5, 9 and 17.

Variation

Demo: The gene tab

View all variants within a gene sequence

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for ADH4 in Saccharomyces cerevisiae R64-1-1. Search for ADH4 and go to the Sequence view. If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.  
 
 

View all variants within a gene in tabular format

To view all the sequence variations in table form, click the Variant table link at the left of the gene tab

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in.

You can also filter by other columns such as, SIFT or Source.  
 
 

Demo: The location tab

Visualise variants within a region

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar. Click Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. Turn on the Sequence variants (all sources) in Normal.

Click on a variant to find out more information. It may be easier to see the individual variants if you zoom in.  
 
 

Demo: The variant tab

Variant summary

Let’s have a look at a specific variant. The easiest way to find a specific variant is to search for it. Search for s07-16069 and click through to the Variant tab.

 
 
 

Variant consequences specific features

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link in the left-hand panel.

This variant is found in the YGL256W gene only.  
 
 

Genotype frequency

Let’s look at population genetics. Either click on Explore this variant in the left hand menu then click on the Genotype frequency icon, or click on Genotype frequency in the left-hand menu.

Individual sample genotypes

We can see which strains these genotypes were observed in by going to Sample Genotypes.

Variation data in Fusarium oxysporum

  1. How many species in Ensembl Fungi have variation data?

  2. Select Fusarium oxysporum (FO2) and search for the FOXG_13574T0 gene. One of its upstream variants is SNP tmp_10_6610. What are the possible alleles for this polymorphic position? Which one is on the reference genome?

  3. What is the most frequent allele at this position?

  4. Which samples have the genotypes C|T and T|T?

  1. Go to Ensembl Fungi, click on View full list of all species. You can sort the table by column. Click on the Variation database column to sort the table by species with variation data.

    The table shows that we have 8 fungi species currently with variation databases.

  2. Click on Fusarium oxysporum in the table and on the species page search for FOXG_13574T0. From the Gene tab, click on Variant table in the left-hand panel. You can use the filter at the top right-hand corner of the table tmp_10_6610.

    The alleles are C/T, where C is the reference allele.

  3. Click on tmp_10_6610 in the table to open the Variant tab. Then click on Genotype frequency from the menu on the left-hand side of the page.

    The most frequent allele at this position is C with a frequency of 0.850.

  4. Click on Sample genotypes in the menu on the left.

    The table shows that sample 909454 has the C|T genotype and 909455 has the T|T genotype.

Variation data in tomato

  1. Go to Ensembl Plants and find the Solyc02g084570.3 gene in Solanum lycopersicum (tomato) and go to its Location tab. Can you see the variation track?

  2. Zoom in around the last exon of this gene. What are the different types of variants seen in that region? Are any splice region variants mapped in the region? If so, what is/are the coordinate(s)?

  1. Select Solanum lycopersicum from the Species search drop-down menu and search for Solyc02g084570.3. In the results page, you can click on the coordinates 2:48284598-48288482 to go straight to the Location tab. Scroll down to the Region in detail view. The variation track is shown at the bottom of the view.

    If you don’t see the Variation - All sources track, click Configure this page on the left-hand panel, search for the track in the pop-up menu and enable the track by clicking on the square next to the track name. Close the pop-up window and wait for the track to load.

  2. Zoom in around the last exon of this gene by drawing a box in the respective region (you can change your mouse action by clicking the Drag/Select icons at the top right-hand corner of the view). Note the gene is on the reverse strand (this is signified by the < sign next to the transcript name, and it is located below the Contigs track), so the last exon will be on the left hand side of that image. The variation legend is shown at the bottom of the page, telling you what the colours mean.

    The types of variants seen in that region are 3’ UTR, missense, synonymous and splice region variants.

    Splice region variants are shown in orange. Click on the variants to get additional information on that variant including location. You can zoom into the region if the variant block is too small to click.

    The variants are found at 2:48285642 and 2:48285640-48285641. Note that the two variants overlap: one is a SNP and the other is an indel. SNPs are tagged with ambiguity codes (zoom into the region if you cannot see this). You can find a useful IUPAC ambiguity code guid on the bioinformatics.org website. Single-letter ambiguity codes are given when two or more possible nucleotides may be represented at a single base locus.

Human population genetics and phenotype data

The SNP rs1738074 in the 5’ UTR of the human TAGAP gene has been identified as a genetic risk factor for a few diseases. Use Ensembl to answer the following questions:

  1. In which transcripts is this SNP found?

  2. What is the least frequent genotype for this SNP in the Yoruba (YRI) population from the 1000 Genomes phase 3?

  3. What is the ancestral allele? Is it conserved in the 91 eutherian mammals EPO-Extended?

  4. With which diseases is this SNP associated? Are there any known risk (or associated) alleles?

  1. Please note there is more than one way to get this answer. Either go to the Variation table of the human TAGAP gene, and use the Consequence filter to only include 5’UTR variants, or search Ensembl for rs1738074 directly. Once you’re in the Variant tab, click on Genes and regulation in the menu.

    This SNP is found in four transcripts of TAGAP. It is also intronic to eleven non-coding transcripts of TAGAP-AS1 and one non-coding transcript of ENSG00000226032.

  2. Click on Population genetics in the left-hand panel, or click on Explore this variant in the left-hand panel and click the Population genetics icon.

    In Yoruba (YRI), the least frequent genotype is CC at the frequency of 5.6%.

  3. Click on Phylogenetic context in the left-hand panel.

    The ancestral allele is T and it’s inferred from the alignment in primates.

    Click on Select an alignment which will open a pop-up menu. Open Multiple alignments and select 91 eutherian mammals EPO-Extended. Click on Apply at the bottom of the menu to save your settings.

    A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence are displayed. The T allele is conserved in all but two of the eutherian mammals displayed.

  4. Click Phenotype data in the left-hand panel.

    This variation is associated with multiple sclerosis, celiac disease and white blood cell count. There are known risk alleles for all three diseases and the corresponding P values are provided. The allele A is associated with celiac disease. Note that the alleles reported by Ensembl are T/C. Ensembl reports alleles on the forward strand. This suggests that A was reported on the reverse strand in the original paper. Similarly, one of the alleles reported for Multiple sclerosis is G.

Exploring a SNP in mouse

In the paper “Altered metabolic signature in pre-diabetic NOD mice” (PloS One. 2012; 7(4): e35445), Madsen et al. have described several regulatory and coding SNPs, some of them in genes involved in ATP and adenosine metabolism, leading to potentially faulty metabolism of ATP and adenosine. The authors describe that one of the identified SNPs in the murine Entpd2 gene (rs28232063) would lead to increased amounts of available ATP, an immune activator, causing increased cell activation and possibly autoreactive T-cell activation. Use Ensembl to answer the following questions:

  1. Where is the SNP located (chromosome and coordinates)?

  2. What is the HGVS recommendation nomenclature for this SNP?

  3. Why does Ensembl put the G allele first (G/A)?

  4. Are there differences between the genotypes reported in C57BL/6NJ and NOD/ShiLtJ, according to the Mouse Genomes Project?

  1. From the Ensembl homepage, select Mouse from the Species search drop-down and enter rs28232063 in the search box.

    SNP rs28232063 is located on 2:25288362. In Ensembl, its alleles are provided relative to the forward strand.

  2. Click on Show under HGVS names to reveal information about HGVS nomenclature.

    This SNP has got four HGVS names, one at the genomic DNA level (NC_000068.8:g.25288362G>A), two at the transcript level (ENSMUST00000148859.2:n.444-182G>A and ENSMUST00000028328.3:c.446G>A) and one at the protein level (ENSMUSP00000028328.3:p.Arg149Gln).

  3. In Ensembl, the allele that is present in the reference genome assembly is always put first.

    G is the allele for the reference mouse genome strain C57BL/6J

  4. Click on Sample genotypes is the left-hand panel. The table shows genotypes reported for different mouse strains from the Mouse Genomes Project.

    There are indeed differences between the genotypes reported in those two different strains. The genotype reported in C57BL/6NJ is G/G whereas in NOD/ShiLtJ the genotype is A/A.

VEP

Demo: VEP

Input

We have identified 5 variants on Saccharomyces cerevisiae R64-1-1 chromosome VII:
T -> C at 3598
G -> C at 3929
T -> G at 5566
T -> A at 5727
A -> T at 7628

We will use Ensembl VEP to determine:

  • Have my variants already been annotated in Ensembl?
  • What genes are affected by my variants?
  • Are any of my variants missense variants?

Click on Tools in the navigation bar at the top of any Ensembl Fungi page, then click Variant Effect Predictor to open the input form:

Click on Add/remove species and search for Saccharomyces cerevisiae R64-1-1 to select it.

First, we need to convert our variants into one of the formats supported by VEP. You can find a list of input formats and examples in the VEP documentation page. Let’s put our data in VCF:
chromosome coordinate id reference alternative

Copy and paste the following data into the Input data box:

VII 3598 var1 T C
VII 3929 var2 G C
VII 5566 var3 T G
VII 5727 var4 T A
VII 7628 var5 A T

VEP will automatically detect that the data is in VCF.  
 
 

Additional configurations

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotations, Predictions, Filtering options and Advanced options. You can open all tabs to explore the different options.

Hover over the options to see definitions. When you’ve selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done, you do not need to refresh the page. You can edit or discard your job at this time. If you have submitted multiple jobs, they will all appear here.

Click View results once your job is done.  
 
 

Results

In your results you will see a graphical summary of your data, as well as a table of your results.

The results table is very detailed, so we’re going to go through the it by section. The first column is Uploaded variant. If your input data contains IDs, like ours does (i.e. var1, var2, etc.), the ID is listed here. If your input data is only loci, this column will contain the locus and alleles of the variant. You’ll notice that the variants are not neccessarily in the order they were in in your input. You’ll also see that there are multiple lines in the table for each variant, with each line representing one transcript or other feature the variant affects.

You can hover over any column name with your mouse to get a definition of what is shown.

The next few columns show information about the feature the variant affects, including the consequence. Where the feature is a transcript, you will see the gene symbol and stable ID and the transcript stable ID and biotype. The IDs are links to take you to the Gene or Transcript tabs.

This is followed by details on the effects on transcripts, including the position of the variant in terms of the exon number, cDNA, CDS and protein, the amino acid and codon change and pathogenicity scores. Where the variant is known, the ID of the existing variant is listed under the column Existing variant, with a link out to the Variant tab. The pathogenicity scores are shown as numbers with coloured highlights to indicate the prediction, and you can mouse-over the scores to get the prediction in words.

Above the table is the Filter option, which allows you to filter by any column in the table. You can select a column from the drop-down, then a logic option from the next drop-down, then type in your filter to the following box. We’ll try a filter of Consequence, followed by is then missense_variant, which will give us only variants that change the amino acid sequence of the protein. You’ll notice that as you type missense_variant, the VEP will make suggestions for an autocomplete.

You can export your VEP results in various formats, including VCF. When you export as VCF, you’ll get all the VEP annotation listed under CSQ in the INFO column of the VCF file. After filtering your data, you’ll see that you have the option to export only the filtered data.

VEP in Ensembl Bacteria

In Ensembl Bacteria the genome for Bacteroides fragilis 638R and launch the VEP tool. Use VEP to predict the effects of a 7 bp deletion of TCTACAA on the supercontig FQ312004 at the position 258140-258146. Use the results to answer the following questions:

  1. How many genes does the indel overlap? What are their gene symbols?

  2. What is the most common consequence of the variant?

  3. What is the most severe consequence? What gene does it affect and what does it do?

Type Bacteroides fragilis 638R into the species search box, then select the genome. You are now in the species information page. Click on Variant Effect Predictor at the bottom left. Next, you want to make sure your variant is in one of VEP’s supported variant formats. We have converted the variant into the Ensembl default VEP format. You can enter the following into the input box: FQ312004 258140 258146 TCTACAA/- +

Make sure you name your VEP job something descriptive, so it’s easier for you to find later on. Click Run to get the results.

  1. Find the number of overlapped genes in the Summary statistics. You can find their gene symbols under the Symbol column in the table below.

    The indel overlaps 14 different genes. 12 have the following symbols assigned to them: traA, traD, traE, traF, traG, traI, traJ, traK, traL, traM, traN and traO. 2 genes do not have a gene symbol.

  2. Sort the table by Consequence by clicking on the column name.

    The most common consequences are downstream_gene_variant (n=6) and upstream_gene_variant (n=6).

  3. You can find a list of calculated consequences sorted by severity in the [Ensembl Variation documentation](https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences).

    According to the calculated variant consequence list, the most severe consequence is frameshift_variant on the traI gene. The gene is a putative conjugative transposon protein traI.

VEP analysis of variants in Verticillium dahliae

We have identified four variants in Verticillium dahliae JR2 chromosome 5:

  • C->G at 698711
  • G->T at 698935
  • G->A at 700313
  • C->A at 701484

Use VEP in Ensembl Fungi to answer the following questions:

  1. Have these variants already been annotated in Ensembl?

  2. What genes are affected by the variants? What are their gene IDs?

  3. Are any of the variants predicted to be missense variants?

Go to any Ensembl Fungi page and click on Tools in the navigation bar at the top of the page. Click on Variant Effect Predictor and change your species to Verticillium dahliae JR2e by clicking on Change species.

Enter a descriptive name for your VEP job. You will need to convert your variants into one of VEP’s supported input formats. We have converted the variants into the Ensembl default format below. Paste the variants into Input data:.

5 698711 698711 C/G
5 698935 698935 G/T
5 700313 700313 G/A
5 701484 701484 C/A

Click Run at the bottom of the page. When your job is done, click View reesults.

  1. You can find the number of existing and novel variants in the Summary statistics of the results.

    4 variants were analysed, of which 3 are novel.

  2. You can also find the number of overlapped genes in the Summary statistics.

    4 genes are affected.

    Sort the table by Gene by clicking on the column name. Count the number of unique gene IDs.

    The gene IDs are: VDAG_JR2_Chr5g02150a, VDAG_JR2_Chr5g02160a, VDAG_JR2_Chr5g02170a and VDAG_JR2_Chr5g02171a.

  3. Filter the table as follows: Consequence is missense_variant.

    Yes, the third variant (5_700313_G/A) is predicted to have a missense effect on gene VDAG_JR2_Chr5g02170a.

Web VEP analysis of variants in Oryza sativa Japonica (rice)

You’ll find a VCF file here. This is a small subset of the outcome of Oryza sativa Japonica whole-genome sequencing and variant-calling experiment. Analyse the variants in this file with the VEP tool in Ensembl Plants and determine the following:

  1. How many genes and transcripts are affected by variants in this file?

  2. Do these variants result in a change in the proteins encoded by any of the Ensembl genes? Which genes are affected? What is the amino acid change?

Go to Ensembl Plants and click on Tools at the top of the page. Click on Variant Effect Predictor and select Oryza sativa Japonica Group from the Species menu.

Either click on Choose file and select the file to upload it, or directly paste the URL into the Or provide file URL: box. Click Run at the bottom of the page. When your job is done, click View results.

  1. The number of affected genes and transcripts is shown in the Summary statistics table at the top.

    8 genes and 8 transcripts are affected by these variants.

  2. Use the filters to view only missense variants. The filters are found above the detailed results table in the middle. Select Consequence and is from the drop-down menus. Then type missense_variant into the boxe. Add to apply your filter.

    1 variant is a missense variant. It causes a leucine to arginine (L/R) change in the gene OS09G0103500.

VEP analysis of FLG variants in human

Below, you will find a list of variants which have been reported in the FLG gene (ENSG00000143631), which codes for filaggrin. Mutations in the gene have been associated with atopic dermatitis.

  • C/A at 1: 152,302,797
  • G/A at 1: 152,307,085
  • C/T at 1: 152,310,208
  • G/A at 1: 152,308,234
  • C/T at 1: 152,313,454
  • C/T at 1: 152,309,920
  • A/G at 1: 152,309,268

Use the VEP tool in Ensembl and find out the following information:

  1. What consequences have been predicted for the variants?

  2. Are SIFT and PolyPhen predictions available for the variants? What are the scores?

  3. How many of these variants have been previously reported?

Go to the Ensembl homepage and click on the link VEP at the top of the page. Launch the VEP web interface. You will need to convert your variants into one of VEP’s supported input formats. We have converted the variants into the Ensembl default format below. Paste the variants into Input data:.

1  152302797 152302797 C/A
1  152307085 152307085 G/A
1  152310208 152310208 C/T
1  152308234 152308234 G/A
1  152313454 152313454 C/T
1  152309920 152309920 C/T
1  152309268 152309268 A/G

Enable SIFT and PolyPhen score predictions in the Predictions under Additional configurations and click Run. Once your job is done, click View results. You will retrieve a table of your VEP results. Note that there may be multiple entries in the table for each variant. VEP will give you all consequences for each feature the variant falls within.

  1. Filter your table to show variants affecting the FLG gene (ENSG00000143631) only. Under Filters select Gene from the drop-down menu and enter ENSG00000143631 in the text field. Click Add.

    Look for the Consequence column. This column will give you the sequence ontology (SO) terms (i.e. synonymous, missense, downstream, intronic, 5’ UTR, 3’ UTR, etc.) provided by VEP for the listed SNPs. You can read more about the Sequence Ontology project here. All 7 entries have been predicted to have missense consequences.

  2. Look for the SIFT and PolyPhen columns.

    Pathogenicty scores are available for all variants. Benign/tolerated scores are predicted for variants 1_152313454_C/T, 1_152310208_C/T and 1_152309920_C/T. For the other 4 variants, the SIFT and PolyPhen scores do not correspond.

  3. Look for the Existing variant column. If a variant has been previously described, a link will be available in this column.

    4 variants have been previously described: 1_152313454_C/T, 1_152308234_G/A, 1_152310208_C/T and 1_152309920_C/T. Variants 1_152302797_C/A, 1_152309268_A/G and 1_152307085_G/A are novel variants.

VEP analysis of structural variants in human

We have details of a genomic deletion in a breast cancer sample in VCF format:

13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738

Use VEP in Ensembl to find out the following information:

1.  How many genes have been affected?

2.  Does the structural variant (SV) cause deletion of any complete transcripts?

3.  Map your variant in the Ensembl browser on the Region in detail view.

  1. Click on VEP at the top of any Ensembl page and open the web interface. Make sure your species is Human. It is good practise to name your VEP jobs something descriptive, such as Patient deletion exercise. Paste the variant in VCF format into the Paste data field and hit Run.

    12 different genes are affected by the SV.

  2. Filter your table by select Consequence is transcript_ablation at the top of the table and click Add.

    Yes, there is deletion of complete transcripts of PDS5B, N4BP2L1, BRCA2, RNY1P4, IFIT1P1, ATP8A2P2, N4BP2L2, N4BP2L2-IT2 and one gene without official symbols: ENSG00000212293.

  3. To view your variant in the browser click on the location link in the results table 13: 32307062-32908738. The link will open the Region in detail view in a new tab. If you have given your data a name it will appear automatically in red. If not, you may need to Configure this page and add it under the Personal data tab in the pop-up menu.

Comparative genomics

Demo: gene trees and homology predictions

Fungal Compara

Gene trees

Let’s look at the homologues of Saccharomyces cerevisiae R64-1-1 TAZ1 (gene stable ID: YPR140W). This gene is involved in stress response and conserved across different taxonomic domains. Search for the gene and go to the Gene tab.

Click on Fungal Compara: Gene tree. This will display the current gene in the context of a phylogenetic tree used to determine orthologues and paralogues.

Funnels indicate collapsed nodes. Click on a node (coloured triangle) to open a menu. We can then see what type of node this is, some statistics and options to expand or export the sub-tree.

There are some quick filtering options below the image, where you can add paralogues, and quickly expand or collapse nodes.

You can download the tree in a variety of formats. From the pop-up above you can click to export the sub-tree (everything to the right of the node). Alternatively, click on the Export icon in the bar at the top of the image to get a pop-up where you can choose your format. You can preview this file before you download.

 
 
 

Homologues

We can look at homologues in the Orthologues and Paralogues pages, which can be accessed from the left-hand menu. If there are no orthologues or paralogues, then the name will be greyed out. Click on Fungal Compara: Orthologues to see the orthologues available. In the first table, you will find a summary of orthologues by taxonomic group:

In the second table, you will find orthologue details per species:

Scroll to the bottom of the page to see a list of the species that do not have any orthologues with TAZ1 in S. cerevisiae… there are a lot!

 
 
 

Pan-taxonomic Compara

S. cerevisiae is part of Pan-compara, which compares a subset of fungal species with species from other taxa, such as plants, bacteria and vertebrates. Click on Pan-taxonomic Compara: Orthologues. Let’s see if there are any orthologues of TAZ1 in plants. Click the Show details box for Plants.

Demo: Whole-genome alignments

Alignments in the Region in Detail view

Let’s look at some of the comparative genomics views in the Location tab. Go to the region 2:438432-458358 in Fusarium solani. This region includes a number of genes and we want to find out if any regions align with Fusarium verticillioides. To do this, we need to look at a pairwise alignment between the two Fusarium species. We can look at individual species comparative genomics tracks in this view by clicking on Configure this page. In the Comparative genomics section, turn on the F. verticillioides track in the normal format.

We can now see some pink alignments shown on the display. Alignments to the same chromosome are presented in a single row, and gaps in the alignment are shown by empty blocks. If there are alignments to multiple chromosomes in the aligned species, they are represented on different rows. From the track, we can see that the regions encoding genes NechaG90679, NechaG57836, NechaG26181, NechaG32512 and NechaG75968 align perfectly between the two genomes.

 
 
 

Sequence alignments

We can also look at the sequence alignment between the two species as text. Click on Comparative Genomics: Alignments (text) in the left-hand menu.

Click on Select an alignment to select a species you want to align. Let’s select F. solani and click Go. There are two blocks aligned of different lengths, some of which correspond to the region we just saw in the Region in detail view.

Click on Block 1.

You will see a list of aligned regions, followed by the sequence alignment. Click on Display full alignment. Exons are shown in red. Click on Configure this page on the left. In the pop-up menu, you can turn on the options to view Show conservation regions and Mark alignment start/end. This will add highlights where the sequence matches.  
 
 

Region comparison

To compare the two genomic regions visually, go to Region Comparison in the left-hand panel. To add species to this view, click on the Select species or regions button. Select F. verticillioides again then close the menu. This page, similar to the Region in detail view, shows the chromosome positions first. We can see the location of this alignment on chromosome 2 in F. solani. You can scroll down to the most detailed image to view aligned regions, which are highlighted and linked in green.

You can add data to both of these views with the same options you had in the Region in detail page. Click on Configure this page to open the menu.  
 
 

Synteny

We can view large-scale syntenic regions from our chromosome of interest. Click on Synteny in the left-hand panel. Black linking lines indicate sequences are oriented in the same directed, red linking lines indicate the sequences are inverted.

Pan-taxonomic comparative genomics data in Ensembl Bacteria

Bacillus subtilis subsp. subtilis str. 168 (GCA_000009045) is a model organism and often used in academic research and in the biotechnology industry as it can produce large amounts of important enzymes, like protease and amylase. It is part of Pan-taxonomic Compara in Ensembl Bacteria. We will use the sipT gene, a type I signal peptidase, as a reference to find the following information:

  1. Find the Ensembl gene tree ID. How many speciation and duplication nodes does it have?

  2. How many orthologues does B. subtilis str. 168 have? What type of orthologues are they?

  3. Does it have an orthologue in Escherichia coli str. K-12 substr. MG1655? If so, what is the gene ID and coordinate in E. coli?

  4. Export the protein alignment of the B. subtilis and E. coli orthologues. What are the different formats you can export the alignment as?

  1. Go to the Ensembl Bacteria homepage and enter Bacillus subtilis subsp. subtilis str. 168 (GCA_000009045) in the Species search bar. In the species information page, enter sipT. Click the gene ID BSU_14410. Under the Gene tab, click on Pan-taxonomic Compara: Gene Tree on the left.

    The Ensembl genetree ID is EGGT00050000013001. There are 92 speciation nodes and 35 duplication nodes.

  2. Go to Pan-taxonomic Compara: Orthologues on the left-hand panel. You can find the number and types of orthologues under the Summary of orthologues of this gene table.

    The B. subtilis str. 168 sipT gene has 56 1-to-many and 15 many-to-many orthologues.

  3. Filter the Selected orthologues table by entering _Escherichia coli_ str. K-12 substr. MG1655 in the search bar in the top right-hand corner of the table.

    Yes, an orthologue is present in E. coli str. K-12 substr. MG1655. The gene ID is b2568 and the its coordinate is 2,704,335-2,705,309.

  4. Click on View Sequence Alignment in the Orthologue column. Select View Protein Alignment from the pop-up menu. Click on the Download homology button.

    Depending on your downstream analyses you may choose to export the alignment in a particular format. In Ensembl, you can export the alignment in the following formats: ClustalW, FASTA, Mega, MSF, Nexus, OrthoXML, Pfam, Phylip, PhyloXML, PSI and Stockholm.

Orthologues of the Schizosaccharomyces pombe Mcm6 gene

Go to the Ensembl Fungi site to find out the following:

  1. How many orthologues and how many paralogues are predicted for the Schizosaccharomyces pombe Mcm6 gene?

  2. How many Schizosaccharomyces orthologues are there?

  3. Does it have a human orthologue? If so what type is it and what is the orthologue’s Ensembl ID?

  1. From the Ensembl Fungi homepage, select Schizosaccharomyces pombe from the Favourite genomes section and search for Mcm6 in the species information page. Alternatively, you can select S. pombe from the Species search drop-down and enter Mcm6. In the search results, click on the gene ID SPBC211.04c to open the Gene tab. Click on Fungal Compara: Orthologues in the left-hand menu to see all the orthologous genes. You can find the number of orthologues and paralogues of the gene in the summary information at the top of the page.

    Mcm6 has 387 orthologues and 4 paralogues.

  2. Use the filter in the top right-hand corner of the table to search for Schizosaccharomyces.

    There are 3 Schizosaccharomyces orthologues: Schizosaccharomyces cryophilus, Schizosaccharomyces japonicus and Schizosaccharomyces octosporus.

  3. Go to Pan-taxonomic Compara, hide the summary table and search for human in the Orthologues table.

    Yes, it is a 1:1 orthologue to the human FH gene (ENSG00000091483).

Finding orthologues and gene trees of the Arabidopsis thaliana FUM1 gene

The fumarase gene FUM1 in Arabidopsis thaliana encodes a protein with mitochondrial targeting information. Read more in this UniProt entry. Go to Ensembl Plants to answer the following questions:

  1. How many orthologues have been identified for this gene?

  2. Which orthologue has the highest sequence similarity? Look at the Query%ID and Target%ID.

  1. Go to Ensembl Plants, select Arabidopsis thaliana from the Favourite genomes section on the homepage. Search for FUM1. Click on the gene ID AT2G47510. Now click on Plant Compara: Orthologues on the left-hand panel to see all orthologues of this gene. You can find the number of orthologues in the summary information at the top of the page.

    FUM1 has 166 orthologues in Ensembl Plants.

  2. Click on the triangles in the table column headers to sort by identity. If you are unsure of what data the column is show, you can mouse-over the headers for a description.

    The orthologue with the highest sequence similarity is from Arabidopsis halleri.

Finding orthologous genes for a root transporter in Oryza sativa Japonica (rice)

Search Ensembl Plants for the gene Lsi1 in Oryza sativa Japonica (rice). This gene is known to code for an aquaporin transporter that facilitates the uptake of silicon and arsenic through the roots. Silicon concentration is highest in grass species, and is associated with defence.

  1. From the gene tab, go to the Orthologues page under Plant Compara. Which plant group has the highest number of 1-to-1 orthologues? Is it the same group that has the highest number of 1-to-many orthologues?

  2. Reduce the orthologues table to look only at Triticum aestivum (wheat) orthologues. Why are there three results for a 1-to-1 orthologue?

  3. Click on the Compare regions link for chromosome 6B region in wheat to go to the Location tab. Scroll to the bottom image. How do the gene models compare between the species? Do they have the same number of exons?

  4. Click back to the Gene tab and click on the Gene gain/loss tree page. Which species has the highest number of members of this gene family? Is it a grass? Can you change the view to see a radial tree?

Go to Ensembl Plants. Look for the main search box highlighted in green. Select Oryza sativa Japonica Group from the drop-down box and type in Lsi1. Click Go and click on the gene ID Os02g0745100.

  1. Go to Plant Compara: Orthologues on the left-hand panel.

    Liliopsida has 24 1-to-1 orthologues, the only group with 1-to-1 orthologues. This group is synonymous with Monocotyledon, so the group that contains the grasses. Eudicotyledons has the highest number of 1-to-many orthologues, indicating that this gene has been duplicated in the eudicots.

  2. Use the search box in the top right-hand corner of the Selected orthologues table and enter Triticum aestivum, the table should automatically filter.

    There are 3 results, one for each component (A,B,D). Note that these are considered 1-to-1 orthologues, rather than 1-to-many. This is because these genes arose in wheat by hybridisation (allopolyploidy), rather than duplication (autopolyploidy).

  3. Click on Compare regions (found in the 3rd column below the gene identifier) from the 2nd result for component 6B. This takes us to the Location tab. Scroll down to the bottom of the page.

    Both genes have 5 exons and the same structure. This looks unusual because the gene in rice is on the forward strand, while the gene in wheat is on the reverse strand. This is reflected in the crossing green links between the pink alignment blocks.

  4. Click on the Gene tab at the top of the page and click on Gene gain/loss tree in the left-hand panel.

    Significant expansions are shown with red branches, and the number of genes in the family shown in the count next to the image and species name. We can see that Brassica napus has 22 members in this group.

We can change the tree to radial view by clicking on the icon with two arrows at the top left of the image.

Orthologues and gene trees for the human BRAF gene

Go to Ensembl to answer the following questions:

  1. How many orthologues are predicted for the human BRAF in primates? How much sequence identity does the Carlito syrichta (tarsier) protein have to the human one? Can you tell which end of the BRAF protein is more conserved between these two species by looking at the orthologue alignment?

  2. Go to the Gene tree for this gene. View the Wasabi alignment of all the proteins in primates. Can you see a large gap in the alignment around position 450? Which species match the human sequence?

  1. From the Ensembl homepage, choose Human from the drop-down list and search for BRAF. Click through to the Gene tab view. Click on Orthologues at the left side of the page to see all the orthologous genes.

    There are 1:1 orthologues in 22 primates reported in the summary table.

    Search for Tarsier in the table below.

    The percentage of identical amino acids in the tarsier protein (the orthologue) compared with the gene of interest. i.e. human BRAF (the target species/gene) is 95.39%. This is known as the Target%id. The identity of the gene of interest (human BRAF) when compared with the orthologue (tarsier BRAF, the query species/gene) is 94.65% (the Query %id). Note the difference in the values of the Target and Query %id reflects the different protein lengths for the human and tarsier BRAF genes.

    Click on the View Sequence Alignments link in the Orthologue column to View Protein Alignment in Clustal W format.

    Conserved amino acids are indicated by asteriks. The alignment around the N-terminus looks poorer, when compared to the C-terminus end.

  2. Click on Gene tree in the left hand menu. All of the primates are enclosed in a lilac box. Click on the furthest left node in the box to get a pop-up labelled Primates. Alternatively, scroll to the bottom of the page, and select Order from Collapse all the nodes at the taxonomic rank. Primates will appear as a red triangle. Click on Wasabi viewer in the pop-up menu to see the alignment. Scroll to position 450.

    Greater bamboo lemur, mouse lemur, Sumatran orangutan, crab-eating macaque, olive baboon, Bolivian squirrel monkey, white-tufted-ear marmoset and Ma’s night monkey all match the human sequence.

Synteny

Start at Ensembl homepage.

  1. Find the rhodopsin (RHO) gene in human. Go to the Location tab and click Synteny at the left. Are there any syntenic regions in duck? If so, which chromosomes are shown in this view?

  2. Stay in the Synteny view. Is there a homologue in duck for human RHO? Are there more genes in this syntenic block with homologues? Which duck chromosome is this human genomic region syntenic to?

  1. Search for human RHO from the home page. Click on RHO genomic coordinates 3:129528639-129535344:1 in the search results to directly open the Location tab. Click Synteny at the left and change the species to Duck next to the image.

    Yes, there are multiple syntenic regions in duck to human chromosome 3, which is in the centre of this view. Duck chromosomes 1, 2, 7, 9, 13, and Z have syntenic regions to human chromosome 3.

  2. Scroll down to the bottom of the page to see a list of homologous genes on both genomes.

    Human RHO is homologous to RHO (ENSAPLG00000005189) in duck. Click 15 upstream genes or 15 downstream genes to view neighbouring genes in this syntenic block. There are many neighbouring genes with homologues in duck. Human genes in this region are homologous to duck genes on chromosome 13, which is also indicated by the red boxes in the image above as this genomic block on human chromosome 3 is syntenic to duck chromosome 13.

Exploring a genomic region between in mouse and human

You find a deletion in the mouse genome that severely affects development, and you want to investigate it. In Ensembl, go to this region (base pair coordinates 136197000-136283000 on mouse chromosome 4).

  1. How many protein coding transcripts does the gene in this region have?

  2. You want to see if any of these transcripts might be expressed in brain tissue. Turn on the RNASeq gene model for brain.

  3. Find this same view (Region in Detail) for the human orthologue of the mouse gene.

  4. To understand crucial variants, you want to investigate if there are any missense variants in the most diverse population available. Turn on the 1000 Genomes AFR (African population) track.

  1. Go to the Ensembl homepage, select Mouse from the Species search drop-down and type 4:136197000-136283000 in the text box. Alternatively, you can leave the Species search drop-down list like it is and type mouse 4:136462271-136562270 in the text box. Click Go.

    This region encodes the Luzp1 gene, which has six protein-coding transcripts (splice variants).

  2. Click Configure this page in the side menu (or on the cog wheel icon in the top left-hand side of the bottom image). Click RNASeq models in the left hand menu. Click on the box in the matrix for Brain: Gene model. Turn it on, using Expanded with labels. Close the menu.

    There is a gene model based on RNASeq evidence that matches to Luzp1-004.

  3. There are several ways to do this. One way is to click on a Luzp1 transcript, and in the pop-up menu, click on the gene ID ENSMUSG00000001089. This should bring you to the Gene tab for mouse Luzp1. In the left hand menu, click Orthologues. Find the human orthologue in the list, and click on its location 1:23410516-23504301:-1. You should be in the Region in detail view for human, showing Luzp1 transcripts.

  4. Click Configure this page in the side menu. Type 1000 in the Find a track text box. Select 1000 Genomes - AFR Save and close the new configuration by clicking on ✓ (or anywhere outside the pop-up window).

    There are many missense variants (coloured yellow) in these transcripts.

BioMart

Demo: BioMart

Follow these instructions to guide you through BioMart to answer the following queries:

  1. How many genes are found in Fusarium solani that do not have an orthologue in Fusarium oxysporum?
  2. How many of these are associated with a pathogenic phenotype of “reduced virulence”?
  3. Export the gene name, locations and GO terms associated with these genes
  4. Export their cDNA sequences  
     
     

Step 1: Choose database and dataset

Click on BioMart located on the navigation bar at the top of any fungi.ensembl.org page.

 
 
 

Step 2: Choose appropriate filters

We want to find all genes in F. solani that do not have an orthologue with F. oxysporum. We need to filter the dataset to look only at these genes.

Using the count function we can see that there are 6,137 F. solani genes (out of a total of 16,163) that do not have an orthologue in F. oxysporum. We also want to find out how many of these are associated with the PHI-base Pathogen phenotype ‘reduced virulence’.

We can now see from the count information that we have 3 genes associated with this phenotype that do not have any orthologues with F. oxysporum.  
 
 

Step 3.1: Select attributes (features)

Attributes are defined by what we would like to learn about the data. We want to find out more information about the genes:

  1. Gene name
  2. Locations
  3. Associated GO terms
  4. cDNA sequences

You can only select one attribute category at a time. We can answer points 1-3 in a single query, but we will need to do a second query to answer point 4. Make sure that Features category is selected at the top of the page. Expand the GENE tab and select the following features:

  • Gene Name
  • Chromosome/scaffold name
  • Gene start (bp)
  • Gene end (BP)

Next, expand the EXTERNAL tab. This section contains lots of identifiers from databases outside of Ensembl. Select the following features:

  • GO term accession
  • GO term name

 
 
 

Step 4: Get the results

You can download the data if you desire. The table presented shows a sub-sample of 10 results to enable you to check you have the correct attributes.

We can see that all of these genes are located in the same region. Perhaps this has something to do with why they are not found in F. oxysporum…? You can click on the location links and explore the synteny between the two species.  
 
 

Step 3.2: Select attributes (sequences)

To complete the fourth part of the demo, we need to export the cDNA sequences of our three genes. To do this, go back to the Attributes section in the left-hand panel and select Sequences from the attributes categories at the top of the page.

Also expand the HEADER INFORMATION tab and select Gene name. Click Results in the left-hand panel to see your sequences.

 
 
 

For more details on BioMart, have a look at this publication: Kinsella RJ, Kähäri A, Haider S, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database : the Journal of Biological Databases and Curation. 2011 ;2011:bar030. DOI: 10.1093/database/bar030. PMID: 21785142; PMCID: PMC3170168.

Ensembl Fungi: variation data in the Saccharomyces cerevisiae mitochondrial genome

In Ensembl Fungi, retrieve the variation ID, chromosome name, position in bp and alleles for all sequence variants in the mtDNA. Export the results as an XLS table and have the results sent to you by email.

Go to BioMart. You can find a shortcut on the navigation bar at the top of any Ensembl Fungi page.

  1. Choose your dataset: select Variation and then Saccharomyces cerevisiae short variants from the drop-down menus.

  2. Open Filters in the left-hand panel, expand REGION then select Chromosome/scaffold: Mito.

  3. Open Attributes in the left-hand panel, expand VARIANT ASSOCIATED INFORMATION, then select Variant Alleles in addition to the pre-selected attributes.

  4. Hit Results in the top left-hand corner to see a preview table. At the top of the results page, select Export all results to, choose Compressed web file (notify by email) then choose xls as the file type. Input your email address then hit Go. You will receive your results by email.

Ensembl Protists: exporting homologues with BioMart

Go to Ensembl Protists. For a list of , export the Find Leishmania major orthologues for these Trypanosoma brucei genes: Tb927.3.3470, Tb927.3.3520, Tb11.01.7770, Tb927.8.5110, Tb927.1.1420

  1. Go to BioMart, select the Ensembl Protists Genes database and choose the Trypanosoma brucei genes dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Enter the gene list in the Input external references ID list box. Gene stable ID(s) should be preselected.

  3. Click on Attributes in the left panel. Select the Homologues attributes at the top of the page. Expand the GENE section. Deselect Gene stable ID version, Transcript stable ID and Transcript stable ID version. Expand the ORTHOLOGUES [K-O] section and select Leishmania major Gene ID.

  4. Click Results. Select View: All rows as HTML to open the entire table in a new tab. If you prefer, you can also export as a CSV, TSV or XLS file by using the Export all results to option.

Ensembl Plants: finding genes by protein domain

One class of disease resistance (R) genes in plants are the TIR-NBS-LRR genes, that code for proteins that contain an N-terminal Toll/Interleukin receptor homology region (TIR), a nucleotide binding site (NBS) and a C-terminal leucine rich repeat (LRR). TIR-NBS-LRR genes are common in dicots but seem to be rare in monocots (Tarr and Alexander. TIR-NBS-LRR genes are rare in monocots: evidence from diverse monocot orders. BMC Res Notes 2009 Sep 8;2:197).

The ID for the TIR domain in the Pfam (protein family) database is PF01582.

Use BioMart in Ensembl Plants to generate a list of all Solanum tuberosum (potato; a dicot) genes that are annotated to contain a TIR domain. Include the Ensembl stable ID and gene description. Do the same for Zea mays (maize; a monocot).

Do your results confirm the findings of Tarr and Alexander?

  1. Go to Ensembl Plants and click on the link Tools at the top of the page. Click on BioMart. Choose the Ensembl Plants Genes database. Choose the Solanum tuberosum genes dataset.

  2. Now, filter for the genes containing a TIR domain: Click on Filters in the left panel. Expand the PROTEIN DOMAINS AND FAMILIES section. Select Limit to genes with these family or domain IDs and enter PF01582 in the box. Select Pfam ID(s) (e.g. PF00004) from the drop-down menu. Click on Count in the toolbar.

    This should give you 78 / 40336 genes.

  3. Specify the attributes to be included in the output (note that a number of attributes will already be selected by default). Click on Attributes in the left panel. Expand the GENE section. Deselect Transcript stable ID. Select Gene description.

  4. Now click on Results. The first 10 results are displayed by default. You can display all results by selecting View: All rows from the drop-down menu. If you prefer, you can also export as a CSV, TSV or XLS file by using the Export all results to option.

    Repeat the above for the Z. mays genes (B73 RefGen_v4) dataset.

    Your results should show 3 / 44303 genes containing a TIR domain for maize. The results confirm the findings of Tarr and Alexander.

Mapping Uniprot IDs in Ensembl Plants

BioMart is a very handy tool when you want to map between different databases. The following is a list of IDs from the UniProtKB/Swiss-Prot database of Arabidopsis thaliana proteins that are supposedly involved in flavonoid metabolism: P42813, Q9LS08, Q9ZST4, Q9SYM2, P51102, Q9LPV9, Q9FE25, Q96323, Q9FKW3, P13114, P41088, Q9S818, Q96330, O22203, Q39224, O22264, Q9SD85, Q9LYT3, Q9FJA2, Q43128, P43254, O04153, Q43125, Q9S9P6, Q94C57, Q9LNE6, Q9FK25, Q9SYM5, Q9ZQ95

Using BioMart in Ensembl Plants a list that shows to which Ensembl Gene IDs these UniProtKB/Swiss-Prot IDs map. Also include the gene name and description.

  1. Go to BioMart in Ensembl Plants. Click the New button on the toolbar in the top left-hand corner to start a new query. Choose the Ensembl Plants Genes database. Choose the Arabidopsis thaliana genes dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Select ID list limit – UniProt/Swissprot ID(s). Enter the list of IDs in the text box (either comma separated or as a list).

  3. Click on Attributes in the left panel. Expand the GENE section. Deselect Transcript Stable ID. Select Gene name and Gene description. Expand the EXTERNAL section. Select UniProtKB/SwissProt ID(s).

  4. Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file. Tick the box Unique results only.

    Your results should show 28 / 32833 genes.

Convert IDs using BioMart

BioMart is a very handy tool when you want to convert IDs from different databases. The following is a list of 29 IDs of human proteins from the NCBI RefSeq database:
NP_001218, NP_203125, NP_203124, NP_203126, NP_001007233, NP_150636, NP_150635, NP_001214, NP_150637, NP_150634, NP_150649, NP_001216, NP_116787, NP_001217, NP_127463, NP_001220, NP_004338, NP_004337, NP_116786, NP_036246, NP_116756, NP_116759, NP_001221, NP_203519, NP_001073594, NP_001219, NP_001073593, NP_203520, NP_203522

Use BioMart in Ensembl to generate a list that shows to which Ensembl gene IDs and to which gene names these RefSeq IDs correspond. Do these 29 transcripts correspond to 29 genes?

  1. Go to BioMart. You can find a shortcut to the tool on any Ensembl page in the navigation bar at the top of the page. Click New in the top left-hand menu if you need to start a new query. Choose the Ensembl Genes database. Choose the Human genes dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Select Input external references ID list - RefSeq peptide ID(s) and enter the list of IDs in the text box (either comma separated or as a list).

HINT: You may have to scroll down the menu to see these.

Count shows 10 genes (remember one gene may have multiple splice variants coding for different proteins, that is the reason why these 29 proteins do not correspond to 29 genes).

  1. Click on Attributes in the left panel. Select the Features attributes page. Expand the External section. Select HGNC symbol and RefSeq Peptide ID from the External References section.

  2. Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file.

Exporting homologues with BioMart

Go to Ensembl’s BioMart. For a list of Ciona savignyi Ensembl genes, export the human orthologues:
ENSCSAVG00000000002, ENSCSAVG00000000003, ENSCSAVG00000000006, ENSCSAVG00000000007, ENSCSAVG00000000009, ENSCSAVG00000000011

Do all of these genes have a homologue in human?

  1. Go to BioMart (you can find a shortcut in the navigation bar at the top of any Ensemblpage) and click New. Choose the Ensembl Genes database. Choose the Ciona savignyi genes (CSAV 2.0) dataset.

  2. Click on Filters in the left panel. Expand the GENE. Enter the gene list in the Input external references ID list box. Gene stable ID(s) should be preselected.

  3. Click on Attributes in the left panel. Select the Homologues attributes at the top of the page. Expand the GENE section. Deselect Gene stable ID version, Transcript stable ID and Transcript stable ID version. Expand the ORTHOLOGUES [F-J] section. Select Human gene stable ID.

  4. Click Results. Select View: All rows as HTML.

    All but ENSCSAVG00000000006 have a homologue in human.