Ensembl TrainingEnsembl Home

Ensembl session at From Specimens to Genomes course

Course Details

Lead Trainer
Emily Perry
Event Date
2021-10-15
Location
  EBI
Description
Explore the Ensembl Rapid Release site to see genomes of newly-annotated species.

Materials

CC-BY 4.0 logo

Demos and exercises

Exploring new genomes with Ensembl Rapid Release

Newly annotated genomes are all added to Ensembl Rapid Release. These genomes have minimal annotation, with genes mapped onto the genome, InterProScan protein domain analysis and BLAST indexing. There is no BioMart and no comparative genomics analysis. Since gene name and function is usually inferred from orthologues in better annotated species, this is not available for species in Rapid Release.

From the Rapid Release homepage, click on View and download available data for all species.

Here we have a list of all the species available in Rapid Release, including latin and common names, genome accession and annotation source. There are also links out to the FTP site where you can download whole genome flatfiles. There are annotation files with the loci and sequences of the genes, the whole genome sequence and BAM files. The BAM files are the RNA-seq data that was used to annotate the genes.

Click on BAM for Ailuropoda melanoleuca (Giant panda). In many internet browsers, this will try to open in an FTP client, so you may need to change the ftp at the beginning of the URL to http to open the site.

Here you can see the BAM files (.bam), BAM index (.bam.bai) and read coverage wiggle files (.bam.bw) for each of the tissues that were sampled for RNA-seq.

Go back to the species list and click on the panda latin name Ailuropoda melanoleuca.

Here you can see links to example data and statistics about the genome. Let’s search for a gene: ENSAMEG00000015675.

The gene page has a summary of the gene information, including a graphic and table of the transcripts.

There is a limited amount of annotation attached to the gene. There are GO terms, which were assigned based on protein domains from sequence analysis. Click on any of the GO categories to see the terms.

Let’s explore the Location tab by clicking on Location: 1:511,592-546,674 at the top left. You can view RNA-seq data across this locus by clicking on Configure this page. This menu shows us all the possible tracks you can view. Click on RNA-seq models in the left-hand menu.

You can use the matrix to add data by cell type. The cell types are listed down the side, with data type along the top. Click on the boxes to add the data individually, or hover over the keys at the top or side to get the option to select all. Hover over Gene models then pick Select all to see all RNA-seq based gene models in different cell types, which were used to annotate genes at this region

One way we can find out more about the gene by BLAST-ing the sequence against a more well annotated gene. We need to export the sequence first. First go the gene tab, then expand Show transcript table. Click on the transcript ENSAMET00000049361.1. This will take you to the transcript tab, where you can click on Protein to get the protein sequence. Copy the sequence and launch www.ensembl.org, then go to BLAST/BLAT.

Paste in the Panda sequence, then search against the Protein database and hit Run. Go to the results when the job is done.

The top hit is to human SMC3, which we can assume is the orthologue of the Panda gene. Click on the gene to go to the gene tab. You can find significant information about the human gene from the gene tab, much of which probably applies to the Panda gene too, for example Phenotypes associated with the human gene, GO terms linked to the human gene, External references to other databases, tissue-based Gene Expression from Expression Atlas and biochemical Pathways the protein is involved in from Reactome.

Investigate data in Ensembl Rapid Release

You’re interested in finding the gene UQCRQ in Wild Bactrian camel, which is only available in Ensembl Rapid Release.

(a) Search for UQCRQ in human on ensembl.org and export the protein sequence of the canonical transcript.

(b) Use BLAST on rapid.ensembl.org to search for the human UQCRQ sequence in the Wild Bactrian camel protein database. How many hits do you get? Based on these results, what can you say about the copy number of this gene in Wild Bactrian camel?

(c) Go to the location of the top BLAST hit and turn on tracks for RNA-seq gene models. In what cell types is there evidence of expression of this gene?

(a) Start at ensembl.org and search for UQCRQ. Go to the top search result, ENSG00000164405.

Expand the transcript table by clicking on Show transcript table, then select the canonical transcript ENST00000378670.8 and go to Protein to select the protein sequence.

(b) Open rapid.ensembl.org and go to BLAST. Paste the protein sequence into the box.

In the Search against box, click on the cross to remove whatever species is there, then select Add/remove species. Type Wild Bactrian camel into the search box then select Camelus ferus (Wild Bactrian camel) - GCA_009834535.1 reference when it appears, then hit Apply.

Select Protein database, then hit Run. When your job has completed, click on View results.

You should have two hits, to ENSCFEP00005028030 and ENSCFEP00005005840. Both hits are full length and highly scoring with low e-values. This suggests that there was a gene duplication in Wild Bactrian camel.

(c) Click on 3:92177026-92177937 to go to the location of the top hit. You will see the BLAST hit in the region, mapping to the two coding exons of the gene.

Click on Configure this page, then go into the RNASeq models section. Hover over Gene models then select All. Close the menu.

You will see gene models in all the cell types listed, indicating expression.