
Exploring the new Ensembl Genome Browser to retrieve genomic data – ISCB-LATAM SolBio CCBCOL 2024

Course Details

Lead Trainer
Louisse Paola Mirabueno
Associate Trainer(s)
Event Date
2024-11-07
Location
  Virtual
Description
This online tutorial is an introduction to the new Ensembl genome browser (beta.ensembl.org), which presents genomic features for approximately 3,000 species with a new user interface, including visualisation of genes and variants on the Human Pangenome Reference Consortium (HPRC) assemblies. Participants will have the opportunity to learn about the range of data currently available through the new platform, gain hands-on experience in navigating the new Ensembl website to retrieve and interpret data and think about how this data might be informative for their research.
Survey
 Exploring the new Ensembl Genome Browser to retrieve genomic data – ISCB-LATAM SolBio CCBCOL 2024 Feedback Survey

Demos and exercises

Overview of the current Ensembl genome browser

Overview of Ensembl Beta

Exploring gene builds in Ensembl Beta

Exploring the MYH9 gene in Human

  1. In Ensembl Beta, find the MYH9 gene in the human GRCh38 (hg38) genome assembly.
    • On which chromosome and which strand of the genome is this gene located?
    • How many transcripts (splice variants) are there and how many of these are protein-coding?
    • What’s the definition of MANE Select? How many coding exons does the MYH9 MANE Select transcript have?
    • What sequences are available to download?
  2. Let’s explore the protein information that is available for the MANE Select transcript. Open the Entity Viewer.
    • How long is the protein sequence (in aa) and what is the Ensembl protein ID?
    • Open the Protein information panel under the ‘Gene function’ tab. What protein domain annotations are available for this transcript?
    • What is the protein function according to UniProtKB?
  3. We’re going to explore homologues of the MYH9 gene. Open the ‘Gene relationships’ tab.
    • Which species has the most similar sequence in terms of protein similarity? What is the corresponding gene ID?
    • How many transcripts does the homologue have?
  1. Open the Species selector and enter human in the search box or click on the Human icon in the species list at the bottom of the page. Select to add the reference (GRCh38.p14) assembly and click on the green Add button. You should now see your selected species at the top of the page. Click on Find gene next to the species name, enter MYH9 in the search bar and click Go. In the results, click on MYH9 ENSG00000100345.23 and select Genome Browser. Click on the gene in the genome browser or click on the three dots (…) next to MYH9 protein_coding in the track panel on the right.
    • The gene is located on chromosome 22 on the reverse strand.

    In the track panel, click on +22 transcripts to expand the list of transcripts.

    • MYH9 has 23 transcripts. 6 of these are protein_coding (this includes the MANE Select transcript) and 3 are defined as protein_coding_CDS_not_defined (a transcript that belongs to a protein_coding gene and does not contain an open reading frame).

    In the track panel, click on the three dots (…) next to the MANE Select transcript to open the transcript panel and find out more details.

    • The MANE Select is a default transcript per human gene that is representative of biology, well supported, expressed and highly conserved. The MANE Select transcript has 40 coding exons.

    In the transcript panel, expand the Download option.

    • You can download the genomic sequence and exons of the gene, and the genomic sequence, cDNA (the sequence of the spliced exons of a transcript expressed in DNA notation – T rather than U – representing the coding or sense strand), exons, protein sequence and coding sequence (CDS) of the transcript.
  2. To open the Entity Viewer, scroll to the bottom of the page and click on the Entity Viewer icon. Alternatively, if you are in the genome browser view, click on the first MYH9 transcript and click on View in Entity Viewer in the pop-up menu. Once in the Entity Viewer, click on the MANE Select transcript (this will open a grey information panel underneath).
    • The protein is 1,960 aa long. The Ensembl protein ID is ENSP00000216181.6.

    In the Entity Viewer, switch to the Gene function tab panel, click on the three dots (…) next to the MANE Select transcript to open the transcript panel and find out more details.

    • Protein domains are distinct functional and/or structural units in a protein, which are usually responsible for a particular function or interaction, contributing to the overall role of a protein. There are several methods and algorithms that can be used to classify protein domains into families and functional sites. For the ENSP00000216181.6 protein, PANTHER and Pfam annotations are available.

    In the Gene function tab, click on the UniProtKB/Swiss-Prot P35579 link to open the corresponding entry in UniProt.

    • According to UniProtKB, the function of the protein is as follows: Cellular myosin that appears to play a role in cytokinesis, cell shape, and specialized functions such as secretion and capping. Required for cortical actin clearance prior to oocyte exocytosis; promotes cell motility in conjunction with S100A4; and during cell spreading, plays an important role in cytoskeleton reorganization, focal contact formation (in the margins but not the central part of spreading cells), and lamellipodial retraction; this function is mechanically antagonized by MYH10.
  3. In the Entity Viewer, switch to the Gene relationships tab to view any homologues of the MYH9 gene. You can filter the homology table by header. The % protein similarity is the percentage of identical amino acid residues aligned against each other. Click on the % Protein similarity header once to sort the table in descending order.
    • The Pan troglodytes (Chimpanzee) homologue is the most similar in terms of protein similarity. The gene ID of the homologue is ENSPTRG00000014309.6.

    Click on the gene ID ENSPTRG00000014309.6 to open the corresponding homologue in the Entity Viewer. Count the number of transcripts you can see under the Transcripts tab.

    • The chimp homologue has 4 transcripts.
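
The % protein similarity described in step 3 can be illustrated with a minimal Python sketch. This is not how Ensembl computes the value in the homology table; it only shows the idea of counting identical residues in an alignment. The function name, the toy peptides (invented, not MYH9 sequence) and the choice to skip gapped columns are all assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of compared alignment columns where both residues are identical.

    Assumption: columns with a gap ('-') in either sequence are skipped,
    so the denominator is the number of residue-to-residue columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped column, not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Invented toy alignment: 10 compared columns, 8 of them identical.
toy_a = "MKTLLVA-GSW"
toy_b = "MKSLLIAAGSW"
print(f"{percent_identity(toy_a, toy_b):.1f}%")  # prints 80.0%
```

A different denominator (for example the full length of one of the proteins) would give a different percentage, so values from a sketch like this are not expected to match the table exactly.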

Visualising a genomic region in human

Go to Ensembl Beta.

  1. Navigate to the region from 176,092,720 to 176,190,907 on the human GRCh38 assembly on chromosome 2. This region contains the HOXD gene cluster.
    • How many HOXD genes can you find in this region?
    • On a new browser tab, search for the gene cluster in GeneCards. What diseases are associated with the gene cluster? (Hint: you can search for gene clusters by adding an ‘@’ sign to the gene name, i.e. HOXD@).
  2. Navigate to the HOXD1 gene. Let’s explore any overlapping regulatory features.
    • Does this gene overlap any regulatory features? What type of regulatory element is this/are they?
    • Downstream of the regulatory feature, you should see a neighbouring regulatory element coloured purple. What is this regulatory feature? How does it differ from an enhancer or promoter?
  3. Let’s explore variants found within the HOXD1 gene.
    • What groups of variants can you find in the HOXD1 region?
    • Zoom in on the first yellow-coloured variant. What type of variant is this and what are the alleles?
    • What is the source of the variant? Can you provide the variant location in Variant Call Format (VCF)?
  1. In the Genome Browser app, click on the Navigate browser image icon in the track panel on the right. Expand Go to new location, select Chr 2, enter 176,092,720 in Start and 176,190,907 in End. Click on the green Go button. Count the number of HOXD genes in the genome browser on the left.

    There are 9 HOXD genes on the forward and 1 HOXD gene on the reverse strand in this region.

    Go to the GeneCards entry for the HOXD cluster. You can find diseases associated with the cluster in the Summaries for HOXD@ Gene section.

    • The HOXD gene cluster is associated with synpolydactyly and clubfoot.
  2. In the genome browser, click on the HOXD1 (ENSG00000128645.15) gene and select Genome Browser in the pop-up menu. Scroll down to find the Regulatory track, which is indicated with the letter R on the far left. Click on the regulatory feature to see what type it is. Alternatively, you can find a legend in the track panel under the Regulation tab on the right.

    The gene overlaps 1 regulatory feature: the promoter ENSR00000629037.

    In the Regulation (R) track in the genome browser, click on the purple regulatory feature. The pop-up menu will tell you what type of regulatory feature this is. To find a description of the feature, go to the Regulation tab in the track panel on the right. Click on the three dots (…) next to TF binding.

    • This regulatory feature is a TF binding site. These are sites that are enriched for Transcription Factor (TF) binding, but they lack epigenomic evidence to be classified as an enhancer or promoter.
  3. In the genome browser, scroll down to the Variation (V) track. In the track, you should see green, yellow and salmon coloured variants. You can find a colour legend in the Variation tab in the track panel on the right.

    The HOXD1 region overlaps various transcript, splicing and protein altering variants.

    Using your mouse, zoom in on the first yellow-coloured variant. Click on the first yellow variant to find out more information about it.

    • This variant (rs1691721387) is a single nucleotide variant (SNV). The reference allele is C and the alternative allele is T.

    In the pop-up menu, select View in Entity Viewer to find out more information about the particular variant. In the track panel on the right, you can find the source of the variant (at the top) and the variant location in VCF (at the bottom).

    • This rs1691721387 variant was imported from dbSNP (release 156). The VCF is as follows: 2 176188810 rs1691721387 C T.
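
The VCF answer above lists the first five columns of a VCF record: chromosome (CHROM), position (POS), identifier (ID), reference allele (REF) and alternative allele (ALT). The following Python sketch splits that line into named fields and checks that it describes an SNV. The class and function names are hypothetical, and a full VCF record would also carry QUAL, FILTER and INFO columns, which are ignored here.

```python
from typing import NamedTuple


class VcfSite(NamedTuple):
    chrom: str
    pos: int          # 1-based position of the first reference base
    variant_id: str
    ref: str
    alt: str


def parse_vcf_site(text: str) -> VcfSite:
    """Parse the first five VCF columns (CHROM POS ID REF ALT).

    split() with no argument accepts the tabs used in VCF files as well as
    the spaces shown in the answer above.
    """
    chrom, pos, variant_id, ref, alt = text.split()[:5]
    return VcfSite(chrom, int(pos), variant_id, ref, alt)


def is_snv(site: VcfSite) -> bool:
    """True when REF and ALT are two different single bases."""
    bases = set("ACGT")
    return site.ref in bases and site.alt in bases and site.ref != site.alt


site = parse_vcf_site("2 176188810 rs1691721387 C T")
print(site)
print("SNV:", is_snv(site))  # SNV: True
```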

Exploring genes in Bread wheat and its cultivars

  1. In Ensembl Beta, search for the TraesCS2D02G248400 gene in the Triticum aestivum (Bread wheat) IWGSC assembly.
    • How many transcripts are there?
    • What is the definition of ‘Ensembl canonical’? What does this gene do in Bread wheat?
  2. Align the protein sequence of the Ensembl canonical transcript to cultivars Cadenza, Julius and Paragon.
    • How many hits are found in each of the cultivars?
    • Download your BLAST results. What information can you find in the files?
    • In the BLAST alignment file, what sequence does Query refer to? What sequence does Sbjct refer to?
  1. Open the Species selector and enter bread wheat in the search box or click on the Bread wheat icon in the species list at the bottom of the page. Select to add the IWGSC assembly and click on the green Add button. You should now see your selected species at the top of the page. Click on Find gene next to the species name, enter TraesCS2D02G248400 in the search bar and click Go. In the results, click on TraesCS2D02G248400 and select View in Genome Browser. You can find the number of transcripts in the genome browser view on the left, or the track panel on the right.

    TraesCS2D02G248400 has 2 transcripts.

    Click on the three dots (…) next to the first transcript (TraesCS2D02G248400.2) in the track panel on the right. You can click on the question mark (?) next to Ensembl canonical to find a description.

    • The Ensembl canonical transcript is a single, representative transcript identified at every locus. The gene codes for an oxygen evolving enhancer protein.
  2. Stay in the transcript panel. Expand the Sequences option, select Protein sequence on the right and click on Blast whole sequence. This will take you to the BLAST app. Click on the blue Select species button and select the cultivars Cadenza, Julius and Paragon in Add species. In the BLAST app, click on the green Run button. Click on the blue Results button in the upper right-hand corner to view your results.

    Cadenza has 3, Julius 10 and Paragon 3 hits.

    Click on the Download icon next to the blue Results button in the upper right-hand corner. This will download your results in a compressed folder. Uncompress the folder and open the files using a plain text editor.

    The folder contains 2 files: the output.txt file contains the BLAST alignments, including details about the sequence, alignment scores and statistics, and the sequence alignment. The table.tsv file contains the metadata of your BLAST query, including any BLAST parameters and the results table you saw in the browser.

    Open the output.txt file.

    The Query (top sequence) refers to our sequence of interest (i.e. the protein in the IWGSC genome). The Sbjct (bottom sequence) refers to the homologue (the protein in the cultivar genome).
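
To make the Query and Sbjct rows easier to work with, the Python sketch below stitches them back into two aligned strings. It assumes that output.txt follows the usual pairwise text layout of BLAST (a label, a start coordinate, the aligned residues and an end coordinate on each row); check your own file before relying on it. The sample text is invented for illustration and is not taken from the wheat results, and the function name is hypothetical.

```python
def read_pairwise_rows(text: str) -> dict:
    """Concatenate the aligned residues of all Query and Sbjct rows."""
    aligned = {"Query": "", "Sbjct": ""}
    for line in text.splitlines():
        parts = line.split()
        # Alignment rows have four fields: label, start, residues, end.
        # Header lines such as 'Query= ...' and the match line in between
        # do not fit this pattern and are skipped.
        if (
            len(parts) == 4
            and parts[0] in aligned
            and parts[1].isdigit()
            and parts[3].isdigit()
        ):
            aligned[parts[0]] += parts[2]
    return aligned


# Invented two-row alignment in the assumed layout.
sample = """\
Query  1   MKTLLVAGSW  10
           MK+LL+AGSW
Sbjct  1   MKSLLIAGSW  10

Query  11  PLE-KR  15
           PLE KR
Sbjct  11  PLEAKR  16
"""

aligned = read_pairwise_rows(sample)
print(aligned["Query"])  # MKTLLVAGSWPLE-KR
print(aligned["Sbjct"])  # MKSLLIAGSWPLEAKR
```

Here Query is the sequence you submitted (the IWGSC protein) and Sbjct is the matching sequence in the cultivar, so a gap in the Query row marks a residue that is present only in the cultivar sequence.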

Exploring the Escherichia coli K-12 genome

Start in Ensembl Beta and select the Escherichia coli K-12 (ASM584v2) genome in the Species selector.

  1. Open the species information page.
    • What substrain is this genome?
    • What is the assembly (GCA_) ID? In what year was the original K-12 isolate obtained?
  2. Find the era gene in E. coli K-12.
    • Is the gene known under a different name?
    • What are the coordinates for this gene?
    • What is the biological function of the gene according to PDBe-KB?
  1. Go to the Species selector and search for Escherichia coli by entering the species name in the search bar or by clicking on the species icon in the species list underneath. Select to add the species and click on the green Add button. You should now see E. coli in your species list at the top of the page. Click on the blue Escherichia coli K-12 ASM584v2 to open the assembly information page. Open the track panel on the right.

    The genome assembly is substrain MG1655.

    Stay in the track panel. You can find the assembly ID under Assembly. Click on the assembly ID to open the corresponding entry in the European Nucleotide Archive (ENA), where the original sequence was submitted.

    The assembly ID is GCA_000005845.2. In the description in the ENA entry, we can see that MG1655 was derived from strain W1485, which was derived by Joshua Lederberg from the original K-12 isolate obtained from a patient in 1922.

  2. In the track panel, click on the Search icon, enter era into the text box and click on era b2566 in the results underneath.

    The era gene in the E. coli K-12 genome is also known as ‘b2566’ and the coordinates are 2,702,481-2,703,386.

    In the pop-up menu, click on View in Entity Viewer. Switch to the Gene function tab and click on PDBe-KB P06616 to open the corresponding entry in PDBe-KB.

    According to PDBe-KB, the biological function is as follows: An essential GTPase that binds both GDP and GTP, with nucleotide exchange occurring on the order of seconds whereas hydrolysis occurs on the order of minutes. Plays a role in numerous processes, including cell cycle regulation, energy metabolism, as a chaperone for 16S rRNA processing and 30S ribosomal subunit biogenesis. This description is imported from UniProt.
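
Coordinates such as the era location given above (2,702,481-2,703,386) can be turned into a length with a little arithmetic. The Python sketch below assumes the coordinates are 1-based and inclusive, as is usual for Ensembl locations, so both the start and the end base are counted. The function names are hypothetical.

```python
def parse_range(text: str) -> tuple:
    """Turn a 'start-end' string with thousands separators into two integers."""
    start_text, end_text = text.replace(",", "").split("-")
    return int(start_text), int(end_text)


def span_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval (both ends are counted)."""
    return end - start + 1


start, end = parse_range("2,702,481-2,703,386")
print(span_length(start, end))  # 906
```

The same calculation works for any region you enter in the Navigate browser panel, once the start and end values are written as plain integers.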