Exploring the MYH9 gene in human

  1. In Ensembl, find the human MYH9 (myosin, heavy chain 9, non-muscle) gene and open the Gene tab.
    • On which chromosome and which strand of the genome is this gene located?
    • How many transcripts (splice variants) are there and how many are protein coding?
    • What is the longest protein-coding transcript, and how long is the protein it encodes?
    • Which transcript would you take forward for further study?
  2. Click on Phenotypes at the left side of the page. Are there any diseases associated with this gene, according to Mendelian Inheritance in Man (MIM)?

  3. What are some functions of MYH9 according to the Gene Ontology (GO) consortium? Have a look at the GO: Biological process pages for this gene.

  4. In the transcript table, click on the transcript ID for MYH9-201, and go to the Transcript tab.
    • How many exons does it have?
    • Are any of the exons completely or partially untranslated?
    • Is there an associated sequence in UniProtKB/Swiss-Prot? Have a look at the General identifiers for this transcript.
  5. Are there microarray (oligo) probes that can be used to monitor ENST00000216181 expression?
  1. Select Human from the Species drop-down list and type MYH9. Click Go. Click on MYH9 (Human Gene) in the search results which will send you to the Gene tab.
    • The gene is located on chromosome 22 on the reverse strand.
    • Ensembl has 23 transcripts annotated for this gene, of which 6 are protein-coding.
    • The longest protein-coding transcript is MYH9-215 and it codes for a protein that is 1,981 amino acids long.
    • MYH9-201 is the transcript I would take forward for further study, as it is the MANE Select transcript (for a description, mouse-over the MANE Select flag in the transcript table).
  2. Click on Phenotypes in the left-hand panel to see the associated phenotypes. There is a large table of phenotypes. To see only the ones from MIM, type MIM into the filter box at the top right-hand corner of the table.

    These are some of the phenotypes associated with MYH9 according to MIM: Deafness, Autosomal dominant 17 and Macrothrombocytopenia and granulocyte inclusions with or without nephritis or sensorineural hearing loss. You can click on the records for more information.

  3. The Gene Ontology project maps terms to a protein in three classes: biological process, cellular component, and molecular function. Click on GO: Biological process on the left-hand panel. Angiogenesis, cell adhesion, and protein transport are some of the roles associated with MYH9. All GO terms are associated with a single transcript: ENST00000216181.

  4. Click on ENST00000216181.11 in the transcript table. You should now be on the Transcript tab.
    • It has 41 exons, shown in the Transcript summary.

    Click on the Exons link in the left-hand panel.

    • Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in orange). You can also see this in the cDNA view if you click on the cDNA link in the left side menu.

    Click on General identifiers in the left-hand panel.

    • P35579.247 from UniProt/Swiss-Prot matches the translation of the Ensembl transcript. Click on P35579.247 to go to UniProtKB, or click align for the alignment.
  5. Click on Oligo probes in the left-hand panel.

    Probesets from Affymetrix, Agilent, Codelink, Illumina, and Phalanx OneArray match to this transcript sequence. Expression analysis with any of these probesets would reveal information about the transcript. Hint: this information can sometimes be found in the ArrayExpress Atlas (https://www.ebi.ac.uk/biostudies/arrayexpress).
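
The location and transcript counts from step 1 can also be retrieved programmatically. The following is a minimal sketch, assuming the public Ensembl REST server at https://rest.ensembl.org and its lookup/symbol endpoint with expand=1. The helper names fetch_gene and summarise_gene are illustrative and not part of Ensembl, and the counts returned depend on the Ensembl release, so they may differ from the numbers quoted above.

```python
import json
import urllib.request

REST = "https://rest.ensembl.org"  # public Ensembl REST server


def fetch_gene(symbol, species="homo_sapiens"):
    """Fetch a gene record with its transcripts (needs network access)."""
    url = f"{REST}/lookup/symbol/{species}/{symbol}?expand=1"
    req = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarise_gene(record):
    """Summarise location and transcript counts from a lookup record.

    Assumes the record carries 'seq_region_name', 'strand' (1 or -1) and a
    'Transcript' list whose items have a 'biotype' field.
    """
    transcripts = record.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    return {
        "chromosome": record.get("seq_region_name"),
        "strand": "forward" if record.get("strand") == 1 else "reverse",
        "n_transcripts": len(transcripts),
        "n_protein_coding": len(coding),
    }


if __name__ == "__main__":
    print(summarise_gene(fetch_gene("MYH9")))
```

Keeping the summary separate from the network call means summarise_gene can be checked against a saved JSON record without contacting the server.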

Finding a gene associated with a phenotype

Phenylketonuria is a genetic disorder caused by an inability to metabolise phenylalanine in any body tissue. This results in an accumulation of phenylalanine causing seizures and intellectual disability.

(a) Search for phenylketonuria from the Ensembl homepage and narrow down your search to only genes. What gene is associated with this disorder?

(b) How many protein coding transcripts does this gene have? View all of these in the transcript comparison view.

(c) What is the MIM gene identifier for this gene?

(d) Go to the MANE Select transcript and look at its 3D structure. In the model 2pah, how many protein molecules can you see?

(a) Start at the Ensembl homepage (http://www.ensembl.org).

Type phenylketonuria into the search box then click Go. Choose Gene from the left hand menu.

The gene associated with this disorder is PAH, phenylalanine hydroxylase, ENSG00000171759.

(b) If the transcript table is hidden, click on Show transcript table to see it.

There are six protein coding transcripts.

Click on Transcript comparison in the left hand menu. Click on Select transcripts. Either select all the transcripts labelled protein coding one-by-one, or click on the drop down and select Protein coding. Close the menu.

(c) Click on External references.

The MIM gene ID is 612349.

(d) Open the transcript table and click on the ID for the MANE Select: ENST00000553106.6. Go to PDB 3D protein model in the left-hand menu.

The model 2pah is shown by default. It has two protein molecules in it. You may need to rotate the model to see this clearly.
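
The MIM gene identifier from part (c) can also be pulled out of the gene's external references in a script. This is a sketch only, assuming the Ensembl REST xrefs/id endpoint and assuming that MIM gene entries are labelled with the dbname "MIM_GENE" and carry the identifier in primary_id. The function names are illustrative.

```python
import json
import urllib.request

REST = "https://rest.ensembl.org"  # public Ensembl REST server


def fetch_xrefs(stable_id):
    """Fetch external references for an Ensembl stable ID (needs network access)."""
    req = urllib.request.Request(
        f"{REST}/xrefs/id/{stable_id}",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def mim_gene_ids(xrefs):
    """Return the sorted, de-duplicated MIM gene IDs found in a list of xrefs.

    Assumes each xref is a dict with 'dbname' and 'primary_id' keys and that
    MIM gene entries use the dbname 'MIM_GENE'.
    """
    return sorted({x["primary_id"] for x in xrefs if x.get("dbname") == "MIM_GENE"})


if __name__ == "__main__":
    # ENSG00000171759 is the PAH gene ID given in answer (a).
    print(mim_gene_ids(fetch_xrefs("ENSG00000171759")))
```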

Exploring the Dpp6 gene in mouse

Genetic variation in the dipeptidylpeptidase 6 gene (DPP6) in humans has previously been strongly associated with amyotrophic lateral sclerosis (ALS), a lethal disorder caused by progressive degeneration of motor neurons in the brain and spinal cord.

  1. Go to the Ensembl homepage, search for the Dpp6 gene in mouse and click on the transcript ID ENSMUST00000071500 to open the transcript tab. How many exons make up this transcript?

  2. Click on Exons to display the exon sequences of the transcript. Which exon contains the translation start? What is the exon ID of the largest exon? What are the start and end phases of exon 2?

  3. Go to the Protein summary. How many protein domains or features fall within the second exon? What is the Pfam protein domain at the C-terminus of the protein and how many exons does it fall into? Which amino acid positions does the domain above cover?

  4. Go to Domains and features. Which domains are associated with Pfam? How many genes in the mouse genome have the IPR002469 domain? What chromosomes are these genes found on?

  1. Select Mouse from the Species search drop-down and type Dpp6 and click Go. Click on Dpp6-201 (Mouse Transcript, Strain: reference (CL57BL6)) in the results.

    ENSMUST00000071500.13 consists of 26 exons.

  2. Click on Exons in the left-hand panel. The translation start is found in the first exon (ENSMUSE00000725552), shown in dark blue text.

    The largest exon is the final exon (856 bp), which has the exon ID ENSMUSE00000773588. Exon 2 has a start and end phase of 0 and 1 respectively, which means that the exon starts at the first nucleotide of a codon and ends after the first nucleotide of a codon, so that codon is completed in exon 3. Notice that the end phase of each exon is the same as the start phase of the next exon.

  3. Click on Protein summary in the menu on the left hand side of the page. Alternating exons are shown on the protein as different shades of purple.

    There are two predicted protein domains that fall within the second exon: a transmembrane helix and low complexity peptide sequence (Seg). You can click on the track names to find a description.

    Click on a domain or feature to view further information.

    The C-terminal Pfam domain is Peptidase_S9 (PF00326), which spans or partially spans seven exons, covering amino acid positions 582-787.

  4. Click on Domains & features.

    Looking at the domains table you should notice that there are two domains associated with Pfam: PF00326 and PF00930.

    Click on Display all genes with this domain next to IPR002469. This should now display the genes that have the IPR002469 domain located on the karyotype and as a table.

    6 genes have this domain and they are found on chromosomes 1, 2, 5, 9 and 17.
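
The exon phases discussed in step 2 follow simple modular arithmetic. The sketch below assumes the Ensembl convention for fully coding exons, where the end phase equals (start phase + coding length) mod 3, and shows why each end phase matches the next start phase. The exon lengths are made-up illustrative values, not the real Dpp6 exon lengths, and the function names are hypothetical.

```python
def end_phase(start_phase, coding_length):
    """End phase of a fully coding exon: (start phase + length) mod 3."""
    if start_phase not in (0, 1, 2):
        raise ValueError("start phase of a coding exon must be 0, 1 or 2")
    return (start_phase + coding_length) % 3


def chain_phases(first_phase, coding_lengths):
    """Return (start_phase, end_phase) for consecutive fully coding exons.

    Each exon inherits its start phase from the end phase of the previous one.
    """
    phases = []
    phase = first_phase
    for length in coding_lengths:
        nxt = end_phase(phase, length)
        phases.append((phase, nxt))
        phase = nxt
    return phases


# Illustrative lengths only: a 100 bp exon starting in phase 0 ends in phase 1,
# so the following exon starts in phase 1.
print(chain_phases(0, [100, 50, 30]))  # [(0, 1), (1, 0), (0, 0)]
```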

Exploring the CCD7 gene in Arabidopsis thaliana

  1. Find the Arabidopsis thaliana CCD7 gene on Ensembl Plants. On which chromosome and which strand of the genome is this gene located?

  2. Where in the cell is the CCD7 protein located?

  3. What is the source of the assigned gene name?

  4. How many transcripts does it have? How long is its longest transcript (in bp)? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated?

  1. Go to the Ensembl Plants homepage (http://plants.ensembl.org/). Select A. thaliana from the species list and type CCD7 in the search box. Click Go and click on the gene ID AT2G44990. You can find the strand orientation and the location under Summary in the Gene tab.

    The A. thaliana CCD7 gene is located on chromosome 2 on the forward strand.

  2. Click on GO: Cellular component in the left-hand panel.

    The protein is located in the chloroplast and plastid.

  3. Click on Summary in the side menu.

    The gene name is assigned and imported from NCBI gene (formerly Entrezgene).

  4. Click on Show transcript table.

    There are 3 transcripts. The longest one is 2005 bp and the length of the encoded protein is 622 amino acids.

    Click on the transcript ID AT2G44990.3 in the transcript table. You can find the number of exons in the summary information at the top of the page.

    It has 6 exons.

    Click on Sequence: Exons in the left-hand panel.

    The first and last exons are partially untranslated (sequence shown in orange). This can also be seen in the transcript diagrams on the Gene Summary and Transcript Summary pages, where the boxes representing the first and last exons are partially unfilled.

Exploring a bacterial gene in Clostridium sporogenes

Start in Ensembl Bacteria and select the Clostridium sporogenes (GCA_001444695) genome.

  1. What GO: biological process terms are associated with the PolC gene?

  2. Go to the transcript tab for the only transcript, OQP95999. How long is the transcript?

  3. What domains can be found in the protein product of this transcript? How many different domain prediction methods agree with each of these domains?

  1. From the Ensembl Bacteria homepage, select Clostridium sporogenes by beginning to write the species name and selecting the species from the auto-complete list. Type PolC and click on the gene ID VT92_0235670. Click on GO: biological process in the left-hand panel.

    There are two terms listed: GO:0006260, DNA replication and GO:0006261, DNA-templated DNA replication.

  2. Click on the transcript named OQP95999 or on the Transcript tab.

    OQP95999 is 4299 bp in length.

  3. Click on either Protein Summary or Domains & features in the left-hand menu to see the domains graphically or as a table, respectively.

Exploring a gene in Escherichia coli

Start in Ensembl Bacteria and search for the Escherichia coli str. K-12 substr. MG1655 (GCA_000005845) genome.

  1. What GO: biological process terms are associated with the Era gene?

  2. How many different InterPro domains are found in the protein product of this gene?

  3. What is the associated UniProt ID of the transcript?

Enter part of the name into the genome search box (e.g. MG1655) and then select the correct genome to go to the species information page.

  1. Enter Era into the search box and hit Go. Click the link in the first hit to go to the era gene page. From here, click GO: Biological process in the left-hand menu.

    There are four GO IDs: GO:0000028, GO:0006468, GO:0042274 and GO:0046777.

  2. Click on the transcript ID AAC75619 in the transcript table on the Gene tab. In the Transcript tab, go to Domains & features in the left-hand panel. Count the number of unique InterPro IDs in the table.

    11 different InterPro domains are found in the protein product of Era.

  3. You can find the UniProt ID in the transcript table or under General identifiers in the left-hand panel.

    The UniProt ID is P60785.

Exploring a gene in Magnaporthe oryzae

We’re going to look at the gene ATG8 in Magnaporthe oryzae in Ensembl Fungi. This gene is involved in autophagy, and targeted silencing of this gene inhibits infection (you can find further info in Wilson and Talbot, Nature Reviews Microbiology volume 7, pages 185–195 (2009)).

  1. Find the genomic sequence of the M. oryzae ATG8 gene. How many exons does the gene have?

  2. What is the GO accession ID for this gene?

  3. How many transcripts does the gene have? Download the cDNA sequence in FASTA format to your computer.

  4. Are there any entries of the gene in external databases? If so, which ones?

  1. From the Ensembl Fungi homepage, type ATG8 into the Search for a gene search bar, click the drop-down menu and select Magnaporthe oryzae and click the Go button. Click on the gene ID M_BR32EuGene_00029551, which will open the Gene tab. Go to Sequence in the left-hand panel. ATG8 exons are in red font.

    The ATG8 gene has 3 exons.

  2. Go to Ontologies: GO: Biological process in the left-hand panel.

    The accession ID is GO:0006914.

  3. Click on Show transcript table underneath the gene summary information at the top of the page.

    The gene has 1 transcript.

    Click on Sequence: cDNA in the left-hand panel. You can export the sequence by clicking the Download sequence button, which will open a pop-up menu. Select FASTA from the drop-down list and select cDNA only. You can download the sequence as is or, if you have a large sequence, you can download the compressed file.

  4. Click on External References: General identifiers in the left-hand panel. You will find hyperlinks to entries in external databases under the Database identifier column.

    Yes, there are entries in the European Nucleotide Archive (ENA), INSDC, UniParc and UniProtKB.
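
The cDNA download in step 3 can also be scripted. This is a rough sketch, assuming that the Ensembl REST sequence/id endpoint at https://rest.ensembl.org also serves Ensembl Fungi genomes and returns FASTA when asked for text/x-fasta. The transcript ID is left as a parameter because it is not given above, and the helper names are illustrative.

```python
import urllib.request

REST = "https://rest.ensembl.org"  # assumed to serve Ensembl Fungi IDs as well


def cdna_fasta_url(transcript_id):
    """Build the sequence/id URL requesting the cDNA of one transcript."""
    return f"{REST}/sequence/id/{transcript_id}?type=cdna"


def fetch_cdna_fasta(transcript_id):
    """Download the cDNA as FASTA text (needs network access)."""
    req = urllib.request.Request(
        cdna_fasta_url(transcript_id),
        headers={"Content-Type": "text/x-fasta"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line to joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}
```

Saving the text returned by fetch_cdna_fasta to a file gives the same kind of FASTA file as the Download sequence button, and parse_fasta can be used to check the sequence length.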

Exploring a gene in Plasmodium falciparum

  1. Find the Plasmodium falciparum PF3D7_1145400 gene on Ensembl Protists. On which strand is this gene located? What are the coordinates of the gene?

  2. How long is its transcript (in bp)? How long is the protein it encodes? How many exons does it have?

  3. What is the UniProt ID that maps to the translation of this transcript?

  4. What are the GO:Biological process(es) associated with PF3D7_1145400?

  1. Go to the Ensembl Protists homepage. Select Plasmodium falciparum from the species list and type PF3D7_1145400 in the search box. Click Go. Click on PF3D7_1145400 in the search results. You can find the strand orientation and coordinates in the gene Summary page.

    PF3D7_1145400 is located on the reverse strand of chromosome 11 between 1,800,544 and 1,803,550.

  2. Click on Show transcript table.

    The transcript is 2,514 base pairs and the length of the encoded protein is 837 amino acids.

    Click on the transcript ID CZT99117 in the transcript table.

    It has four exons.

  3. You can find this information in a number of places: the transcript table, External references on the Gene tab or General identifiers on the Transcript tab.

    The UniProt ID that maps to the protein encoded by the PF3D7_1145400 transcript is Q8IHR4.

  4. Click on GO: Biological process in the side menu of the Gene tab.

    The PF3D7_1145400 gene is involved in mitochondrial fission.
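
The transcript and protein lengths in step 2 can be cross-checked with a little arithmetic. The sketch below assumes a complete coding sequence whose stop codon is included in the transcript, so the coding part is 3 × (protein length + 1) bases and whatever remains is untranslated. The function name is illustrative.

```python
def untranslated_bases(transcript_bp, protein_aa):
    """Bases not explained by the coding sequence plus one stop codon.

    Assumes a complete CDS with the stop codon counted in the transcript.
    """
    coding_bp = 3 * (protein_aa + 1)
    if coding_bp > transcript_bp:
        raise ValueError("protein is too long for this transcript length")
    return transcript_bp - coding_bp


# Values from step 2: a 2,514 bp transcript encoding 837 amino acids.
print(untranslated_bases(2514, 837))  # 0
```

A result of 0 is consistent with a transcript annotated without untranslated regions. A positive result would point to annotated UTR sequence, as long as the protein length belongs to the same transcript.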