Methylation data in human
This exercise requires you to combine the knowledge you have gained about different aspects of Ensembl. It is designed to be challenging and force you to come up with solutions yourself.
The human PDHA2 gene, that encodes for a subunit of the pyruvate dehydrogenase complex, is exclusively expressed in spermatogenic cells. In the paper ‘Human testis-specific PDHA2 gene: Methylation status of a CpG island in the open reading frame correlates with transcriptional Activity’ (Pinheiro et al Mol Genet Metab. 2010 Apr;99(4):425-30), two CpG islands in the PDHA2 gene are reported, one encompassing the core promoter region and extending into the open reading frame, the other exclusively located in the coding region. The latter CpG island was shown to be methylated in somatic tissues but demethylated in testicular germ cells and has therefore been proposed to play an important role in the tissue-specific expression of the PDHA2 gene.
(a) Find the PDHA2 gene for human and go to the Region in detail page. Zoom out one step, so that 5 kb around the PDHA2 gene is shown.
(b) Turn on the CpG islands track. Two CpG islands are reported in the PDHA2 gene by Pinheiro et al (2010). Do they appear in this track? If not, why not? (Tip: turn on Display empty tracks to confirm that a track is on but has no data.)
(c) Confirm the existence of the two CpG islands using the EMBOSS program CpGPlot on the sequence around the PDHA2 gene.
(d) Upload the CpG islands found by CpGPlot using Manage your data. Use BED format, which in its simplest form just consists of the chromosome and the start and end coordinates, separated by spaces (as an optional fourth field, you can add a name/description). The genomic start and end coordinates of the CpG islands can be calculated from the genomic start coordinate of the sequence on which the CpGPlot program was run and the relative location of the CpG islands on this sequence as given by the CpGPlot output.
(e) Create a link to allow you to show your new BED track to colleagues, compared to the %GC track.
(f) Is there a regulatory feature at the 5’ end of PDHA2? What kind? What cell type(s) is it active in? What evidence supports the presence of this feature?
(g) Turn on the RNASeq tracks for different tissues. Is there evidence that PDHA2 is expressed in one tissue more than others?
(h) How well conserved is the region of the PDHA2 gene amongst the eutherian mammals? Are the CpG islands conserved?
(i) How many GO: Biological process terms are associated with PDHA2? Can you export the sequences of all human genes that are also associated with the first of these terms?
(j) Can you fetch the gene sequence for PDHA2 in FASTA using the Ensembl REST API?
(a) Go to the Ensembl homepage. Select Search: Human and type PDHA2 in the for text box. Click Go. Click on 4:95840019-95841474:1. Zoom out one step, so that the 5kb region around the PDHA2 gene is shown.
You may want to turn off all tracks that you added to the display in the previous exercises as follows: Click Configure this page in the side menu. Click Reset configuration. SAVE and close.
(b) Click Configure this page in the side menu. Type cpg in the Find a track box. Select CpG islands. SAVE and close.
No CpG islands are shown. As for the inclusion of CpG islands into the Ensembl database for human a minimum length of 400 bp is required, the reason for this could be that the CpG islands in the PDHA2 gene are shorter than 400 bp. However, there is a %GC track, which shows that the region that comprises the 5’ part of the PDHA2 gene and the region directly upstream of the gene has a high %GC (the red line in the %GC track indicates 50% GC). It is difficult / impossible to distinguish individual CpG islands in this track, though.
(c) Click Export data in the side menu. Click Next>. Click on Text.
Select and copy the sequence. Go to http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html. Paste the sequence into the text box. Click Run.
CpGPlot does confirm the existence of two CpG islands in the PDHA2 gene region of lengths 200 and 263 bp, respectively. So, it is indeed because of their length being less than 400 bp that these CpG islands are not present in the Ensembl database.
(d) The genomic coordinates of your CpG islands are the start coordinates of your region of interest (found at the top of your exported FASTA) plus the coordinates of the islands within that region (from EMBOSS). In my case this is:
First island:
start = 95839291 + 734 = 95840025
end = 95839291 + 933 = 95840224
Second island:
start = 95839291 + 1058 = 95840349
end = 95839291 +1320 = 95840611
This gives coordinates for my CpG islands in BED format as:
chr4 95840025 95840224 cpg_island_1
chr4 95840349 95840611 cpg_island_2
Click Add your data in the side menu (Note that if you have previously uploaded data to Ensembl, this box will say Manage your data instead). Click on Upload Data. Type CpG islands in the Name for this upload (optional) box. Select Data format: BED. Copy the BED formatted data into the Paste file box. Click Upload. Click on Go to nearest region with data: 4:95790125-95890125.
The two CpG islands should now be shown on the Region in detail page. They should coincide with the regions of high %GC.
Zoom in on the two CpG islands.
To display the names of the CpG islands: Hover over the CpG islands track name. Hover over the icon of the cog-wheel. Select Labels.
(e) Drag your CpG islands track so that it is next to the %GC track. Click Share this page in the side menu. Select the link and copy. Paste into your internet browser to view.
(f) There is a lilac TFBS at the 5’ end of PDHA2.
Click on the feature then the ID ENSR00001245785 to get to the regulatory tab.
The TFBS is active in H1ESC cells.
Click on Details by cell type, then Select cells and choose H1ESC and close, then Select evidence and choose ALL ON and close.
This region has DNase sensitivity, Rad21 binding and H3K27me3 and H3K36me3 histone modifications.
(g) Click on Configure this page, then select RNASeq models. Turn on the BAM files for all the tissues in Coverage only.
You will see histograms of RNASeq coverage for each of the tissues. The largest number is for the merged read. For the tissue-specific read, Testes have a higher peak than all the other tissues. There are also wider peaks in the Testes track that cover the whole gene, whereas other tissues only have a peak at the 3’ end of PDHA2.
(h) Click on Configure this page, then select Comparative genomics. Turn on the tracks for the Constrained elements for eutherian mammals and Conservation score for 39 eutherian mammals.
The region of the gene itself has high GERP scores, indicated by constrained elements over most of the gene. There is no apparent difference in the conservation score between the CpG islands and their flanking regions.
(i) Click on the Gene Tab, Gene: PDAH2 and select GO table.
There are five terms in the table, the first being GO:0006006, pyruvate metabolic process.
To export the list use BioMart. Click on Search BioMart in the table. This will take you to a BioMart results page where the genes were filtered by GO Term Accession: GO:0006006.
To get sequences, click on Attributes. Choose Sequences. Expand SEQUENCES and select Unspliced (Gene). Expand Header information and deselect Ensembl Transcript ID.
Click Results. You can export these results if you wish.
(j) Go to the REST API documentation page. Click on GET sequence/id/:id to get the documentation for this command.
You will need the stable ID of PDHA2, go to the browser page to find that it is ENSG00000163114.
Use the documentation to construct a URL in the correct form, ie:
http://rest.ensembl.org/sequence/id/:id?format=fasta
Add the ID to the URL to create:
http://rest.ensembl.org/sequence/id/ENSG00000163114?format=fasta
This URL will give you the sequence.