Finding genes by protein domain

Find mouse proteins with Signalp cleavage sites located on chromosome 9.

As with all BioMart queries, you must select the dataset, set your filters (input) and define your attributes (desired output). For this exercise:

Dataset: Ensembl genes in mouse

Filters: Signalp cleavage sites on chromosome 9

Attributes: Ensembl gene and transcript IDs and gene names

Go to the Ensembl homepage (http://www.ensembl.org) and click on BioMart at the top of the page. Select Ensembl genes as your database and Mouse genes as the dataset. Click on Filters on the left of the screen and expand REGION. Change the chromosome to 9. Now expand PROTEIN DOMAINS, also under Filters, and select Limit to genes. Choose With Cleavage site (Signalp) from the drop-down menu and then select Only. Clicking on Count should show that you have filtered the dataset down to 217 genes.

Click on Attributes and expand GENE. Select Gene name. Now click on Results. The first 10 results are displayed by default; to display all results, select ALL from the drop-down menu.

The output will display the Ensembl gene ID, Ensembl transcript ID and gene name for every gene on mouse chromosome 9 that encodes a protein with a Signalp cleavage site. If you prefer, you can also export the table as an Excel sheet by using the Export all results to XLS option.
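The same dataset, filter and attribute choices can also be written as a BioMart XML query and sent to the martservice endpoint. The following is a minimal sketch, not a definitive implementation: the dataset name mmusculus_gene_ensembl and the attribute names follow Ensembl's usual naming, while the internal filter name with_signalp is an assumption that should be checked against the XML shown by the XML button in the BioMart toolbar. The helper names build_query and run_query are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Return a BioMart XML query string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        if isinstance(value, bool):
            # Boolean filters ("Only" / "Excluded") use the excluded flag.
            ET.SubElement(ds, "Filter", {"name": name, "excluded": "0" if value else "1"})
        else:
            ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def run_query(xml_query):
    """Send the query to the martservice and return the TSV text (needs network)."""
    url = MARTSERVICE + "?" + urllib.parse.urlencode({"query": xml_query})
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


# Mouse genes on chromosome 9 with a Signalp cleavage site.
# "with_signalp" is an assumed internal filter name.
query_xml = build_query(
    "mmusculus_gene_ensembl",
    {"chromosome_name": "9", "with_signalp": True},
    ["ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"],
)
```

Calling run_query(query_xml) would submit the request; it is left uncalled here because it needs network access, and the gene count returned may differ between Ensembl releases.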

Exporting homologues with BioMart

Go to Ensembl’s BioMart. For a list of Ciona savignyi Ensembl genes, export the human orthologues:
ENSCSAVG00000000002, ENSCSAVG00000000003, ENSCSAVG00000000006, ENSCSAVG00000000007, ENSCSAVG00000000009, ENSCSAVG00000000011

Do all of these genes have a homologue in human?

  1. Go to BioMart (you can find a shortcut in the navigation bar at the top of any Ensembl page) and click New. Choose the Ensembl Genes database. Choose the Ciona savignyi genes (CSAV 2.0) dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Enter the gene list in the Input external references ID list box. Gene stable ID(s) should be preselected.

  3. Click on Attributes in the left panel. Select the Homologues attributes at the top of the page. Expand the GENE section. Deselect Gene stable ID version, Transcript stable ID and Transcript stable ID version. Expand the ORTHOLOGUES [F-J] section. Select Human gene stable ID.

  4. Click Results. Select View: All rows as HTML.

    All but ENSCSAVG00000000006 have a homologue in human.
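If the results are exported as TSV, the same check can be done programmatically. The sketch below assumes a two-column table (Gene stable ID, Human gene stable ID) in which a gene without a human orthologue has an empty second column. The function name genes_without_orthologue is hypothetical, and the human ID in the example is a placeholder, not a real result.

```python
import csv
import io


def genes_without_orthologue(tsv_text, query_ids):
    """Return the query IDs that never appear with a non-empty orthologue ID."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    with_orthologue = set()
    for row in reader:
        if len(row) >= 2 and row[1].strip():
            with_orthologue.add(row[0].strip())
    return [gene for gene in query_ids if gene not in with_orthologue]


# Placeholder table: the human ID below is illustrative only.
example_tsv = (
    "Gene stable ID\tHuman gene stable ID\n"
    "ENSCSAVG00000000002\tHUMAN_GENE_PLACEHOLDER_1\n"
    "ENSCSAVG00000000006\t\n"
)
missing = genes_without_orthologue(
    example_tsv, ["ENSCSAVG00000000002", "ENSCSAVG00000000006"]
)
```

Checking against the original query list, rather than only the rows returned, also catches IDs that are absent from the table altogether.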

Convert IDs using BioMart

BioMart is a very handy tool when you want to convert IDs from different databases. The following is a list of 29 IDs of human proteins from the NCBI RefSeq database:
NP_001218, NP_203125, NP_203124, NP_203126, NP_001007233, NP_150636, NP_150635, NP_001214, NP_150637, NP_150634, NP_150649, NP_001216, NP_116787, NP_001217, NP_127463, NP_001220, NP_004338, NP_004337, NP_116786, NP_036246, NP_116756, NP_116759, NP_001221, NP_203519, NP_001073594, NP_001219, NP_001073593, NP_203520, NP_203522

Use BioMart in Ensembl to generate a list showing which Ensembl gene IDs and gene names these RefSeq IDs correspond to. Do these 29 proteins correspond to 29 genes?

  1. Go to BioMart. You can find a shortcut to the tool on any Ensembl page in the navigation bar at the top of the page. Click New in the top left-hand menu if you need to start a new query. Choose the Ensembl Genes database. Choose the Human genes dataset.

  2. Click on Filters in the left panel. Expand the GENE section. Select Input external references ID list - RefSeq peptide ID(s) and enter the list of IDs in the text box (either comma separated or as a list).

HINT: You may have to scroll down the menu to see these.

Count shows 10 genes. Remember that one gene may have multiple splice variants coding for different proteins, which is why these 29 proteins do not correspond to 29 genes.

  3. Click on Attributes in the left panel. Select the Features attributes page. Expand the External section. Select HGNC symbol and RefSeq Peptide ID from the External References section.

  4. Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file.
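Before pasting a long ID list into a filter, it can help to normalise it and confirm the count. The sketch below splits the RefSeq list from this exercise on commas or whitespace, removes duplicates while keeping order, and joins the IDs with commas, which is the form a BioMart filter value generally takes in an XML query. The function name parse_id_list is hypothetical.

```python
import re

REFSEQ_IDS = """
NP_001218, NP_203125, NP_203124, NP_203126, NP_001007233, NP_150636,
NP_150635, NP_001214, NP_150637, NP_150634, NP_150649, NP_001216,
NP_116787, NP_001217, NP_127463, NP_001220, NP_004338, NP_004337,
NP_116786, NP_036246, NP_116756, NP_116759, NP_001221, NP_203519,
NP_001073594, NP_001219, NP_001073593, NP_203520, NP_203522
"""


def parse_id_list(text):
    """Split on commas or whitespace, drop blanks and duplicates, keep order."""
    seen = {}
    for token in re.split(r"[,\s]+", text.strip()):
        if token:
            seen.setdefault(token, None)
    return list(seen)


ids = parse_id_list(REFSEQ_IDS)
filter_value = ",".join(ids)  # comma-separated value for a list filter
```

Here len(ids) confirms that 29 distinct protein IDs go into the query, which makes the smaller gene count reported by Count easier to interpret.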

Export structural variants

You can use BioMart to query variants, not just genes. (Make sure you use the right Datasets.)

(a) Export the study accession, source name, chromosome, sequence region start and end (in bp) of human structural variations (SV) on chromosome 1, starting at 130,408 and ending at 210,597.

(b) In a new BioMart query, find the alleles, phenotype descriptions, and associated genes for the human SNPs rs566014072 and rs754099015. Can you view this same information in the Ensembl browser?

(a) Choose Ensembl Variation and Human Structural Variants (GRCh38).

Filters: Region: Chromosome 1, Base pair start: 130408, Base pair end: 210597

Count shows 87 structural variants.

Attributes: Structural Variation (SV) Information: DGVa Study Accession and Source Name, Structural Variation (SV) Location: Chromosome/scaffold name, Chromosome/scaffold position start (bp) and Chromosome/scaffold position end (bp).

(b) Choose Ensembl Variation and Human Short Variation (SNPs and indels) (GRCh38).

Filters: under Filter by Variation name, enter: rs566014072, rs754099015

Attributes: Variant Name, Variant Alleles, Phenotype description and Associated gene.

You can view this same information in the Ensembl browser. Click on one of the variation IDs (names) in the result table. The variation tab should open in the Ensembl browser. Click Phenotype Data.

Find genes associated with array probes

Forrest et al. performed a microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers (Environ Health Perspect. 2005 June; 113(6): 801–807). The microarray used was the human Affymetrix U133A/B (also called U133 plus 2) GeneChip. The top 25 up-regulated probe-sets were:

207630_s_at 221840_at 219228_at 204924_at 227613_at 223454_at 228962_at 214696_at 210732_s_at 212370_at 225390_s_at 227645_at 226652_at 221641_s_at 202055_at 226743_at 228393_s_at 225120_at 218515_at 202224_at 200614_at 212014_x_at 223461_at 209835_x_at 213315_x_at

(a) For the genes corresponding to these probe-sets, retrieve the Ensembl gene and transcript IDs as well as their HGNC symbols and descriptions.

(b) In order to analyse these genes for possible promoter/enhancer elements, retrieve the 2000 bp upstream of the transcripts of these genes.

(c) In order to be able to study these human genes in mouse, identify their mouse orthologues. Also retrieve the genomic coordinates of these orthologues.

(a) Click New. Choose the ENSEMBL Genes database. Choose the Human genes (GRCh38) dataset.

Click on Filters in the left panel. Expand the GENE section by clicking on the + box. Select Input microarray probes/probesets ID list - AFFY HG U133 Plus 2 probe ID(s) and enter the list of probeset IDs in the text box (either comma separated or as a list).

Count shows 26 genes match this list of probes.

Click on Attributes in the left panel. Select the Features attributes page. Expand the GENE section by clicking on the + box. In addition to the default selected attributes, select Description. Expand the External section by clicking on the + box. Select HGNC symbol from the External References section and AFFY HG U133 Plus 2 probe from the Microarray probes/probesets attributes section.

Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file. Tick the box Unique results only.

Your results should show that the 25 probes map to 26 Ensembl genes.
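The number of genes can exceed the number of probes because a single probe-set may map to more than one gene. A small sketch of how to summarise such a many-to-many table is shown below; the function name summarise_mapping is hypothetical, and the probe and gene names are placeholders rather than real results.

```python
from collections import defaultdict


def summarise_mapping(pairs):
    """Return (probe count, gene count, probes that map to several genes)."""
    probe_to_genes = defaultdict(set)
    for probe, gene in pairs:
        probe_to_genes[probe].add(gene)
    genes = set().union(*probe_to_genes.values()) if probe_to_genes else set()
    multi = {p: sorted(g) for p, g in probe_to_genes.items() if len(g) > 1}
    return len(probe_to_genes), len(genes), multi


# Placeholder (probe, gene) rows, as they might appear in an exported table.
pairs = [
    ("probe_1", "GENE_A"),
    ("probe_2", "GENE_B"),
    ("probe_2", "GENE_C"),
]
n_probes, n_genes, multi_mapped = summarise_mapping(pairs)
```

In this placeholder table two probes map to three genes, the same pattern as 25 probes mapping to 26 genes.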

(b) Don’t change Dataset and Filters – simply click on Attributes.

Select the Sequences attributes page. Expand the SEQUENCES section by clicking on the + box. Select Flank (Transcript) and enter 2000 in the Upstream flank text box. Expand the Header information section by clicking on the + box. Select, in addition to the default selected attributes, Gene description and Gene name.

Note: Flank (Transcript) will give the flanks for all transcripts of a gene with multiple transcripts. Flank (Gene) will give the flanks for one possible transcript in a gene (the most 5’ coordinates for upstream flanking).

Click the Results button on the toolbar.
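Which side of a transcript counts as upstream depends on its strand. The sketch below illustrates the idea with 1-based inclusive coordinates; it is not BioMart's own implementation, it does not clip at the chromosome end, and the function name upstream_flank and the example coordinates are hypothetical.

```python
def upstream_flank(start, end, strand, length=2000):
    """Return (flank_start, flank_end) for the region upstream of a feature.

    Coordinates are 1-based and inclusive; strand is 1 or -1.
    """
    if strand == 1:
        # Forward strand: upstream lies before the start coordinate.
        return max(1, start - length), start - 1
    if strand == -1:
        # Reverse strand: the 5' end is at `end`, so upstream lies beyond it.
        return end + 1, end + length
    raise ValueError("strand must be 1 or -1")


forward = upstream_flank(10_000, 12_000, 1)   # (8000, 9999)
reverse = upstream_flank(10_000, 12_000, -1)  # (12001, 14000)
```

Both example results span exactly 2000 bases, matching the value entered in the Upstream flank text box.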

(c) You can leave the Dataset and Filters the same, and go directly to the Attributes section:

Click on Attributes in the left panel. Select the Homologues attributes page. Expand the GENE section by clicking on the + box. Select Gene name. Deselect Ensembl Transcript ID. Expand the ORTHOLOGUES [K-O] section by clicking on the + box. Select Mouse gene stable ID, Mouse chromosome/scaffold name, Mouse chromosome/scaffold start (bp) and Mouse chromosome/scaffold end (bp).

Click the Results button on the toolbar. Select View: All rows as HTML or export all results to a file.

Your results should show that for most of the human genes at least one mouse orthologue has been identified.

Exporting paralogues with BioMart

Export a list of all human genes on chromosome 14 which have a paralogue, including the gene names, the last common ancestor and the identity between the genes. How many genes on chromosome 14 have a paralogue?

Go to BioMart and click New. Choose the Ensembl Genes database. Choose the Human genes dataset.

Click on Filters in the left panel. Expand the REGION section by clicking on the + box and select Chromosome/scaffold: 14. Under MULTI SPECIES COMPARISONS select Homologue filters, then Paralogous Human Genes: Only. Click the Count button in the side menu.

There are 806 genes on chromosome 14 which have a paralogue.

Click on Attributes in the left panel. Select Homologues from the six options at the top. Expand the GENE section by clicking on the + box. Deselect Transcript stable ID and Transcript stable ID version and select Gene name. Under PARALOGUES select Human paralogue gene stable ID, Human paralogue associated gene name, Paralogue last common ancestor with Human, Paralogue %id. target Human gene identical to query gene and Paralogue %id. query gene identical to target Human gene. Click the Results button on the toolbar. Select View: All rows as HTML or Export all results to a File.

Exporting regulatory features with BioMart

Using the Human Regulatory Features dataset, export a list of all enhancers falling in cytogenetic band q13.2 on chromosome 22 and their activity in Aorta. How many of them are active?

Go to BioMart and click New. Choose the Ensembl Regulation database. Choose the Human Regulatory Features dataset.

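Click on Filters in the left panel. Expand the REGULATORY FEATURES section by clicking on the + box and select the following: chromosome 22, cytogenetic band q13.2 (as both band start and band end), feature type Enhancer, and epigenome aorta.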

Click on Attributes in the left panel. Select Chromosome/scaffold name, Start (bp), End (bp), Feature type, Regulatory stable ID, Activity and Epigenome name. Click the Results button to see the results table. Select View: All and choose to see Unique results only.

There is only one enhancer active in aorta in this cytogenetic band: ENSR00001239875.

Exporting histone modification sites with BioMart

Using the Human Regulatory Evidence dataset, export a list of all H3K9me3 modified loci on chromosome Y in Aorta. What is the source of this evidence?

Go to BioMart and click New. Choose the Ensembl Regulation database. Choose the Human Regulatory Evidence dataset.

Click on Filters in the left panel. Expand the REGULATORY EVIDENCE section by clicking on the + box and select Chromosome - Y, Feature Type – H3K9me3, and Epigenome – aorta.

Click on Attributes in the left panel. Select Chromosome/scaffold name, Start (bp), End (bp), Feature type, Epigenome name and Project name. Click the Results button to see the results table. Select View: All rows as HTML or Export all results to a File.

This data comes from the Roadmap Epigenomics project.