Ensembl Browser Workshop - EuroFAANG GENE-SWitCH

Course Details

Lead Trainer
Louisse Paola Mirabueno
Event Dates
2023-01-31 until 2023-02-01
Location
  Virtual: EMBL-EBI, Hinxton
Description
Work with the Ensembl Outreach team to get to grips with the Ensembl browser, accessing and analysing Chicken and Pig genomic data.
Survey
 Ensembl Browser Workshop - EuroFAANG GENE-SWitCH Feedback Survey

Demos and exercises

Ensembl species

The front page of Ensembl is found at ensembl.org. It contains lots of information and links to help you navigate Ensembl:

At the top left you can see the current release number and what has come out in this release. To access old releases, scroll to the bottom of the page and click on View in archive site.

Click on the links to go to the archives. Alternatively, you can jump quickly to the correct release by putting it into the URL, for example e98.ensembl.org jumps to release 98.
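If you need to pin an analysis to a particular release, the URL shortcut can be generated in a script. This is a minimal Python sketch that assumes only the eNN.ensembl.org pattern described above. The function name archive_url and the https:// scheme are our own assumptions, and not every release number necessarily has a live archive site.

```python
def archive_url(release: int) -> str:
    """Return the archive address for an Ensembl release, e.g. 98 -> e98.ensembl.org."""
    if release < 1:
        raise ValueError("release must be a positive integer")
    # The https:// scheme is an assumption; the text only gives the host name.
    return f"https://e{release}.ensembl.org"


print(archive_url(98))
```

Check the list of archives on the website before relying on a generated address.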

Click on View full list of all species.

Click on the common name of your species of interest to go to the species homepage. We’ll click on Chicken.

Here you can see links to example pages and to download flatfiles. To find out more about the genome assembly and genebuild, click on More information and statistics.

Here you’ll find a detailed description of how the genome was produced and links to the original source. You will also see details of how the genes were annotated.

Chicken assembly

When was the current Gallus gallus genome assembly submitted and by whom?

Select Chicken from the drop down species list, or click on View full list of all Ensembl species, then choose Chicken from the list to go to the species homepage. Click on More information and statistics.

The bGalGal1.mat.broiler.GRCg7b assembly was submitted by the Vertebrate Genomes Project in January 2021.

Pig species data

(a) How many coding and non-coding genes does pig have?

(b) When was the current Sus scrofa genome assembly produced and by whom?

(a) Select Pig from the drop down species list, or click on View full list of all Ensembl species, then choose Pig from the list to go to the species homepage. Click on More information and statistics.

Pig has 22,063 coding genes and 13,154 non-coding genes.

(b) The Sscrofa11.1 assembly of the pig genome was produced in January 2017 by the Swine Genome Sequencing Consortium (SGSC).

Exploring genomic regions

Start at the Ensembl front page, ensembl.org. You can search for a region by typing it into a search box, but you have to specify the species.

Type (or copy and paste) chicken 4:53544500-53598000 into either search box.

Press Enter or click Go to jump directly to the Region in detail page.
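If you work with many regions, it can help to check a location string before pasting it into the search box. The sketch below parses the "species chromosome:start-end" form used above. The names Region and parse_region are illustrative rather than part of any Ensembl library, and the pattern only covers simple chromosome names.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    species: str
    chromosome: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Ensembl coordinates are 1-based and inclusive at both ends.
        return self.end - self.start + 1


# Optional species name, then chromosome:start-end; commas in numbers are allowed.
_REGION = re.compile(
    r"^\s*(?:(?P<species>[A-Za-z_ ]+?)\s+)?"
    r"(?P<chrom>\w+):(?P<start>[\d,]+)-(?P<end>[\d,]+)\s*$"
)


def parse_region(text: str, default_species: str = "") -> Region:
    """Split a location string into species, chromosome, start and end."""
    match = _REGION.match(text)
    if match is None:
        raise ValueError(f"not a recognisable region: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    species = (match["species"] or default_species).strip().lower()
    return Region(species, match["chrom"], start, end)


region = parse_region("chicken 4:53544500-53598000")
print(region, region.length)
```

The same function accepts coordinates written with thousands separators, for example parse_region("4:53,544,500-53,598,000", default_species="chicken").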

Click on the button to view page-specific help. The help pages provide text, labelled images and, in some cases, help videos to describe what you can see on the page and how to interact with it.

The Region in detail page is made up of three images. Let’s look at each one in detail.

  1. The first image shows the chromosome:

You can jump to a different region by dragging out a box in this image. Drag out a box on the chromosome and a pop-up menu will appear.

If you wanted to move to the region, you could click on Jump to region (### bp). If you wanted to highlight it, click on Mark region (###bp). For now, we’ll close the pop-up by clicking on the X on the corner.

  2. The second image shows a 1Mb region around our selected region. This view allows you to scroll back and forth along the chromosome.

You can also drag out and jump to or mark a region.

Click on the X to close the pop-up menu.

Click on the Drag/Select button to change the action of your mouse click. Now you can scroll along the chromosome by clicking and dragging within the image. As you do this you’ll see the image below grey out and two blue buttons appear. Clicking on Update this image would jump the lower image to the region central to the scrollable image. We want to go back to where we started, so we’ll click on Reset scrollable image.

  3. The third image is a detailed, configurable view of the region.

Click on the Drag/Select option at the top or bottom right to switch mouse action. On Drag, you can click and drag left or right to move along the genome; the page will reload when you release the mouse button. On Select, you can drag out a box to highlight or zoom in on a region of interest.

With the tool set to Select, drag out a box around an exon and choose Mark region.

The highlight will remain in place if you zoom in and out or move around the region. This allows you to keep track of regions or features of interest.

We can edit what we see on this page by clicking on the blue Configure this page menu at the left.

This will open a menu that allows you to change the image.

You can put some tracks on in different styles; more details are in this FAQ: http://www.ensembl.org/Help/Faq?id=335.

Let’s add some tracks to this image. Add:

  • All variants on genotyping chips - short variants (SNPs and indels) - Normal

Now click on the tick in the top left-hand corner to save and close the menu. Alternatively, click anywhere outside of the menu. We can now see the tracks in the image.

We can also change the way the tracks appear by hovering over the track name then the cog wheel to open a menu. We can move tracks around by clicking and dragging on the bar to the left of the track name.

Now that you’ve got the view how you want it, you might like to show something you’ve found to a colleague or collaborator. Click on the Share this page button to generate a link. Email the link to someone else, so that they can see the same view as you, including all the tracks you’ve added. These links contain the Ensembl release number, so if a new release or even assembly comes out, your link will just take you to the archive site for the release it was made on.

To return this to the default view, go to Configure this page and select Reset configuration at the bottom of the menu.

Exploring a genomic region in pig

(a) Go to the region from 8,805,953 to 8,858,418 on pig chromosome 11.

(b) Configure this page to turn on the Tandem repeats (TRF) track in this view. What is this track? How many TRF overlap this region?

(c) Create a Share link for this display. Email it to your neighbour.

(d) Export the genomic sequence of the region you are looking at in FASTA format.

(e) Turn off all tracks you added to the Region in detail page.

(a) Go to the Ensembl homepage.

Select Search: Pig and type 11:8,805,953-8,858,418 in the text box (or alternatively leave the Search drop-down list as it is and type Pig 11:8805953-8858418 in the text box). Click Go.

(b) Click Configure this page in the side menu (or on the cog wheel icon in the top left hand side of the bottom image).

Type TRF in the Find a track text box. Select Tandem repeats (TRF). Click on the (i) button to find out more information about this track.

The Tandem Repeats Finder track locates adjacent copies of a pattern of nucleotides.

Save and close the new configuration by clicking on ✓ (or anywhere outside the pop-up window).

There are 19 TRF that overlap this region.

(c) Click Share this page in the side menu. Select the link and copy. Get your neighbour’s email address and compose an email to them, paste the link in and send the message.

When you receive the link from them, open the email and click on the link. You should be able to view the page with the new configuration and data tracks they have added in the Location tab.

(d) Click Export data in the side menu. Leave the default parameters as they are. Click Next>. Click on Text.

Note that the sequence has a header that provides information about the genome assembly (primary_assembly:Sscrofa11.1), the chromosome, the start and end coordinates and the strand. For example:

>primary_assembly:Sscrofa11.1:11:8805953-8858418:1
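When you export many regions, it can be handy to read these fields back out of the header. Here is a minimal sketch that assumes the header always has the five colon-separated fields shown in the example above; the function name parse_export_header is our own.

```python
def parse_export_header(header: str) -> dict:
    """Split an exported FASTA header into its five colon-separated fields."""
    fields = header.strip().lstrip(">").split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {len(fields)}")
    coord_system, assembly, chromosome, span, strand = fields
    start_text, end_text = span.split("-")
    return {
        "coord_system": coord_system,
        "assembly": assembly,
        "chromosome": chromosome,
        "start": int(start_text),
        "end": int(end_text),
        "strand": int(strand),  # 1 is the forward strand
    }


info = parse_export_header(">primary_assembly:Sscrofa11.1:11:8805953-8858418:1")
print(info)
```

Headers from other export options may be laid out differently, so treat the five-field assumption as something to check against your own files.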

(e) Click Configure this page in the side menu. Click Reset configuration, then click ✓ to close the menu.

Exploring a genomic region in chicken

(a) Go to the region from 38,111,022 to 38,265,293 on chicken chromosome 5. How many contigs make up this portion of the assembly (contigs are contiguous stretches of DNA sequence that have been assembled solely based on direct sequencing information)?

(b) Zoom in on the ESRRB gene.

(c) Turn on the RefSeq GFF3 annotation track as Expanded with labels.

(d) Save this image in PDF format.

(a) Go to the Ensembl homepage.

Select Search: Chicken and type 5:38111022-38265293 in the text box (or alternatively leave the Search drop-down list as it is and type Chicken 5:38111022-38265293 in the text box). Click Go.

This genomic region is made up of one contig indicated by the dark blue coloured bar in the Contigs track.

(b) Drag a box around the ESRRB gene and click on Jump to region.

(c) Click on Configure this page on the left. Type RefSeq GFF3 annotation in the Find a track box. Turn the RefSeq GFF3 annotation track on and select Expanded with labels style. Save and close the menu.

(d) Click on the Export this image icon above the image and then on the Download button to download the image in PDF format.

Genes and Transcripts

Demo: The gene tab

If you click on any one of the transcripts in the Region in detail image, a pop-up menu will appear, allowing you to jump directly to that gene or transcript.

Another way to go to a gene of interest is to search directly for it.

We’re going to look at the pig NSDHL gene.

From ensembl.org, type NSDHL into the search bar and click the Go button. You will get a list of hits with the human gene at the top.

When you search for something without specifying the species, or where the ID is not restricted to a single species, the most popular species will appear first; in this case, human, mouse and zebrafish appear first. To find the pig gene, use the Restrict species to: option on the left and select Pig from … 212 more species ….

You will see links to the NSDHL gene in a number of pig breeds. We want the gene in the reference pig.

Click on the gene name or Ensembl ID for the reference pig. The Gene tab should open:

Let’s walk through some of the links in the left hand navigation column. How can we view the genomic sequence? Click Sequence at the left of the page.

The sequence is shown in FASTA format. Take a look at the FASTA header:

Exons are highlighted within the genomic sequence. Variants can be added with the Configure this page link found at the left. Click on it now.

Once you have selected changes (in this example, Show variants and Line numbering), click on the tick at the top right to save and close the menu.

You can download this sequence by clicking on the Download sequence button above the sequence:

This will open a dialogue box that allows you to pick between plain FASTA sequence, or sequence in RTF, which includes all the coloured annotations and can be opened in a word processor. This button is available for all sequence views.

To find out what the protein does, have a look at GO terms from the Gene Ontology consortium (www.geneontology.org). There are three pages of GO terms, representing the three divisions in GO: Biological process (what the protein does), Cellular component (where the protein is) and Molecular function (how it does it). Click on GO: Biological process to see an example of the GO pages.

Can our gene be found in other databases? Go up the left-hand menu to External references:

This contains links to the gene in other projects, such as NCBI gene (formerly Entrezgene), and papers where this sequence is published.

Demo: The transcript tab

Let’s now explore one splice isoform. Click on Show transcript table at the top.

Have a look at the largest one, NSDHL-201.

If we were to only choose one transcript to analyse, we would choose this one because it has the Ensembl canonical flag.

Click on the ID, ENSSSCT00000048661.3.

You are now in the Transcript tab for NSDHL-201. The left hand navigation column provides several options for the transcript NSDHL-201.

For detailed information on the support for this transcript, click on Supporting evidence.

Click on the Exons link.

You may want to change the display (for example, to show more flanking sequence, or to show full introns). In order to do so click on Configure this page and change the display options accordingly.

Now click on the cDNA link to see the spliced transcript sequence.

Untranslated regions (UTRs) are highlighted in dark yellow, codons are highlighted in light yellow, and exon sequence is shown in black or blue letters to show exon divides. Sequence variants are represented by highlighted nucleotides, and clickable IUPAC codes are shown above the sequence.

Next, follow the General identifiers link at the left.

This page shows information from other databases such as ENA, UniProtKB, Reactome and others, that match to the Ensembl transcript and protein.

Now click on Protein summary to view domains from Pfam, Superfamily, PANTHER, and more.

Clicking on Domains & features shows a table of this information.

Exploring the chicken MYH9 gene

(a) Find the chicken MYH9 (myosin, heavy chain 9, non-muscle) gene, and go to the Gene tab.

  • On which chromosome and which strand of the genome is this gene located?
  • How long is the protein it encodes?

(b) What are some functions of MYH9 according to the Gene Ontology consortium? Have a look at the GO pages for this gene.

(c) In the transcript table, click on the transcript ID for MYH9-209, and go to the Transcript tab.

  • How many exons does it have?
  • Are any of the exons completely or partially untranslated?
  • Is there an associated sequence in UniProt? Have a look at the General identifiers for this transcript.

(d) Are there microarray (oligo) probes that can be used to monitor ENSGALT00010036169.1 expression?

(a) Go to the Ensembl homepage.

Select Search: Chicken and type MYH9. Click Go.

Click on either the Ensembl ID ENSGALG00010015031 or the gene name MYH9 for Chicken Breed: reference.

  • Chromosome 1 on the forward strand.
  • The transcript ENSGALT00010036169.1 codes for a protein of 1,960 amino acids.

(b) The Gene Ontology project (http://www.geneontology.org/) maps terms to a protein in three classes: biological process, cellular component, and molecular function. Meiotic spindle organisation, cell morphogenesis, and angiogenesis are some of the roles associated with MYH9.

(c) Click on ENSGALT00010036169.1

  • It has 41 exons. This is shown in the Transcript summary or in the left hand side menu Exons.
  • Click on the Exons link in this side menu. Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in orange). You can also see this in the cDNA view if you click on the cDNA link in the left side menu.
  • A0A1D5PM19 from UniProt matches the translation of the Ensembl transcript. Click on A0A1D5PM19 to go to UniProt, or click align for the alignment.

(d) Click on Oligo probes in the side menu.

There are probes from Affy and Agilent that can be used to monitor expression of this transcript.

Variation

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for MCM6 in chicken. Search for MCM6 and go to the Sequence view.

If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.

To view all the sequence variations in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in.

You can also filter by SIFT, or click on Filter other columns for filtering by other columns such as Evidence or Class.

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.

You can also see the phenotypes associated with a gene. Click on Phenotype in the left hand menu.

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar.

Click on Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, presence of a phenotype or by individual genome they were isolated from. You can also turn on QTLs, which cover a locus without being associated with a specific variant. Turn on the following variation tracks.

  • All variants on genotyping chips - short variants (SNPs and indels)
  • Phenotype annotations (QTLs)

Click on a variant to find out more information. It may be easier to see the individual variants if you zoom in.

Let’s have a look at a specific variant, which happens to fall within the MCM6 gene: rs14625781.

The easiest way to find this variant is if we put rs14625781 into the search box. Click through to open the Variation tab.

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link at the left.

This variant is found in three transcripts of the MCM6 gene, and is missense in two. SIFT predicts that it is unlikely to affect protein function of either (Tolerated).

Let’s look at population genetics. Either click on Explore this variant in the left hand menu then click on the Population genetics icon, or click on Population genetics in the left-hand menu.

We can see data from EVA study PRJEB44919 showing the frequency of the alleles and genotypes. We can see what animals these genotypes were actually observed in by going to Sample genotypes.

Click on Phylogenetic context to see the variant in other species.

We can see that other birds also have the C allele as the reference, whereas Anolis carolinensis has an A allele.

Exploring a SNP in chicken

(a) Find the page with information for the chicken SNP rs10731268.

(b) What gene(s) does rs10731268 fall within? What is its effect?

(c) Have any papers been written mentioning rs10731268? What are they about?

(d) What allele is at this position in other birds? What is the likely ancestral allele?

(a) Go to the Ensembl homepage.

Type rs10731268 in the Search box, then click Go. Click on rs10731268.

(b) Click on Genes and Regulation in the side menu (or the Genes and Regulation icon).

rs10731268 falls within 2 genes: ENSGALG00010028562 and ENSGALG00010028568 (HGNC: MLLT1). This variant has a missense consequence in seven transcripts of the ENSGALG00010028562 gene, and downstream gene variant consequence in three transcripts of the ENSGALG00010028568 (HGNC: MLLT1) gene.

(c) Click on Citations in the left hand side menu.

This variant is mentioned in the paper ‘Identification and characterization of genes that control fat deposition in chickens’ from 2013 by D’Andre et al. Click on the PubMed ID 24206759 to go to the paper.

(d) Click on Phylogenetic Context in the side menu. Select Alignment: 17 sauropsids EPO and click Go.

Japanese quail, Duck, Golden Eagle, Common canary and Zebra finch all have an A in this position. This suggests that A may be the ancestral allele.

Exploring a variant in pig

The human gene MC4R has been associated with obesity. The SNP rs81219178 has been identified as a variant in the pig MC4R gene.

(a) What is the amino acid change caused by rs81219178 in MC4R of the pig? Is the change likely to alter the protein function?

(b) How many transcripts does this variant affect? What are the consequences of this variant?

(a) Go to the Ensembl homepage.

Type rs81219178 in the Search box, then click Go.

Click on rs81219178 (Pig Variant, Breed: reference).

Click on Genes and regulation in the left-hand menu or on the icon.

The variant causes a D->N amino acid change (Aspartic acid -> Asparagine). The SIFT score of 0.01 predicts that this change will have a deleterious effect on the protein.

(b) This variant affects one transcript (ENSSSCT00000091644.1) of ENSSSCG00000051798 gene and it has the missense consequence.
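The reasoning in answer (a) can be written down as a small worked example. This sketch assumes the conventional SIFT cut-off, where scores below 0.05 are called deleterious and all others tolerated; the names sift_call and describe_change are our own and are not part of any Ensembl tool.

```python
# Standard one-letter amino acid codes.
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}

# Assumed cut-off: scores below this value are called deleterious.
SIFT_CUTOFF = 0.05


def sift_call(score: float) -> str:
    """Turn a SIFT score (0.0 to 1.0) into a qualitative prediction."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0.0 to 1.0")
    return "deleterious" if score < SIFT_CUTOFF else "tolerated"


def describe_change(change: str) -> str:
    """Expand a one-letter change such as 'D->N' into full amino acid names."""
    ref, alt = (code.strip() for code in change.split("->"))
    return f"{AMINO_ACIDS[ref]} -> {AMINO_ACIDS[alt]}"


print(describe_change("D->N"), sift_call(0.01))
```

The prediction shown on the website remains the reference; the cut-off here only illustrates how a score maps onto the deleterious and tolerated labels.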

VEP

We have identified seven variants in pig:
rs319195925, rs80805426, rs81267388, rs80854621, rs711163915, rs321793337, rs792403417

We will use the Ensembl VEP to determine:

  • If the variants have been annotated in Ensembl already
  • If genes are affected by the variants

Go to the front page of Ensembl and click on Variant Effect Predictor in the Tools section or click on VEP in the top header.

This page contains information about the VEP, including a link for downloading the script version of the tool. Click on the Launch VEP button to open the input form.

Let’s input the variant data in VCF format, with the columns in this order:
Chromosome Position Name Reference Alternative

Put the following into the Input data box:

9 9580742 rs319195925 T C
1 213701082 rs80805426 C A
15 83361856 rs81267388 A G
1 159538854 rs80854621 A G
14 50574184 rs711163915 C A
14 50571223 rs321793337 G A
14 50571474 rs792403417 C T

The VEP will detect automatically that the data is in VCF format.
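Before submitting a longer list, it can be worth checking that every line has the five columns described above. The following is a rough sketch using the seven variants from this demo. The helper name parse_line is hypothetical, and a full VCF file would carry additional columns that this check does not cover.

```python
from collections import Counter

VEP_INPUT = """\
9 9580742 rs319195925 T C
1 213701082 rs80805426 C A
15 83361856 rs81267388 A G
1 159538854 rs80854621 A G
14 50574184 rs711163915 C A
14 50571223 rs321793337 G A
14 50571474 rs792403417 C T
"""

COLUMNS = ("chrom", "pos", "name", "ref", "alt")
BASES = set("ACGT")


def parse_line(line: str) -> dict:
    """Check one whitespace-separated line and return it as a record."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns: {line!r}")
    record = dict(zip(COLUMNS, fields))
    record["pos"] = int(record["pos"])  # raises ValueError if not a number
    for allele in (record["ref"], record["alt"]):
        if not set(allele) <= BASES:
            raise ValueError(f"unexpected allele {allele!r} in {line!r}")
    return record


records = [parse_line(line) for line in VEP_INPUT.splitlines() if line.strip()]
per_chromosome = Counter(record["chrom"] for record in records)
print(len(records), dict(per_chromosome))
```

A line that fails the check raises a ValueError naming the offending line, which is easier to fix before the job is queued than after.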

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotation, Predictions, Filtering options and Advanced options. Let’s open all menus and take a look.

Hover over the options to see definitions.

When you have selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done; you do not need to refresh the page. You can save, edit, share or delete your job at this time. If you have submitted multiple jobs, they will all appear here.

Click on View Results once your job is done.

In your results you will see a graphical and table summary of the data as well as a table with the detailed results.

VEP for chicken data

We have identified a few variants associated with body size in chicken (bGalGal1.mat.broiler.GRCg7b):

chr 6, genomic coordinate 23650222, alleles A/C, forward strand
chr 6, genomic coordinate 23645685, alleles C/A, forward strand
chr 1, genomic coordinate 51237121, alleles C/T, forward strand

(a) Which genes and transcripts do these variants map to?

(b) What are the consequence terms for these variants?

(c) Which regulatory feature is affected by the variants?

Go to the Variant Effect Predictor (VEP) under Tools on the top banner of any Ensembl page.

Copy the following into the Paste data text box:

6 23650222 23650222 A/C + var1
6 23645685 23645685 C/A + var2
1 51237121 51237121 C/T + var3

Note that this is the Ensembl default format (chr start end reference/alternate alleles). For additional formats accepted by VEP, have a look here: http://www.ensembl.org/info/docs/tools/vep/vep_formats.html
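To see how the Ensembl default format relates to the five-column VCF-style input used in the pig demo, here is a small conversion sketch. It is deliberately limited to single-nucleotide variants on the forward strand, where the two formats map onto each other directly. The function names are illustrative, and insertions and deletions would need extra handling that is not shown.

```python
def parse_ensembl_default(line: str) -> dict:
    """Read one line of: chr start end ref/alt strand [identifier]."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns: {line!r}")
    chrom, start, end, alleles, strand = fields[:5]
    name = fields[5] if len(fields) > 5 else "."
    ref, alt = alleles.split("/", 1)
    return {
        "chrom": chrom, "start": int(start), "end": int(end),
        "ref": ref, "alt": alt, "strand": strand, "name": name,
    }


def to_vcf_like(record: dict) -> str:
    """Return 'chrom pos name ref alt' for a forward-strand SNV."""
    is_snv = (
        record["start"] == record["end"]
        and len(record["ref"]) == 1
        and len(record["alt"]) == 1
        and "-" not in (record["ref"], record["alt"])
    )
    if not is_snv or record["strand"] != "+":
        raise ValueError("this sketch only converts forward-strand SNVs")
    return " ".join(
        [record["chrom"], str(record["start"]), record["name"],
         record["ref"], record["alt"]]
    )


print(to_vcf_like(parse_ensembl_default("6 23650222 23650222 A/C + var1")))
```

Anything the sketch cannot convert raises a ValueError rather than guessing at coordinates.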

Click Run.

(a) In the Results table, you’ll see that the variants fall into three genes.

(b) The consequence terms are listed in the Consequence column and Consequences (all) chart and include intron_variant, regulatory_region_variant, upstream_gene_variant and downstream_gene_variant.

(c) Variant 3 at 1:51237121-51237121 with T allele affects regulatory feature ENSR00000006264 (promoter).

Comparative Genomics

Let’s look at the homologues of the pig BRCA2 gene. Search for the gene and go to the Gene tab.

Click on Gene tree to display the current gene in the context of a phylogenetic tree used to determine orthologues and paralogues.

You can change the gene tree display by using the View options below the image, the Configure this page menu as well as menus for individual nodes, which you can open by clicking on the nodes. Grey funnels indicate collapsed nodes. You can expand them by clicking on the node and selecting expand this sub-tree from the pop-up menu.

You can download the tree in a variety of formats. Click on the download icon in the bar at the top of the image to get a pop-up window where you can choose your format.

You can look at homologues in the Orthologues and Paralogues pages, which can be accessed from the left-hand menu. If there are no orthologues or paralogues, then the option will be greyed out. Paralogues is greyed out for BRCA2 indicating that there are no paralogues.

Click on Orthologues to see the available orthologues.

Choose to see only Rodent and related species orthologues by selecting the box. The table below now only shows details of these orthologues. Let’s look at mouse (Mus musculus).

Links from the orthologue allow you to go to alignments of the orthologous proteins and cDNAs. Click on View Sequence Alignments, and then View Protein Alignment in the pop-up menu for the mouse orthologue.

Let’s look at some of the comparative genomics views in the Location tab. Go to the region 15:81873000-82000800 in pig, which contains the HoxD cluster which is involved in limb development and is highly conserved between species.

You can turn on conservation scores and constrained elements. Click on Configure this page, then Comparative genomics and turn on the tracks for Constrained elements for 16 pig breeds EPO-Extended and Conservation score for 16 pig breeds EPO-Extended. Save and close the menu.

You can now see the conservation scores in pale pink. These were used to determine the peaks indicated in the constrained elements track in dark pink. This track indicates regions of high conservation between species, considered to be “constrained” by evolution.

We can also look at individual species comparative genomics tracks in this view by clicking on Configure this page.

Select BLASTz/LASTz alignments from the left-hand menu to choose alignments between closely related species. Turn on the alignments for Human and Cow in Normal. Save and close the menu.

We can also look at the alignment between species or groups of species as text. Click on Alignments (text) in the left hand menu.

Select Cow from the alignments list then click Go.

You will see a list of the regions aligned, followed by the sequence alignment. Exons are shown in red.

To compare with both contigs visually, go to Region comparison.

To add species to this view, click on the blue Select species or regions button. Choose Cow from the list then close the menu.

You can configure this view for both species. Click on configure this page and look in the top left of the menu.

The drop down allows you to configure each species separately.

We can view large scale syntenic regions from our chromosome of interest. Click on Synteny in the left hand menu.

Orthologues and gene trees for the chicken BRAF gene

(a) How many orthologues are predicted for the chicken BRAF in sauropsida? How much sequence identity does the Anole lizard (Anolis carolinensis) protein have to the chicken one? Click on the Alignment link next to the Ensembl identifier column to view a protein alignment in Clustal format.

(b) Go to the orthologue in human. Is there a genomic alignment between human and chicken? Is there a gene for both species in this region?

(a) Go to the Ensembl homepage, choose chicken from the drop-down list and search for BRAF. Click through to the Gene tab view.

On the Gene tab, click on Orthologues at the left side of the page to see all the orthologous genes.

There are 1:1 orthologues in 25 sauropsida and 1:many in two.

The percentage of identical amino acids in the Anole lizard protein (the orthologue) compared with the gene of interest, i.e. chicken BRAF (the target species/gene), is 81.22%. This is known as the Target %ID. The identity of the gene of interest (chicken BRAF) when compared with the orthologue (Anole lizard BRAF, the query species/gene) is 91.42% (the query %ID).
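The two percentages differ because the same number of identical positions is divided by the length of a different protein each time. The toy example below illustrates the arithmetic with two short made-up aligned sequences. It is not the procedure Ensembl uses to produce the values in the table, and percent_identity is a hypothetical helper.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> tuple:
    """Return identity relative to sequence A and relative to sequence B."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    length_a = len(aligned_a) - aligned_a.count(gap)
    length_b = len(aligned_b) - aligned_b.count(gap)
    if length_a == 0 or length_b == 0:
        raise ValueError("each sequence needs at least one residue")
    return (
        round(100 * identical / length_a, 2),
        round(100 * identical / length_b, 2),
    )


# Made-up sequences: A has 12 residues, B has 8, and 7 positions are identical.
print(percent_identity("MKTAYIAKQRLV", "MKTAHIAK----"))
```

Here the 7 identical positions are 58.33% of the 12-residue sequence but 87.5% of the 8-residue one, which is why a shorter protein tends to show the higher percentage.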

(b) Go to the orthologues page and click on the human orthologue to open the gene tab.

Click Genomic alignments at the left. Then select Alignment: Chicken (Gallus gallus) – lastz and click Go. Choose Block 1 to get the largest block of aligned sequence.

The red sequence is present in exons, so there is a gene in both species in this region. You can find where the start and stop codons are located if you configure this page and select START/STOP codons.

(Note: To see a blue line connecting homologous genes in the Region Comparison view page, click on configure this page and under Comparative features select join genes. Zoom out on the location view to see blue lines connecting all the homologous genes between human and chicken genes in that region).

Whole genome alignments

Go to www.ensembl.org to find the DBH gene on the reference pig genome (Sscrofa11.1).

(a) Go to the Location page for this gene. View the Alignments (image) and Alignments (text) for the 16 pig breeds EPO-Extended. Do all the species show a gene in these alignments?

(b) Export the alignments (as Clustal).

(c) Click on the Region in detail link at the left and turn on the tracks for multiple alignments and conservation score for the 16 pig breeds EPO-Extended by configuring the page.

Turn on the Constrained elements for the 16 pig breeds EPO-extended and compare the two tracks: are there any differences?

What is the difference between the multiple alignment track and the Constrained elements already turned on by default? Which regions of the gene do most of the constrained element blocks match up to? How can you find more information on how the constrained elements track was generated?

(a) Start in the Location tab for DBH (ENSSSCG00000005742). Click on Alignments (Image) at the left.

Click on Select alignment -> Multiple alignments -> 16 pig breeds EPO-Extended then close the menu.

All 12 pig breeds as well as cow, horse and sheep have an alignment in this region. This can also be seen in the Alignments (text) page, where the exons are highlighted in red.

(b) You can export the alignments from either the Alignments (image) or Alignments (text) pages. Click on the blue Download alignments button at the left, and choose Clustal from the list.

(c) Click on Region in detail in the left hand menu. Turn on the following tracks by configuring the page: Multiple alignments and Conservation score for 16 pig breeds EPO-extended. Both tracks are under the Comparative genomics menu.

The 16 pig breeds EPO-extended track just shows that the whole region for the DBH gene can be aligned among the pig breeds and related species. The Constrained elements and Conservation score tracks show where the conserved sequence is located in the alignment.

Higher conservation regions match up with exonic regions (exons tend to be highly conserved) of the gene. Note that there are intronic regions that seem to be fairly conserved across the species available.

Click on the track name and the (i) information button to read more about constrained elements (or any other data track).

BioMart

Follow these instructions to guide you through BioMart to answer the following query:

You have three questions about a set of chicken genes:
ESPN, MYH9, USH1C, CISD2, THRB, WHRN
(these are HGNC gene symbols. More details on the HUGO Gene Nomenclature Committee can be found on https://www.genenames.org/)

  1. What are the NCBI Gene IDs for these genes?
  2. Are there associated functions from the GO (gene ontology) project that might help describe their function?
  3. What are their cDNA sequences?

Click on BioMart in the top header of the Ensembl website or go to BioMart directly by visiting https://www.ensembl.org/biomart/martview.

You cannot choose any filters or attributes until you’ve chosen your dataset. Your dataset is the data type you’re working with. In this case we’re going to choose genes, so pick Ensembl Genes then Chicken genes from the drop-downs.

Now that you’ve chosen your dataset, the filters and attributes will appear in the column on the left. You can pick these in any order and the options you pick will appear.

Click on Filters on the left to see the available filters appear on the main page. You’ll see that there are loads of categories of Filters to choose from. You can expand these by clicking on them. For our query, we’re going to expand GENE.

Our input data is a list of identifiers, so we’re going to use the Input external references ID list filter. This allows us to input a list of identifiers from different databases. We need to choose what kind of identifier we’re using, so that BioMart can look up the right column in a data table. You can pick these from a drop-down list, which lists the type of identifier with an example of how it looks. For our query, we have a list of gene names, so we need to pick Gene Name(s).

To check if the filters have worked, you can use the Count button at the top left, which will show you how many genes have passed the filter. If you get 0 or another number you don’t expect, this can help you to see if your query was effective.

To choose the attributes, expand Attributes in the menu. There are six categories for chicken gene attributes. These categories are mutually exclusive: you cannot pick attributes from more than one category in the same query. This means that we need to run two separate queries, one to get our GO terms and NCBI IDs and one to get our cDNA sequences.

The Ensembl gene and transcript IDs, with and without version numbers, are selected by default. The selected attributes are also listed on the left.

We can choose the attributes we want by clicking on them. For our query, we’re going to select:

  • GENE
    • Gene Name
  • EXTERNAL
    • NCBI gene ID
    • GO term accession
    • GO term name
    • GO term definition

We need to select the Gene Name in order to get back our original input, as this is not returned by default in BioMart. The order in which you select the attributes defines the order in which the columns appear in your output table.

You can get your results by clicking on Results at the top left.

The results table just gives you a preview of the first ten lines of your query. This allows the results to load quickly, so that if you need to make any changes to your query, you don’t waste any time. To see the full table you can click on View ## rows. You can also export the data to an xls, tsv, csv or html file. For large queries, it is recommended that you export your data as Compressed web file (notify by email), to ensure your download is not disrupted by connection issues.
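The query built so far can also be written as BioMart XML and sent to the martservice endpoint, which is useful for scripting. The following is a minimal sketch rather than an official client. The internal dataset, filter and attribute names used here (ggallus_gene_ensembl, external_gene_name, entrezgene_id, go_id, name_1006, definition_1006) are assumptions and should be checked against the XML button in the BioMart toolbar. The helper build_query is a hypothetical name introduced for illustration. The code only builds the query and its URL; it does not contact the server.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# BioMart service endpoint on the main Ensembl site.
MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Return BioMart query XML for one dataset (hypothetical helper)."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="TSV" header="1" '
        'uniqueRows="1" datasetConfigVersion="0.6">',
        f'  <Dataset name={quoteattr(dataset)} interface="default">',
    ]
    # Filters restrict the input; list values are comma separated.
    for name, value in filters.items():
        lines.append(f"    <Filter name={quoteattr(name)} value={quoteattr(value)}/>")
    # Attributes are the output columns, in the order given.
    for name in attributes:
        lines.append(f"    <Attribute name={quoteattr(name)}/>")
    lines += ["  </Dataset>", "</Query>"]
    return "\n".join(lines)


genes = ["ESPN", "MYH9", "USH1C", "CISD2", "THRB", "WHRN"]
query_xml = build_query(
    dataset="ggallus_gene_ensembl",  # assumed internal name for Chicken genes
    filters={"external_gene_name": ",".join(genes)},  # assumed name for Gene Name(s)
    attributes=[
        "external_gene_name",  # Gene name (assumed)
        "entrezgene_id",       # NCBI gene ID (assumed)
        "go_id",               # GO term accession (assumed)
        "name_1006",           # GO term name (assumed)
        "definition_1006",     # GO term definition (assumed)
    ],
)
# URL that could be fetched with any HTTP client to obtain a TSV table.
query_url = MARTSERVICE + "?" + urlencode({"query": query_xml})
print(query_xml)
```

Fetching query_url, for example with urllib.request.urlopen, should return the same table as the Results page, provided the assumed names are valid for the current release.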

You can see multiple rows per gene in your input list, because there are multiple transcripts per gene and multiple GO terms per transcript.
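If you would rather have one entry per gene, the exported table can be collapsed after download. The sketch below assumes a TSV export whose column headers are Gene name, GO term accession and GO term name; the rows shown are made-up placeholders, not real BioMart output, and go_terms_by_gene is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows standing in for a BioMart TSV export (not real data).
example_tsv = (
    "Gene name\tGO term accession\tGO term name\n"
    "GENE_A\tGO:9999901\tplaceholder term 1\n"
    "GENE_A\tGO:9999902\tplaceholder term 2\n"
    "GENE_A\tGO:9999901\tplaceholder term 1\n"  # repeated via a second transcript
    "GENE_B\tGO:9999903\tplaceholder term 3\n"
    "GENE_C\t\t\n"                              # gene with no GO annotation
)


def go_terms_by_gene(tsv_text):
    """Group (accession, name) pairs by gene name, dropping duplicate rows."""
    terms = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        bucket = terms[row["Gene name"]]  # creates an empty set for new genes
        if row["GO term accession"]:
            bucket.add((row["GO term accession"], row["GO term name"]))
    return dict(terms)


grouped = go_terms_by_gene(example_tsv)
for gene, go_terms in sorted(grouped.items()):
    print(gene, len(go_terms), "distinct GO terms")
```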

To get the cDNA sequences, go back to the Attributes then select the category Sequences and expand SEQUENCES.

When you select the sequence type, the part of the transcript model you’ve chosen will be highlighted in the graphic.

Choose cDNA sequences, then expand HEADER INFORMATION to add Gene Name to the header. Then hit Results again.
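The sequence results are returned in FASTA format, with the selected header attributes separated by the | character. A minimal sketch for reading such a file is shown below. It assumes the header fields appear in the order gene ID, transcript ID, gene name; the record shown uses placeholder identifiers and a made-up sequence, and parse_fasta is a hypothetical helper name.

```python
def parse_fasta(text, header_fields):
    """Parse FASTA text into a list of dicts, one per sequence record."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: split the pipe-separated attributes.
            values = line[1:].split("|")
            current = dict(zip(header_fields, values))
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            # Sequence line: sequences may wrap over several lines.
            current["sequence"] += line
    return records


# Placeholder FASTA text (identifiers and sequences are not real).
example_fasta = """>GENE_ID_1|TRANSCRIPT_ID_1|GENE_A
ATGGCC
TTGA
>GENE_ID_2|TRANSCRIPT_ID_2|GENE_B
ATGTAA
"""

fields = ["gene_id", "transcript_id", "gene_name"]  # assumed header order
records = parse_fasta(example_fasta, fields)
for record in records:
    print(record["gene_name"], len(record["sequence"]))
```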

For more details on BioMart, have a look at this publication: Kinsella RJ, Kähäri A, Haider S, et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database: the Journal of Biological Databases and Curation. 2011; 2011:bar030. DOI: 10.1093/database/bar030. PMID: 21785142; PMCID: PMC3170168.

BioMart: Convert IDs

BioMart is a very handy tool when you want to convert IDs from different databases. The following is a list of 20 IDs of Sus scrofa proteins from the NCBI RefSeq database: NP_001116455,NP_001231885,NP_001230616,NP_001231413,NP_001231746,NP_999129,NP_001231602,NP_001177096,NP_001231419,NP_001230512,NP_001231165,NP_001167636,NP_001172069,NP_001011509,NP_999191,NP_001231786,NP_001231468,NP_001121951,NP_001230557,NP_999413

Generate a list that shows to which Ensembl Gene IDs and to which gene names these RefSeq IDs correspond. Do these 20 proteins correspond to 20 genes?

Click New. Choose the ENSEMBL Genes database. Choose the Pig genes (Sscrofa11.1) dataset.

Click on Filters in the left panel. Expand the GENE section by clicking on the + box. Select Input external references ID list - RefSeq peptide ID(s) and enter the list of IDs in the text box (either comma separated or as a list). HINT: You may have to scroll down the menu to see these. Count shows 20 genes.

Click on Attributes in the left panel. Select the Features attributes page. Expand the GENE tab by clicking on the + box. Select Gene name. Expand the EXTERNAL tab. Select RefSeq Peptide ID.

Click the Results button on the toolbar. Select View All rows as HTML or export all results to a file.
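To check whether every protein ID corresponds to exactly one gene, you can count the distinct values in each column of the exported table. The sketch below assumes the rows have been reduced to (Ensembl gene ID, RefSeq peptide ID) pairs; the identifiers are placeholders, not real results, and summarise_mapping is a hypothetical helper name.

```python
from collections import defaultdict


def summarise_mapping(rows):
    """Count distinct proteins and genes and test for a one-to-one mapping."""
    genes_by_protein = defaultdict(set)
    proteins_by_gene = defaultdict(set)
    for gene_id, protein_id in rows:
        if not protein_id:
            continue  # skip rows without a RefSeq peptide ID
        genes_by_protein[protein_id].add(gene_id)
        proteins_by_gene[gene_id].add(protein_id)
    one_to_one = all(len(g) == 1 for g in genes_by_protein.values()) and all(
        len(p) == 1 for p in proteins_by_gene.values()
    )
    return {
        "proteins": len(genes_by_protein),
        "genes": len(proteins_by_gene),
        "one_to_one": one_to_one,
    }


# Placeholder rows (not real BioMart output).
example_rows = [
    ("GENE_ID_1", "PROTEIN_ID_1"),
    ("GENE_ID_2", "PROTEIN_ID_2"),
    ("GENE_ID_2", "PROTEIN_ID_3"),  # two proteins from the same gene
    ("GENE_ID_1", "PROTEIN_ID_1"),  # repeated row, e.g. from a second transcript
]

summary = summarise_mapping(example_rows)
print(summary)
```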

BioMart: Finding genes by protein domain

Find chicken proteins with transmembrane domains located on chromosome 9.

As with all BioMart queries you must select the dataset, set your filters (input) and define your attributes (desired output). For this exercise:
Dataset: Ensembl genes in chicken
Filters: Transmembrane proteins on chromosome 9
Attributes: Ensembl gene and transcript IDs and Associated gene names

Go to the Ensembl homepage (https://www.ensembl.org) and click on BioMart at the top of the page. Select Ensembl genes as your database and Chicken genes (bGalGal1.mat.broiler.GRCg7b) as the dataset. Click on Filters on the left of the screen and expand REGION. Change the chromosome to 9. Now expand PROTEIN DOMAINS AND FAMILIES, also under Filters, and select Limit to genes …, choosing With Transmembrane helices from the drop-down and selecting Only. Clicking on Count should reveal that you have filtered the dataset down to 143 genes.

Click on Attributes. Under Features expand GENE. Select Gene name.

Now click on Results. The first 10 results are displayed by default; display all results by selecting All from the drop-down menu above the table.

The output will display the Ensembl gene ID, Ensembl Transcript ID and associated gene names of all proteins with a transmembrane domain on chicken chromosome 9. If you prefer, you can also export as an Excel sheet by using the Export all results to XLS option.
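In BioMart XML, a yes/no filter such as Limit to genes … Only is written with an excluded attribute instead of a value (excluded="0" for Only, excluded="1" for Excluded). The sketch below builds the query for this exercise with Python's xml.etree.ElementTree. The internal names ggallus_gene_ensembl, chromosome_name, with_tmhmm, ensembl_gene_id, ensembl_transcript_id and external_gene_name are assumptions; confirm them with the XML button in the BioMart toolbar before relying on them.

```python
import xml.etree.ElementTree as ET

query = ET.Element(
    "Query",
    {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    },
)
dataset = ET.SubElement(
    query,
    "Dataset",
    {"name": "ggallus_gene_ensembl", "interface": "default"},  # assumed name
)

# Value filter: restrict to chromosome 9 (filter name assumed).
ET.SubElement(dataset, "Filter", {"name": "chromosome_name", "value": "9"})
# Boolean filter: keep only genes with transmembrane helices (name assumed).
ET.SubElement(dataset, "Filter", {"name": "with_tmhmm", "excluded": "0"})

# Output columns (internal attribute names assumed).
for attribute in ("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"):
    ET.SubElement(dataset, "Attribute", {"name": attribute})

query_xml = (
    '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    + ET.tostring(query, encoding="unicode")
)
print(query_xml)
```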

BioMart: Find genes associated with array probes

Here are two Affymetrix probeset IDs from my microarray experiment that seem to map uniquely to genes in the chicken genome: Gga.12669.1.S1_at, GgaAffx.7784.1.S1_at

(a) For the genes corresponding to these probesets, retrieve the Ensembl Gene and Transcript IDs as well as their gene symbols and descriptions.

(b) In order to analyse these genes for possible promoter/enhancer elements, retrieve the 2000 bp upstream of the transcripts of these genes.

(c) In order to be able to study these chicken genes in duck, identify their duck orthologues. Also retrieve the genomic coordinates of these orthologues.

(a) Click New. Choose the Ensembl Genes database. Choose the Chicken genes dataset.

Click on Filters in the left panel. Expand the GENE section by clicking on the + box. Select Input microarray probes/probesets ID list - AFFY Chicken probe ID(s) and enter the list of probeset IDs in the text box (either comma separated or as a list).

Count shows three genes match this list of probesets.

Click on Attributes in the left panel. Select the Features attributes page. Expand the GENE section by clicking on the + box. In addition to the default selected attributes, select Gene name and Gene description. Expand the EXTERNAL section by clicking on the + box. Select AFFY Chicken probe from the Microarray probes/probesets section.

Click the Results button on the toolbar. Select View All rows as HTML or export all results to a file. Tick the box Unique results only.

Your results should show that the 2 probes map to 2 Ensembl genes.

(b) Don’t change Dataset and Filters – simply click on Attributes.

Select the Sequences category. Expand the SEQUENCES tab by clicking on the + box. Select Flank (Transcript) and enter 2000 in the Upstream flank text box. Expand the HEADER INFORMATION tab by clicking on the + box. Select Gene description and Gene name in addition to the default selected attributes.

Note: Flank (Transcript) will give the flanks for all transcripts of a gene with multiple transcripts. Flank (Gene) will give the flanks for one possible transcript in a gene (the most 5’ coordinates for upstream flanking).
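Upstream is defined relative to the strand of the feature: for a forward-strand transcript the flank lies before the start coordinate, while for a reverse-strand transcript it lies after the end coordinate. The sketch below illustrates this arithmetic with 1-based inclusive coordinates. It is an illustration of the idea only, not BioMart's own implementation; upstream_flank is a hypothetical helper, the coordinates are made up, and the flank is not clipped at the end of the chromosome.

```python
def upstream_flank(start, end, strand, length=2000):
    """Return (flank_start, flank_end) upstream of a feature, or None if empty.

    Coordinates are 1-based and inclusive; strand is 1 or -1.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if strand == 1:
        # Forward strand: upstream is towards smaller coordinates.
        flank_start = max(1, start - length)
        flank_end = start - 1
    elif strand == -1:
        # Reverse strand: upstream is towards larger coordinates.
        flank_start = end + 1
        flank_end = end + length
    else:
        raise ValueError("strand must be 1 or -1")
    if flank_end < flank_start:
        return None  # feature starts at position 1, so no upstream sequence
    return flank_start, flank_end


# Made-up transcript coordinates for illustration.
print(upstream_flank(5000, 9000, 1))
print(upstream_flank(5000, 9000, -1))
```

On the reverse strand the retrieved sequence would also need to be reverse-complemented so that it reads 5’ to 3’ relative to the transcript.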

Click the Results button on the toolbar.

(c) You can leave the Dataset and Filters the same, and go directly to the Attributes section:

Click on Attributes in the left panel. Select the Homologues category. Expand the GENE tab by clicking on the + box. Select Gene name. Unselect Transcript stable ID and Transcript stable ID version. Expand the ORTHOLOGUES [A-E] tab by clicking on the + box. Select Duck gene stable ID, Duck chromosome/scaffold name, Duck chromosome/scaffold start (bp) and Duck chromosome/scaffold end (bp).

Click the Results button on the toolbar. Select View All rows as HTML or export all results to a file.

Your results should show that for each chicken gene, one duck orthologue has been identified.