The Ensembl Genome Browser - University College London (UCL)

Course Details

Lead Trainer
Louisse Paola Mirabueno
Event Date
2023-10-10
Location
  B29 – Public Cluster, Foster Court, University College London, UK
Description
Work with the Ensembl Outreach team to get to grips with the Ensembl browser.
Survey
 The Ensembl Genome Browser - University College London (UCL) Feedback Survey

Demos and exercises

Human genes and transcripts

You can find out lots of information about Ensembl genes and transcripts using the browser. If you’re already looking at a region view, you can click on any transcript and a pop-up menu will appear, allowing you to jump directly to that gene or transcript.

Alternatively, you can find a gene by searching for it. You can search for gene names or identifiers, and also phenotypes or functions that might be associated with the genes.

We’re going to look at the human UQCRQ gene. From ensembl.org, type UQCRQ into the search bar and click the Go button. You will get a list of hits with the human gene at the top.

When you search without specifying a species, or with an ID that is not restricted to a single species, the most popular species appear first: in this case, human, mouse and zebrafish. You can restrict your query to species or features of interest using the options on the left.

The gene tab

Click on the gene name or Ensembl ID. The Gene tab should open:

This page summarises the gene, including its location, name and equivalents in other databases. At the bottom of the page, a graphic shows a region view with the transcripts. We can see exons shown as blocks with introns as lines linking them together. Coding exons are filled, whereas non-coding exons are empty. We can also see the overlapping and neighbouring genes and other genomic features.

There are different tabs for different types of features, such as genes, transcripts or variants. These appear side-by-side across the blue bar, allowing you to jump back and forth between features of interest. Each tab has its own navigation column down the left hand side of the page, listing all the things you can see for this feature.

Let’s walk through this menu for the gene tab. How can we view the genomic sequence? Click Sequence at the left of the page.

The sequence is shown in FASTA format. The FASTA header contains the genome assembly, chromosome, coordinates and strand (1 or -1) – this gene is on the positive strand.
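
If you download the sequence and want to read the header programmatically, a minimal sketch in Python could look like the following. It assumes the header carries a colon-separated location token of the form coord_system:assembly:region:start:end:strand. The example header and its coordinates are made up for illustration and are not the real UQCRQ values.

```python
from typing import NamedTuple


class Location(NamedTuple):
    coord_system: str
    assembly: str
    region: str
    start: int
    end: int
    strand: int


def parse_location(header: str) -> Location:
    """Find and parse the six-field, colon-separated location token in a FASTA header."""
    for token in reversed(header.lstrip(">").split()):
        fields = token.split(":")
        if len(fields) == 6:
            coord_system, assembly, region, start, end, strand = fields
            return Location(coord_system, assembly, region,
                            int(start), int(end), int(strand))
    raise ValueError(f"no location token found in header: {header!r}")


# Hypothetical header: name and coordinates are placeholders, not real data.
example = ">demo_gene dna:chromosome chromosome:GRCh38:5:1000:1599:1"
loc = parse_location(example)
strand_name = "forward" if loc.strand == 1 else "reverse"
print(loc.assembly, loc.region, loc.start, loc.end, strand_name)
```

A strand value of 1 is reported as forward (positive) and -1 as reverse, matching the description above.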

Exons are highlighted within the genomic sequence, both exons of our gene of interest and any neighbouring or overlapping gene. By default, 600 bases are shown up and downstream of the gene. We can make changes to how this sequence appears with the blue Configure this page button found at the left. This allows us to change the flanking regions, add variants, add line numbering and more. Click on it now.

Once you have selected changes (in this example, Show variants, 1000 Genomes variants and Line numbering) click at the top right.

You can download this sequence by clicking on the Download sequence button above the sequence:

This will open a dialogue box that allows you to pick between plain FASTA sequence, or sequence in RTF, which includes all the coloured annotations and can be opened in a word processor. If you want to run a sequence analysis tool, download the sequence as FASTA; if you want to analyse the sequence visually, RTF is the better choice. This button is available for all sequence views.

To find out what the protein does, have a look at GO terms from the Gene Ontology consortium. There are three pages of GO terms, representing the three divisions in GO: Biological process (what the protein does), Cellular component (where the protein is) and Molecular function (how it does it). Click on GO: Biological process to see an example of the GO pages.

Here you can see the functions that have been associated with the gene. There are three-letter codes that indicate how the association was made, as well as links to the specific transcript they are linked to.
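
As a quick reference for the evidence codes, here is a small Python lookup of some of the three-letter codes defined by the Gene Ontology consortium. It is only a partial list for illustration; the function name describe_evidence is made up for this sketch.

```python
# A partial list of GO evidence codes and their meanings.
GO_EVIDENCE = {
    "EXP": "Inferred from Experiment",
    "IDA": "Inferred from Direct Assay",
    "IPI": "Inferred from Physical Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IEP": "Inferred from Expression Pattern",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IEA": "Inferred from Electronic Annotation",
}


def describe_evidence(code: str) -> str:
    """Return the meaning of a GO evidence code, ignoring case."""
    return GO_EVIDENCE.get(code.upper(), "unknown evidence code")


for code in ("IDA", "IEA", "TAS"):
    print(code, "=", describe_evidence(code))
```

Codes such as IDA and IMP come from experiments, whereas IEA marks an automatic annotation that no curator has reviewed, which is worth bearing in mind when weighing up an association.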

We also have links out to other databases which have information about our genes and may focus on other topics that we don’t cover, like Gene Expression Atlas or OMIM. Go up the left-hand menu to External references:

Demo: The transcript tab

We’re now going to explore the different transcripts of UQCRQ. Click on Show transcript table at the top.

Here we can see a list of all the transcripts of UQCRQ with their identifiers, lengths, biotypes and flags to help you decide which ones to look at.

If we were to choose only one transcript to analyse, we would choose UQCRQ-203 because it is the MANE Select and Ensembl Canonical. This means that it is 100% identical to the RefSeq transcript NM_014402.5, and that both Ensembl and NCBI agree that it is the most biologically important transcript.
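
The same choice can be sketched in code. The Python example below picks one transcript from a list, preferring a canonical flag and falling back to the longest protein-coding transcript. The records, the field names (name, biotype, length, is_canonical) and the fallback rule are all assumptions made for illustration; they are not an Ensembl schema or Ensembl's own selection method.

```python
def choose_transcript(transcripts):
    """Pick one transcript: flagged canonical first, else longest protein-coding."""
    canonical = [t for t in transcripts if t.get("is_canonical")]
    if canonical:
        return canonical[0]
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    pool = coding or transcripts  # fall back to everything if nothing is coding
    return max(pool, key=lambda t: t.get("length", 0))


# Hypothetical transcript table: names and lengths are placeholders.
transcripts = [
    {"name": "demo-201", "biotype": "protein_coding", "length": 2100, "is_canonical": 0},
    {"name": "demo-202", "biotype": "protein_coding", "length": 1500, "is_canonical": 1},
    {"name": "demo-203", "biotype": "retained_intron", "length": 3000, "is_canonical": 0},
]

print(choose_transcript(transcripts)["name"])
```

Note that the flagged transcript wins even though it is not the longest, which mirrors the point that length alone is not the best guide.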

Click on the ID, ENST00000378670.8.

You are now in the Transcript tab for UQCRQ-203. We can still see the gene tab so we can easily jump back. The left hand navigation column provides several options for the transcript UQCRQ-203 - many of these are similar to the options you see in the gene tab, but not all of them. If you can’t find the thing you’re looking for, often the solution is to switch tabs.

Click on the Exons link. This page is useful for designing RT-PCR primers because you can see the sequences of the different exons and their lengths.

You may want to change the display (for example, to show more flanking sequence, or to show full introns). In order to do so click on Configure this page and change the display options accordingly.

Now click on the cDNA link to see the spliced transcript sequence with the amino acid sequence. This page is useful for mapping between the RNA and protein sequences, particularly genetic variants.

UnTranslated Regions (UTRs) are highlighted in dark yellow, codons are highlighted in light yellow, and exon sequence is shown in black or blue letters to show exon divides. Sequence variants are represented by highlighted nucleotides and clickable IUPAC codes are above the sequence.

Next, follow the General identifiers link at the left. Just like the External References page in the gene tab, this page shows links out to other databases such as RefSeq, UniProtKB, PDBe and others, this time linked to the transcript or protein product, rather than the gene.

If you’re interested in protein domains, you could click on Protein summary to view domains from Pfam, PROSITE, Superfamily, InterPro, and more. These are all plotted against the transcript sequence, with the exons shown in alternating shades of purple at the top of the page. Alternatively, you can go to Domains & features to see a table of the same information.

You can also see the structure of the protein from the PDB by clicking on PDB 3D Protein model.

This uses LiteMol to show a 3D protein. You can use all the normal controls that you would use with LiteMol, and you can also plot Ensembl features such as exons and variants onto the structure using the options on the right. You can see the top ten PDB models for this protein, ranked by coverage and quality scores, and choose between them at the top of the viewer.

Exploring the MYH9 gene in human

  1. In Ensembl, find the human MYH9 (myosin, heavy chain 9, non-muscle) gene and open the Gene tab.
    • On which chromosome and which strand of the genome is this gene located?
    • How many transcripts (splice variants) are there and how many are protein coding?
    • What is the longest protein-coding transcript, and how long is the protein it encodes?
    • Which transcript would you take forward for further study?
  2. Click on Phenotypes at the left side of the page. Are there any diseases associated with this gene, according to Mendelian Inheritance in Man (MIM)?

  3. What are some functions of MYH9 according to the Gene Ontology (GO) consortium? Have a look at the GO: Biological process pages for this gene.

  4. In the transcript table, click on the transcript ID for MYH9-201, and go to the Transcript tab.
    • How many exons does it have?
    • Are any of the exons completely or partially untranslated?
    • Is there an associated sequence in UniProtKB/Swiss-Prot? Have a look at the General identifiers for this transcript.
  5. Are there microarray (oligo) probes that can be used to monitor ENST00000216181 expression?
  1. Select Human from the Species drop-down list and type MYH9. Click Go. Click on MYH9 (Human Gene) in the search results which will send you to the Gene tab.
    • The gene is located on chromosome 22 on the reverse strand.
    • Ensembl has 23 transcripts annotated for this gene, of which 6 are protein-coding.
    • The longest protein-coding transcript is MYH9-215 and it codes for a protein that is 1,981 amino acids long.
    • MYH9-201 is the transcript I would take forward for further study, as it is the MANE Select transcript (for a description, mouse-over the MANE Select flag in the transcript table).
  2. Click on Phenotypes in the left-hand panel to see the associated phenotypes. There is a large table of phenotypes. To see only the ones from MIM, type MIM into the filter box at the top right-hand corner of the table.

    These are some of the phenotypes associated with MYH9 according to MIM: Deafness, Autosomal dominant 17 and Macrothrombocytopenia and granulocyte inclusions with or without nephritis or sensorineural hearing loss. You can click on the records for more information.

  3. The Gene Ontology project maps terms to a protein in three classes: biological process, cellular component, and molecular function. Click on GO: Biological process on the left-hand panel. Angiogenesis, cell adhesion, and protein transport are some of the roles associated with MYH9. All GO terms are associated with a single transcript: ENST00000216181.

  4. Click on ENST00000216181.11 in the transcript table. You should now be on the Transcript tab.
    • It has 41 exons, shown in the Transcript summary.

    Click on the Exons link in the left-hand panel.

    • Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in orange). You can also see this in the cDNA view if you click on the cDNA link in the left side menu.

    Click on General identifiers in the left-hand panel.

    • P35579.247 from UniProt/Swiss-Prot matches the translation of the Ensembl transcript. Click on P35579.247 to go to UniProtKB, or click align for the alignment.
  5. Click on Oligo probes in the left-hand panel.

    Probesets from Affymetrix, Agilent, Codelink, Illumina, and Phalanx OneArray match to this transcript sequence. Expression analysis with any of these probesets would reveal information about the transcript. Hint: this information can sometimes be found in the ArrayExpress Atlas (https://www.ebi.ac.uk/biostudies/arrayexpress).

Finding a gene associated with a phenotype

Phenylketonuria is a genetic disorder caused by an inability to metabolise phenylalanine in any body tissue. This results in an accumulation of phenylalanine causing seizures and intellectual disability.

(a) Search for phenylketonuria from the Ensembl homepage and narrow down your search to only genes. What gene is associated with this disorder?

(b) How many protein coding transcripts does this gene have? View all of these in the transcript comparison view.

(c) What is the MIM gene identifier for this gene?

(d) Go to the MANE Select transcript and look at its 3D structure. In the model 2pah, how many protein molecules can you see?

(a) Start at the Ensembl homepage (http://www.ensembl.org).

Type phenylketonuria into the search box then click Go. Choose Gene from the left hand menu.

The gene associated with this disorder is PAH, phenylalanine hydroxylase, ENSG00000171759.

(b) If the transcript table is hidden, click on Show transcript table to see it.

There are six protein coding transcripts.

Click on Transcript comparison in the left hand menu. Click on Select transcripts. Either select all the transcripts labelled protein coding one-by-one, or click on the drop down and select Protein coding. Close the menu.

(c) Click on External references.

The MIM gene ID is 612349.

(d) Open the transcript table and click on the ID for the MANE Select: ENST00000553106.6. Go to PDB 3D protein model in the left-hand menu.

The model 2pah is shown by default. It has two protein molecules in it. You may need to rotate the model to see this clearly.

Variation

In any of the sequence views shown in the Gene and Transcript tabs, you can view variants on the sequence. You can do this by clicking on Configure this page from any of these views.

Let’s take a look at the Gene sequence view for HBB in human. Search for HBB and go to the Sequence view.

If you can’t see variants marked on this view, click on Configure this page and select Show variants: Yes and show links. You may also wish to add a filter to the variants so that they load more quickly. We’ll add Filter variants by evidence status: 1000Genomes.

Find out more about a variant by clicking on it.

You can add variants to all other sequence views in the same way.

You can go to the Variation tab by clicking on the variant ID. For now, we’ll explore more ways of finding variants.

To view all the sequence variants in table form, click the Variant table link at the left of the gene tab.

You can filter the table to only show the variants you’re interested in. For example, click on Consequences: All, then select the variant consequences you’re interested in. For display purposes, the table above has already been filtered to only show missense variants.

You can also filter by the different pathogenicity scores and MAF, or click on Filter other columns for filtering by other columns such as Evidence or Class.

The table contains lots of information about the variants. You can click on the IDs here to go to the Variation tab too.

You can also see the phenotypes associated with a gene. Click on Phenotype in the left hand menu.

Open the transcript table and go to HBB-201 ENST00000335295, then click on Haplotypes in the left hand menu.

The Haplotypes view in the transcript tab shows you the actual protein and CDS sequences in 1000 Genomes individuals. This is possible because the 1000 Genomes study has phased genotypes, so we know which alleles occur on which of the chromosome pairs. The table lists all the versions of the protein that occur along with their frequencies, including the reference sequence and sequences with one or more alternative alleles.
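
To see why phasing matters, here is a minimal Python sketch that rebuilds the two haplotype sequences of one individual from phased genotypes. The reference sequence and variants are invented for illustration, positions are 0-based, and the variants are assumed not to overlap. This is not how Ensembl computes the Haplotypes view, only an illustration of the idea.

```python
def apply_haplotypes(reference, variants):
    """Return the two haplotype sequences implied by phased genotypes.

    variants: list of (pos, ref, alt, genotype) tuples, where pos is 0-based
    and genotype is a phased string such as "0|1" (0 = reference, 1 = alternative).
    """
    haplotypes = []
    for copy in (0, 1):
        pieces = []
        cursor = 0
        for pos, ref, alt, genotype in sorted(variants):
            if reference[pos:pos + len(ref)] != ref:
                raise ValueError(f"reference mismatch at position {pos}")
            allele_index = int(genotype.split("|")[copy])
            pieces.append(reference[cursor:pos])
            pieces.append(alt if allele_index == 1 else ref)
            cursor = pos + len(ref)
        pieces.append(reference[cursor:])
        haplotypes.append("".join(pieces))
    return tuple(haplotypes)


# Hypothetical data: a short made-up sequence with two phased variants.
reference = "ATGGTGCAC"
variants = [(4, "T", "A", "0|1"), (7, "A", "G", "1|1")]
first_copy, second_copy = apply_haplotypes(reference, variants)
print(first_copy)
print(second_copy)
```

Without phasing, the genotype at the first variant would only say that the individual carries one T and one A, not which chromosome copy also carries the second variant.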

Click on one of the haplotypes to find out more about it; we’ll go for 18K>*,19del{130}. Here you will see the frequency in the 1000 Genomes subpopulations, the sequence and the 1000 Genomes individuals where this protein is found.

Let’s have a look at variants in the Location tab. Click on the Location tab in the top bar.

Configure this page and open Variation from the left-hand menu.

There are various options for turning on variants. You can turn on variants by source, by frequency, presence of a phenotype or by individual genome they were isolated from. You can also turn on genotyping chips.

Let’s have a look at a specific variant. If we zoomed in we could see the variant rs334 in this region, however it’s easier to find if we put rs334 into the search box. Click through to open the Variation tab.

The icons show you what information is available for this variant. Click on Genes and regulation, or follow the link on the left.

This page illustrates the genes the variant falls within and the consequences on those genes, including pathogenicity predictors. It also shows data from GTEx on genes that have increased/decreased expression in individuals with this variant, in different tissues. Finally, regulatory features and motifs that the variant falls within are shown.

We can also see the variant in the protein structure by clicking on 3D Protein model.

This is a LiteMol viewer, where you can rotate and zoom in on the structure. The variant location is highlighted, so you can see where it lands within the structure.

Let’s look at population genetics. Click on Population genetics in the left-hand menu.

The population allele frequencies are shown by study, including 1000 Genomes and gnomAD. Where genotype frequencies are available, these are shown in the tables.
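
The relationship between genotype and allele frequencies can be sketched in a few lines of Python. The genotype counts below are hypothetical and do not come from any real population; the genotype labels are written with a | separator as an assumption about how they are displayed.

```python
from collections import Counter


def genotype_frequencies(counts):
    """Convert genotype counts into genotype frequencies."""
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


def allele_frequencies(counts):
    """Count each allele once per chromosome copy and convert to frequencies."""
    alleles = Counter()
    for genotype, n in counts.items():
        for allele in genotype.split("|"):
            alleles[allele] += n
    total = sum(alleles.values())
    return {allele: c / total for allele, c in alleles.items()}


# Hypothetical genotype counts for 100 individuals.
counts = {"T|T": 50, "C|T": 40, "C|C": 10}
gf = genotype_frequencies(counts)
af = allele_frequencies(counts)
print("genotypes:", gf)
print("alleles:", af)
print("least frequent genotype:", min(gf, key=gf.get))
```

Each individual contributes two alleles, so 100 individuals give 200 allele observations in total.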

There are big differences in allele frequencies between populations. Let’s have a look at the phenotypes associated with this variant to see if they are known to be specific to certain human populations. Click on Phenotype Data in the left-hand menu.

This variant is associated with various phenotypes, including sickle cell and malaria resistance. These phenotype associations come from sources including the GWAS catalog, ClinVar, Orphanet and OMIM. Where available, there are links to the original paper that made the association, the allele that is associated with the phenotype and p-values and other statistics.

Human population genetics and phenotype data

The SNP rs1738074 in the 5’ UTR of the human TAGAP gene has been identified as a genetic risk factor for a few diseases. Use Ensembl to answer the following questions:

  1. In which transcripts is this SNP found?

  2. What is the least frequent genotype for this SNP in the Yoruba (YRI) population from the 1000 Genomes phase 3?

  3. What is the ancestral allele? Is it conserved in the 91 eutherian mammals EPO-Extended?

  4. With which diseases is this SNP associated? Are there any known risk (or associated) alleles?

  1. Please note there is more than one way to get this answer. Either go to the Variation table of the human TAGAP gene, and use the Consequence filter to only include 5’UTR variants, or search Ensembl for rs1738074 directly. Once you’re in the Variant tab, click on Genes and regulation in the menu.

    This SNP is found in four transcripts of TAGAP. It is also intronic to eleven non-coding transcripts of TAGAP-AS1 and one non-coding transcript of ENSG00000226032.

  2. Click on Population genetics in the left-hand panel, or click on Explore this variant in the left-hand panel and click the Population genetics icon.

    In Yoruba (YRI), the least frequent genotype is CC at the frequency of 5.6%.

  3. Click on Phylogenetic context in the left-hand panel.

    The ancestral allele is T and it’s inferred from the alignment in primates.

    Click on Select an alignment which will open a pop-up menu. Open Multiple alignments and select 91 eutherian mammals EPO-Extended. Click on Apply at the bottom of the menu to save your settings.

    A region containing the SNP (highlighted in red and placed in the centre) and its flanking sequence are displayed. The T allele is conserved in all but two of the eutherian mammals displayed.

  4. Click Phenotype data in the left-hand panel.

    This variation is associated with multiple sclerosis, celiac disease and white blood cell count. There are known risk alleles for all three diseases and the corresponding P values are provided. The allele A is associated with celiac disease. Note that the alleles reported by Ensembl are T/C. Ensembl reports alleles on the forward strand. This suggests that A was reported on the reverse strand in the original paper. Similarly, one of the alleles reported for multiple sclerosis is G.
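
The strand flip described in the last answer can be checked with a short Python sketch. The function name flip_strand is made up for this example; it complements each allele (and reverses it, which only matters for alleles longer than one base).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def flip_strand(alleles: str) -> str:
    """Reverse-complement each allele in a slash-separated allele string."""
    return "/".join(a.translate(COMPLEMENT)[::-1] for a in alleles.split("/"))


# Forward-strand alleles as reported by Ensembl for rs1738074.
print(flip_strand("T/C"))
```

The forward-strand alleles T/C become A/G on the reverse strand, which is consistent with a paper reporting A or G as the associated allele.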

Exploring a SNP in the human genome

The missense variation rs1801133 in the human MTHFR gene has been linked to elevated levels of homocysteine, an amino acid whose plasma concentration seems to be associated with the risk of cardiovascular diseases, neural tube defects, and loss of cognitive function.

  1. Find the page with information for rs1801133.

  2. Is rs1801133 a missense variant in all transcripts of the MTHFR gene? What is the amino acid change?

  3. Why are the alleles for this variation in Ensembl given as G/A and not as C/T, as in the literature?

  4. What is the major allele of rs1801133 in different populations?

  5. In which paper(s) is the association between rs1801133 and homocysteine levels described?

  6. According to the data imported from dbSNP, the ancestral allele for rs1801133 is G. Ancestral alleles in dbSNP are based on a comparison between human and chimp. Does the sequence at this same position in other primates confirm that the ancestral allele is G?

  1. Go to the Ensembl homepage. Type rs1801133 in the search box, then click Go. Click on rs1801133.

  2. Click on Genes and Regulation in the left-hand panel, or click on the Genes and Regulation icon at the top of the page.

    No, rs1801133 is a missense variant in eight MTHFR transcripts. Please note that this variant is multiallelic with two alternative alleles. As this table displays one consequence per row, each transcript is listed twice.

    The amino acid change is A/V for allele A, and A/G for allele C.

  3. In Ensembl, the alleles of rs1801133 are given as G/A/C because these are the alleles in the forward strand of the genome. In the literature, the alleles are given as C/T/G because the MTHFR gene is located on the reverse strand. The alleles in the actual gene and transcript sequences are C/T/G. In Ensembl, the allele that is present in the reference genome assembly is always put first.

  4. Click on Population genetics in the side menu.

    In all populations but one, the allele G is the major one. The exception is CLM (Colombian in Medellin; 1000 Genomes).

  5. Click on Phenotype Data in the left hand side menu.

    The specific studies where the association was originally described are given in the Phenotype Data table. Links between rs1801133 and homocysteine levels were described in four papers. Click on the PubMed IDs PMID:34707639, PMID:23696881, PMID:20031578 and PMID:23824729 for more details.

  6. Click on Phylogenetic Context in the side menu. Select Alignment: 10 primates EPO and click Apply.

    Gorilla, bonobo, Sumatran orangutan, chimp, macaque, gibbon, vervet, crab-eating macaque and mouse lemur all have a G in this position.

Exploring VNTR in human

Variable number tandem repeats (VNTRs) show high variation in the number of repeats in the population and are commonly used in forensics (DNA fingerprinting) and to study genetic diversity.

(a) Go to the region from 3074666 to 3075100 bp on human chromosome 4. Which gene does it overlap? Which exon of this gene falls in this region?

(b) Configure this page to turn on Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF) tracks in this view. Can you see any repeats in this exon? What tools were used to annotate the repeats according to the track information?

(c) Zoom in on the (CAG)n to see its sequence. How many CAG repeats can you see in the human reference assembly? Does this track overlap any phenotype-associated variants? What is the identifier of this variant?

(d) Go to the variant tab of the phenotype-associated variant. What is the consequence ontology of this variant? Does the reference allele match the number of repeats you have just counted? What is the shortest and longest allele?

(a) Select Search: Human and type 4:3074666-3075100 in the text box (or alternatively type human 4:3074666-3075100 in the text box). Click Go.

Click on the golden transcript falling in this region. You can see it’s exon 1 of 67 of the huntingtin gene (HTT).

(b) Click Configure this page in the side menu then select: Repeats (low), Simple repeats (Repeats (low)) and Tandem repeats (TRF).

There are three tandem repeats in this exon, and two simple repeats (low); (CAG)n and (CCG)n. Click on the track names to find more about the tools used for annotation: RepeatMasker and Tandem Repeats Finder.

(c) Draw with your mouse a box around the (CAG)n repeat. Click on Jump to region in the pop-up menu.

There are 19 CAG repeats in the human reference sequence overlapping rs71180116 indicated by a pink bar in the All phenotype-associated - short variants (SNPs and indels) track.

(d) Click on the rs71180116 ID to go to the variant tab. You can see in the summary page that this variant is classified as an inframe insertion. Either click + to show all of the alleles in the summary page or go to the Genes and regulation table. This variant has many alternative alleles which differ in the number of repeats. The first allele in the expanded Alleles section of the summary page or the first allele in the Codons column in the Genes and regulation table is the reference allele. It is composed of 19 CAG repeats just as in the Region in detail view. The shortest allele has 7 repeats, the longest has 55 repeats.
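
Counting repeats by eye is error-prone, so if you export the sequence you could count them with a short script instead. The Python sketch below finds the longest uninterrupted run of a repeat unit. The test sequence is synthetic, built to contain 19 CAG units, and is not the real HTT exon 1 sequence.

```python
import re


def longest_repeat_run(sequence: str, unit: str = "CAG") -> int:
    """Return the number of units in the longest uninterrupted run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Synthetic sequence: 19 CAG units flanked by arbitrary bases.
synthetic = "ATG" + "CAG" * 19 + "CAACAGCCGCCA"
print(longest_repeat_run(synthetic))
```

An interruption such as CAA ends the run, so the single CAG after it in the synthetic sequence is counted as a separate run of one.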

VEP

We have identified five variants on human chromosome nine: C->A at 128203516, an A deletion at 128328461, C->A at 128322349, C->G at 128323079 and G->A at 128322917.

We will use the Ensembl VEP to determine:

  • Have my variants already been annotated in Ensembl?
  • What genes are affected by my variants?
  • Do any of my variants affect gene regulation?

Go to the front page of Ensembl and click on the Variant Effect Predictor.

This page contains information about the VEP, including links to download the script version of the tool. Click on Launch VEP to open the input form:

The data is in VCF format:
chromosome coordinate id reference alternative

Put the following into the Paste data box:
9 128328460 var1 TA T
9 128322349 var2 C A
9 128323079 var3 C G
9 128322917 var4 G A
9 128203516 var5 C A

The VEP will automatically detect that the data is in VCF.
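
To make the five-column layout concrete, here is a minimal Python sketch that reads the same five lines and labels each variant by comparing the lengths of the reference and alternative alleles. The function name and the labels are assumptions for illustration and are not what the VEP itself uses to detect formats.

```python
VCF_LINES = """\
9 128328460 var1 TA T
9 128322349 var2 C A
9 128323079 var3 C G
9 128322917 var4 G A
9 128203516 var5 C A
""".splitlines()


def parse_minimal_vcf(line: str) -> dict:
    """Split a whitespace-separated VCF-style line and classify the variant."""
    chrom, pos, var_id, ref, alt = line.split()[:5]
    if len(ref) == 1 and len(alt) == 1:
        kind = "SNV"
    elif len(ref) > len(alt):
        kind = "deletion"
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "substitution"
    return {"chrom": chrom, "pos": int(pos), "id": var_id,
            "ref": ref, "alt": alt, "kind": kind}


records = [parse_minimal_vcf(line) for line in VCF_LINES]
for r in records:
    print(r["id"], r["chrom"], r["pos"], f"{r['ref']}>{r['alt']}", r["kind"])
```

For var1, the VCF position 128328460 is the base before the deleted A, which is why the deletion itself is described as being at 128328461.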

There are further options that you can choose for your output. These are categorised as Identifiers, Variants and frequency data, Additional annotations, Predictions, Filtering options and Advanced options. Let’s open all the menus and take a look.

Hover over the options to see definitions.

We’re going to select some options:

  • HGVS, annotation of variants in terms of the transcripts and proteins they affect, commonly-used by the clinical community
  • Phenotypes
  • Protein domains

When you’ve selected everything you need, scroll right to the bottom and click Run.

The display will show you the status of your job. It will say Queued, then automatically switch to Done when the job is done; you do not need to refresh the page. You can edit or discard your job at this time. If you have submitted multiple jobs, they will all appear here.

Click View results once your job is done.

In your results you will see a graphical summary of your data, as well as a table of your results.

The results table is enormous and detailed, so we’re going to go through it section by section. The first column is Uploaded variant. If your input data contains IDs, like ours does, the ID is listed here. If your input data is only loci, this column will contain the locus and alleles of the variant. You’ll notice that the variants are not necessarily in the same order as in your input. You’ll also see that there are multiple lines in the table for each variant, with each line representing one transcript or other feature the variant affects.

You can mouse over any column name to get a definition of what is shown.

The next few columns give the information about the feature the variant affects, including the consequence. Where the feature is a transcript, you will see the gene symbol and stable ID and the transcript stable ID and biotype. Where the feature is a regulatory feature, you will get the stable ID and type. For a transcription factor binding motif (labelled as a MotifFeature) you will see just the ID. Most of the IDs are links to take you to the gene, transcript or regulatory feature homepage.

This is followed by details of the effects on transcripts: the position of the variant in terms of the exon number, cDNA, CDS and protein; the amino acid and codon change; transcript flags, such as MANE, which can be used to choose a single transcript for variant reporting; and pathogenicity scores. The pathogenicity scores are shown as numbers with coloured highlights to indicate the prediction, and you can mouse over the scores to get the prediction in words. Two options that we selected in the input form are the HGVS notation, which is shown to the left of the image below and can be used for reporting, and the Domains to the right. The Domains column lists the protein domains found and, where one is available, provides a link to the 3D protein model, which will launch a LiteMol viewer highlighting the variant position.

Where the variant is known, the ID of the existing variant is listed, with a link out to the variant homepage. In this example, only rsIDs from dbSNP are shown, but sometimes you will see IDs from other sources such as COSMIC. The VEP also looks up the variant in the Ensembl database and pulls back the allele frequency (AF in the table), which will give you the 1000 Genomes Global Allele Frequency. In our query, we have not selected allele frequencies from the continental 1000 Genomes populations or from gnomAD, but these could also be shown here. We can also see ClinVar clinical significance and the phenotypes associated with known variants or with the genes affected by the variants, with the variant ID listed for variant associations and the gene ID listed for gene associations, along with the source of the association.

For variants that affect transcription factor binding motifs, there are columns that show the effect on motifs (you may need to click on Show/hide columns at the top left of the table to display these). Here you can see the position of the variant in the motif, if the change increases or decreases the binding affinity of the motif and the transcription factors that bind the motif.

Above the table is the Filter option, which allows you to filter by any column in the table. You can select a column from the drop-down, then a logic option from the next drop-down, then type in your filter to the following box. We’ll try a filter of Consequence, followed by is then missense_variant, which will give us only variants that change the amino acid sequence of the protein. You’ll notice that as you type missense_variant, the VEP will make suggestions for an autocomplete.

You can export your VEP results in various formats, including VCF. When you export as VCF, you’ll get all the VEP annotation listed under CSQ in the INFO column. After filtering your data, you’ll see that you have the option to export only the filtered data. You can also drop all the genes you’ve found into the Gene BioMart, or all the known variants into the Variation BioMart to export further information about them.
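
If you want to work with the exported VCF in a script, the CSQ entry can be unpacked as in the Python sketch below. Each annotation is a pipe-separated list of values, and multiple annotations are separated by commas. The field list and the INFO string here are hypothetical and much shorter than a real export; in a real file, the field order is given in the CSQ line of the VCF header.

```python
def parse_csq(info: str, csq_format: str) -> list:
    """Turn the CSQ entry of a VCF INFO string into a list of dictionaries."""
    fields = csq_format.split("|")
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            annotations = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, ann.split("|"))) for ann in annotations]
    return []


# Hypothetical field order and INFO string, for illustration only.
csq_format = "Allele|Consequence|SYMBOL|Feature"
info = ("DP=30;CSQ=A|missense_variant|GENE1|TX1,"
        "A|intron_variant&non_coding_transcript_variant|GENE1|TX2")

annotations = parse_csq(info, csq_format)
# A consequence value can hold several terms joined by "&".
missense = [a["Feature"] for a in annotations
            if "missense_variant" in a["Consequence"].split("&")]
print(missense)
```

This reproduces in code the same Consequence is missense_variant filter that you can apply in the web interface.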

Running CFTR variants through VEP

Resequencing of the genomic region of the human CFTR (cystic fibrosis transmembrane conductance regulator, ATP-binding cassette sub-family C, member 7) gene (ENSG00000001626) has revealed the following variants. The alleles are defined on the forward strand:

  • G/A at 7: 117,530,985
  • T/C at 7: 117,531,038
  • T/C at 7: 117,531,068

Use the VEP tool in Ensembl and choose the options to see SIFT and PolyPhen predictions. Do these variants result in a change in the proteins encoded by any of the Ensembl genes? Which gene? Have the variants already been found?

Go to the Ensembl homepage and click on the Tools link at the top of the page. Currently there are nine tools listed on that page. Click on Variant Effect Predictor and enter the three variants as below:

7	117530985	117530985	G/A
7	117531038	117531038	T/C  
7	117531068	117531068	T/C

Note: Variation data input can be done in a variety of formats. See more details about the different data formats and their structure in this VEP documentation page. Click Run. When your job is listed as Done, click View Results.
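
If you have many variants, you could generate the input lines shown above with a short script rather than typing them. The Python sketch below builds the same tab-separated lines from the three variants in the question. It assumes single-base substitutions, where the start and end coordinates are the same; the function name is made up for this example.

```python
def to_input_line(chrom: str, position: str, alleles: str) -> str:
    """Build one tab-separated line: chromosome, start, end, alleles."""
    pos = int(position.replace(",", "").strip())
    return "\t".join([chrom, str(pos), str(pos), alleles])


# The three CFTR variants from the question, alleles on the forward strand.
variants = [
    ("7", "117,530,985", "G/A"),
    ("7", "117,531,038", "T/C"),
    ("7", "117,531,068", "T/C"),
]

lines = [to_input_line(*v) for v in variants]
print("\n".join(lines))
```

The output can be pasted straight into the Paste data box.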

You will get a table with the consequence terms from the Sequence Ontology project (http://www.sequenceontology.org/) (e.g. synonymous, missense, downstream, intronic, 5’ UTR, 3’ UTR, etc.) provided by VEP for the listed SNPs. You can also upload the VEP results as a track and view them on Location pages in Ensembl. SIFT and PolyPhen are available for missense SNPs only. For two of the entered positions (coordinates 117531038 and 117531068), the variants are predicted to have missense consequences of varying pathogenicity, both affecting CFTR. All three variants have already been annotated and are known as rs1800077, rs1800078 and rs35516286 in dbSNP (databases, literature, etc.).

VEP analysis of structural variants in human

We have details of a genomic deletion in a breast cancer sample in VCF format:

13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738

Use VEP in Ensembl to find out the following information:

1.  How many genes have been affected?

2.  Does the structural variant (SV) cause deletion of any complete transcripts?

3.  Map your variant in the Ensembl browser on the Region in detail view.

  1. Click on VEP at the top of any Ensembl page and open the web interface. Make sure your species is Human. It is good practice to give your VEP jobs a descriptive name, such as Patient deletion exercise. Paste the variant in VCF format into the Paste data field and hit Run.

    12 different genes are affected by the SV.

  2. Filter your table by selecting Consequence is transcript_ablation at the top of the table and clicking Add.

    Yes, complete transcripts are deleted for PDS5B, N4BP2L1, BRCA2, RNY1P4, IFIT1P1, ATP8A2P2, N4BP2L2, N4BP2L2-IT2 and one gene without an official symbol: ENSG00000212293.

  3. To view your variant in the browser click on the location link in the results table 13: 32307062-32908738. The link will open the Region in detail view in a new tab. If you have given your data a name it will appear automatically in red. If not, you may need to Configure this page and add it under the Personal data tab in the pop-up menu.
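To see how the single-line VCF record used in this exercise relates to the location shown in the results table, the sketch below splits it into the eight standard VCF columns and reads the END value from the INFO column. It is an illustration only: it splits on any whitespace because the record is shown here with spaces, and the helper name is hypothetical.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_record(line):
    """Split one VCF data line into named columns and unpack its INFO entries."""
    fields = line.split()
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("expected at least 8 columns in a VCF record")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    info = {}
    for item in record["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # entries without '=' are flags
    record["INFO"] = info
    return record

record = parse_vcf_record("13 32307062 sv1 . <DEL> . . SVTYPE=DEL;END=32908738")
end = int(record["INFO"]["END"])

# Region string in the chromosome:start-end form used by the location link.
region = f'{record["CHROM"]}:{record["POS"]}-{end}'
# Distance between POS and END, as a rough size of the deleted region.
span = end - record["POS"]
print(region, span)
```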

Regulation

We’re going to look for regulatory features in the region of a gene and investigate their activity in different cell types. We’ll start by searching for the gene KPNA2 and jumping to the Location tab. Scroll down to the Region in detail view and zoom out a little to see the gene as well as its flanking regions.

The Regulatory Build track is shown by default.

In this region we can see a number of regulatory features, including a red promoter with light red promoter flanks, cyan CTCF binding sites, yellow enhancers and lilac transcription factor (TF) binding sites (don’t worry if your zoom level differs and you can see more or fewer features). Refer to the legend at the bottom of the view to see what each of the colours means.

You can also click on the individual regulatory features to learn more. Click on the red promoter to open a pop-up menu.

Click on the stable ID, ENSR00000097453, to jump to the Regulation tab.

Here, you can find a summary of the activity of the promoter in the different cell types. Scroll down to Summary of Regulatory Activity to find out in which cells the promoter is active (the feature displays an active epigenetic signature, which can include evidence of open chromatin), inactive (the region bears none of the epigenetic modifications included in the Regulatory Build), poised (the feature displays an epigenetic signature with the potential to be activated) or repressed (the feature is epigenetically suppressed). We can see that this promoter is active in one out of the 118 cell types currently in Ensembl.

Let’s switch back to the Location tab to explore the different regulation tracks that are available. Click on Configure this page and in the pop-up window under the Regulation section, click on Other regulatory regions and enable the Fantom 5, TarBase and Motif features tracks. Close the pop-up window.

The Fantom 5 track displays transcription start site (TSS) and enhancer predictions from the FANTOM5 project.

The TarBase track displays experimentally verified miRNA targets from TarBase.

The Motif features track indicates the positions of transcription factor binding motifs (TFBMs) as black lines/blocks. You can click on individual features to find out more information about the TFBM, including a list of TFs binding at this site and, if available, the cells in which the TFBM was experimentally verified. You can also view the Binding matrix by clicking on the matrix ID. This opens a pop-up window which displays the binding matrix used and a binding score representing how well a particular site matches the binding matrix.

We can explore more detailed data by adding further Regulation tracks. Click on the Configure this page button on the left-hand side.

In the pop-up window, go to Regulation and click on Features by Cell/Tissue to view the detailed activity of the regulatory feature by cell type.

We can add cells by clicking on them. Find them using the search or the alphabet ribbon. Let’s add a cell type where the promoter is inactive, aorta, and one where it’s active, astrocytes. Once you’ve selected the cells, they will appear in the menu on the right, where you can view the list by clicking on the + icon and de-select them if needed.

To choose the experiments to see data on, click on the Experiments tab at the top of the menu. You can navigate this in the same way as the Cell/Tissue tab, except that you have to choose between Histone, Open Chromatin and Transcription factors. Let’s Select all in all categories.

When you’ve chosen your experiments and cells, you can click on the green Configure track display button in the bottom right-hand corner.

Now we can see the active feature in astrocytes compared to the inactive feature in aorta.

Regulatory features between INSIG1 and BLACE in human

  1. Find the Location tab (Region in detail view) for the region between the genes INSIG1 and BLACE. Are there any predicted enhancers in this region?

  2. Go to the Regulation tab for the enhancer ENSR00001133586. How many cell types is this enhancer active in? Are there any cell types where its activity is repressed?

  3. Switch to the Location tab. Take a look at the histone modifications across this enhancer in neutro myelocyte cells, where this enhancer is active, compared to neutrophil (CB) cells, where it is poised. What differences can you observe?

  4. Are there any verified transcription factor binding motifs in this enhancer? In what cells?

  1. Search for human INSIG1 from the Ensembl homepage. Click on INSIG1 genomic coordinates 7:155297776-155310235:1 in the search results to open the Location tab directly. In the Region overview display, drag out a box to encompass the neighbouring BLACE gene. Scroll down to the Region in detail display. Have a look at the Regulatory Build track. You can find a legend of this track underneath the display.

    There are 5 yellow enhancers in the region between the genes INSIG1 and BLACE.

  2. There are several ways to search for the enhancer. You can click the different enhancer features in the Regulatory Build track to find ENSR00001133586, or you can search Ensembl for the ID ENSR00001133586 and navigate to the Regulation tab. Under the Activity display, you can find the activity of the regulatory feature across different cell types.

    ENSR00001133586 is active in neutro myelocyte cells only. It is repressed in 34 cell types.

  3. Click on the Location tab. Choose cells by clicking on the Configure this page button on the left-hand panel or Add/remove tracks button above the Region in detail display. In the pop-up window, click on Features by Cell/Tissue in the left-hand menu. Select neutro myelocyte in which this enhancer is active and neutrophil (CB) in which it is poised. Add experiment tracks by clicking on Experiments tab and Select all under Histone. Click Configure track display, then View tracks to load the page.

    Both cell types have H3K27me3, H3K4me1 and H3K9me3 histone modifications at this locus, while neutro myelocyte cells also have H3K27ac and H3K36me3 modifications, and neutrophils (CB) have H3K4me3 modifications. The different clusters of peaks indicate different epigenetic profiles, which might explain the difference in the enhancer activity between these two cell types.

  4. Stay in the Location tab. Click on the Configure this page button on the left-hand panel or Add/remove tracks button above the Region in detail display. In the left-hand menu of the pop-up window, go to the Regulation section and click on Other regulatory regions. Enable the Motif features track to visualise any transcription factor (TF) binding motifs. Close the pop-up window. Find the Motif features track. There are two black markers indicating verified TF motifs. Click on them to see which motifs they are and in which cells they were verified.

    The two motifs are both verified in K562 cells and bind a number of different TFs. The ENSM00523362328 motif binds ELF1, ELF2, ELK1, FLI1, ERG, ETS1, ETV6, FOXO1::ELK3, FOXO1::ETV1, ETV1, ETV2, ERF, ELK3, ETV3, GABPA, ETS2, ELK4, FEV, ETV5 and ETV4. ENSM00523900117 binds ETV7, ETS1 and ELK1::SPDEF.
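The histone modification comparison in answer 3 above is a comparison of two sets, which can be written out explicitly. This is a minimal sketch in Python using only the modifications listed in that answer; the variable names are illustrative.

```python
# Histone modifications reported at the enhancer in each cell type (from answer 3).
neutro_myelocyte = {"H3K27me3", "H3K4me1", "H3K9me3", "H3K27ac", "H3K36me3"}
neutrophil_cb = {"H3K27me3", "H3K4me1", "H3K9me3", "H3K4me3"}

# Marks present in both cell types.
shared = neutro_myelocyte & neutrophil_cb
# Marks seen only where the enhancer is active (neutro myelocyte).
only_neutro_myelocyte = neutro_myelocyte - neutrophil_cb
# Marks seen only where the enhancer is poised (neutrophil (CB)).
only_neutrophil_cb = neutrophil_cb - neutro_myelocyte

print("Shared:", sorted(shared))
print("Neutro myelocyte only:", sorted(only_neutro_myelocyte))
print("Neutrophil (CB) only:", sorted(only_neutrophil_cb))
```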

Regulatory features in human

  1. Search for the regulatory feature ENSR00000262400. What type of feature is this? What is its genomic location?

  2. Which cell types is this feature inactive and/or repressed in? View the supporting evidence for the repressed cell type. What project was the repressed cell type studied in?

  3. Why do so many cells have this feature listed as NA on the Activity display?

  1. Search for ENSR00000262400 on the Ensembl homepage. Click on the search result to open the Regulation tab.
    ENSR00000262400 is a CTCF binding site located at Chromosome 11: 1,998,001 - 2,001,400. Both the feature type and the location are shown at the top of the Activity page.

  2. Scroll down to see the summary of regulatory activity across different cell types.
    The CTCF binding site is inactive in H1-hESC_3 and HepG2 cells. It is repressed in A673.

    Click on Source Data at the top of the page or in the left-hand menu. Use the filter at the top right-hand corner of the table and enter A673. You can find the source of the supporting evidence under the Source column. The cell type A673 was studied in the ENCODE project.

  3. Note that many cell types have this feature represented as NA. This is because no corresponding CTCF signal or peaks are available for these cell types as they were not studied in the project sources.
    Cells which do not have CTCF ChIP-seq data cannot have an activity listed for this feature.
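If you copy the activity calls for a regulatory feature into a script, grouping cell types by state takes only a few lines. The sketch below is illustrative and uses only the three cell types named in the answer to question 2; the many cell types listed as NA are left out, and the state labels and function name are assumptions rather than an Ensembl export format.

```python
from collections import defaultdict

# Activity of ENSR00000262400 in the cell types named in the answer to question 2.
activity = {
    "H1-hESC_3": "inactive",
    "HepG2": "inactive",
    "A673": "repressed",
}

def cells_by_state(activity_by_cell):
    """Group cell type names by their activity state."""
    grouped = defaultdict(list)
    for cell, state in activity_by_cell.items():
        grouped[state].append(cell)
    return {state: sorted(cells) for state, cells in grouped.items()}

grouped = cells_by_state(activity)
for state, cells in grouped.items():
    print(state, len(cells), ", ".join(cells))
```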