Get genes by protein domain
Retrieve the protein sequences (in FASTA format) of all wheat genes that have an NCBI Gene ID, are protein-coding and encode proteins with transmembrane helices.
Run a count after applying each filter to check how many genes remain in your dataset.
Export the sequences and select Gene description and Source of gene name as the header information.
Click on BioMart on the navigation bar at the top of the page. Click the New button on the toolbar on the top left-hand corner, choose the Ensembl Plants Genes database and Triticum aestivum genes dataset.
Now, filter for the genes with NCBI Gene ID only:
Click on Filters in the left panel, expand the GENE section by clicking on the + box. Select with NCBI Gene ID under Limit to genes (external references)…. Make sure the box next to the filter is ticked, otherwise the filter won’t work.
Now click the Count button on the toolbar.
This will give you 92 Genes.
Now filter further for genes that are protein-coding by selecting Gene type – protein_coding and click again on Count.
This still gives you 92 Genes, meaning that all genes you have previously filtered are protein-coding.
Finally, filter for genes that encode proteins with transmembrane helices. Expand the PROTEIN DOMAINS AND FAMILIES section by clicking on the + box. Select Transmembrane helices – Only under Limit to genes …, then click Count again.
There are 79 genes in the bread wheat genome that have an NCBI Gene ID, are protein-coding and encode proteins with transmembrane helices.
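The same three filters can also be written as a BioMart XML query instead of being ticked in the web interface. The sketch below only builds the query text and does not contact any server. Every internal name in it is an assumption that should be checked against the XML button in BioMart: the virtual schema (plants_mart), the dataset (taestivum_eg_gene), the filter names (with_entrezgene, biotype, with_tmhmm) and the attribute (ensembl_gene_id). Setting count to "1" is assumed to request a gene count rather than the result rows. The helper build_biomart_query is a hypothetical name introduced for this illustration.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, boolean_filters, value_filters, attributes,
                        formatter="TSV", count=False):
    """Return a BioMart-style XML query as a string (no network access)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "plants_mart",   # assumed schema name
        "formatter": formatter,
        "header": "0",
        "uniqueRows": "0",
        "count": "1" if count else "",        # assumed: "1" asks for a count only
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Boolean filters ("Only" in the web interface) use excluded="0".
    for name in boolean_filters:
        ET.SubElement(ds, "Filter", {"name": name, "excluded": "0"})
    # Value filters carry the selected value, e.g. the gene type.
    for name, value in value_filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All dataset, filter and attribute names below are assumptions.
count_query = build_biomart_query(
    dataset="taestivum_eg_gene",
    boolean_filters=["with_entrezgene", "with_tmhmm"],
    value_filters={"biotype": "protein_coding"},
    attributes=["ensembl_gene_id"],
    count=True,
)
print(count_query)
```

Writing the filters out this way makes it easy to drop one and compare the counts, which mirrors clicking Count after each filter in the web interface.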
Go to Attributes on the left-hand panel. Select Sequences from the options on the right. Expand the SEQUENCES section by clicking on the + box and select Peptide. Select the appropriate header information from the HEADER INFORMATION section: Gene description and Source of gene name.
Click on Results on the toolbar and the sequences will be shown in FASTA format.
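Once the FASTA output has been saved, a quick check is to count the records and look at the header fields. The sketch below is a minimal parser that assumes the header fields are separated by a | character. The two records in sample are made-up placeholders, not real BioMart output, and parse_fasta is a hypothetical helper name.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header_fields, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].split("|")  # assumed field separator
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up placeholder records for illustration only.
sample = """>example description one|SourceA
MKTLLVAG
LLAV*
>example description two|SourceB
MASTW*
"""

records = parse_fasta(sample)
print(len(records), "sequences")
for fields, seq in records:
    print(fields, len(seq))
```

The number of records can differ from the gene count shown by Count, because a gene with several transcripts may be exported as several peptide sequences.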