In this exercise, you will look into different gene-finding approaches, learn various things about a contig from a genome assembly, and annotate the genes it encodes (see the sequence assembly lecture if you need a reminder of the vocabulary).
There are two main methods for automatic gene prediction: ab initio methods and comparative methods. Ab initio methods use the DNA sequence as the only input and are also referred to as intrinsic methods. Several features of a genomic sequence can be used to identify genes computationally. These features relate either to the signals that regulate the biological mechanisms of gene expression (detected by signal sensors) or to biases in sequence composition in DNA regions that are translated into proteins (detected by content sensors). Typical signals are splice sites (donor: GTRAGT, acceptor: YAG, branch site: CTRAY, where R stands for A or G and Y for C or T), the start of translation (codon ATG), and the end of translation (codons TGA, TAA, and TAG).
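To make the idea of a signal sensor concrete, here is a minimal sketch that scans a sequence for the consensus motifs listed above by translating them into regular expressions. The function names, the dictionary layout and the toy sequence are illustrative assumptions, not part of any gene finder. It reports exact consensus matches in every reading frame on the forward strand only, whereas a trained signal sensor would score candidate sites probabilistically.

```python
import re

# The two ambiguity codes used by the motifs in the text: R = A or G, Y = C or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}

# Consensus motifs quoted in the text; the keys are illustrative names.
SIGNALS = {
    "donor": ["GTRAGT"],
    "acceptor": ["YAG"],
    "branch": ["CTRAY"],
    "start": ["ATG"],
    "stop": ["TGA", "TAA", "TAG"],
}


def motif_to_regex(motif):
    """Translate a consensus motif into a regular expression."""
    return "".join(IUPAC[base] for base in motif)


def find_signals(sequence):
    """Return sorted (position, signal_name, matched_text) tuples, 0-based."""
    sequence = sequence.upper()
    hits = []
    for name, motifs in SIGNALS.items():
        for motif in motifs:
            # The lookahead lets overlapping occurrences be reported as well.
            pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
            for match in pattern.finditer(sequence):
                hits.append((match.start(), name, match.group(1)))
    return sorted(hits)


# Made-up toy sequence, used only to exercise the function.
toy_sequence = "CCATGGTAAGTCCCTAACTTTCAGGATAA"
signal_hits = find_signals(toy_sequence)
for position, name, text in signal_hits:
    print(position, name, text)
```

Short motifs such as YAG or ATG occur very often by chance, which is one reason why signal sensors are combined with content sensors in practice.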
The content sensor most commonly used is bias in codon usage: regions of DNA coding for a protein use some codons more frequently than others. Both signal sensors and content sensors must be trained, i.e., we must start from a set of observations (such as known genes) from which we build a sensor model. Predicting a gene therefore involves looking for new features in the genomic sequence that resemble our model. The resemblance can be established in terms of probabilities.
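As a rough illustration of training a content sensor and expressing resemblance as a probability, the sketch below estimates codon frequencies from a set of known coding sequences and scores a candidate region with a log-odds ratio. The uniform background (every codon equally likely in non-coding DNA), the pseudocount, the function names and the two tiny training sequences are all simplifying assumptions made for this example. Real gene finders use much larger training sets and richer models.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def split_codons(sequence):
    """Cut a sequence into consecutive codons, dropping any incomplete tail."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def train_codon_model(known_cds, pseudocount=1.0):
    """Estimate codon probabilities from known coding sequences.

    The pseudocount keeps unseen codons from getting probability zero.
    """
    counts = Counter()
    for cds in known_cds:
        counts.update(split_codons(cds))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(region, model):
    """Sum of log2(P(codon | coding) / P(codon | background)) over the region.

    The background is assumed uniform (1/64 per codon); codons containing
    characters other than A, C, G, T are skipped.
    """
    background = 1.0 / len(CODONS)
    score = 0.0
    for codon in split_codons(region):
        if codon in model:
            score += math.log2(model[codon] / background)
    return score


# Made-up training set standing in for a collection of known genes.
training_cds = ["ATGGCTGCTGCTAAAGCTTAA", "ATGGCTAAAGCTGCTAAATAA"]
codon_model = train_codon_model(training_cds)

print(coding_log_odds("GCTGCTGCT", codon_model))  # codons frequent in training
print(coding_log_odds("CGACGACGA", codon_model))  # codons absent from training
```

A positive score means the region uses codons in a way that resembles the training genes more than the assumed background does. A negative score means the opposite.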
Comparative methods are also called extrinsic methods. They include two strategies: those that use homology with sequences from other genes, also called homology-based, and those that make comparisons with genomic sequences from other genomes, also called comparative-genomics-based. Homology-based methods predict a gene from the alignment of a protein sequence, or an RNA sequence in the form of a full-length mRNA, cDNA or EST (expressed sequence tag), with the genome sequence that we want to annotate. The known sequence (also called evidence) guides the prediction. There are several ways of applying homology-based methods. The simplest is to accept the alignment of the known sequence to the genome as the gene prediction. More advanced methods use the known sequence as a guide and try to extend the evidence into a complete gene structure. The efficacy of the latter method depends on the number of known gene sequences; hence it is limited by the completeness of biological databases. Comparative-genomics-based methods hypothesise that any sequence conserved between two relatively closely related genomes is functional and likely to code for a gene.
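The comparative-genomics hypothesis can be sketched as a search for conserved windows. The example below assumes the two genomic sequences have already been aligned to equal length by some other tool, with "-" marking gaps, and simply reports sliding windows whose identity reaches a threshold. The function name, window size, threshold and toy sequences are hypothetical choices for illustration, and real comparative gene finders model conservation far more carefully.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return (start, end, identity) for every window meeting the threshold.

    Both inputs must be rows of the same pairwise alignment, so they have
    equal length. Columns containing a gap ("-") never count as matches.
    Coordinates are 0-based alignment columns, with end exclusive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    pairs = zip(aligned_a.upper(), aligned_b.upper())
    matches = [int(a == b and a != "-") for a, b in pairs]

    hits = []
    running = sum(matches[:window])  # matches in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide by one column: add the new column, drop the old one.
            running += matches[start + window - 1] - matches[start - 1]
        identity = running / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits


# Made-up aligned sequences: the first 12 columns are identical.
genome_a = "ACGTACGTACGTTTTTTTTTTT"
genome_b = "ACGTACGTACGTAAAAAAAAAA"
perfect_hits = conserved_windows(genome_a, genome_b, window=10, min_identity=1.0)
print(perfect_hits)
```

Overlapping windows that pass the threshold could then be merged into candidate conserved regions and inspected further, for instance with the signal and content sensors described earlier.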