Help on motifs
Sequence Logos and Consensus Sequences
Sequence logos and consensus sequences are alternative ways of representing the sequences in a conserved motif.
A sequence logo is a graphical representation of the nucleotide 'information' in a stack of sequences. Developed by Tom Schneider and Mike Stephens, a logo shows a stack of nucleotide characters for each position in the motif's sequence stack. At each position, the total height of the character stack indicates sequence conservation, measured in bits, and the height of each symbol within the stack indicates the relative frequency of the corresponding nucleotide. The maximum value of the y-axis is 2 bits, which corresponds to a position at which every sequence has the same nucleotide.
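The stack heights described above can be sketched as follows. This is a minimal illustration, not the code used to draw the logos on this site: it assumes a uniform background over the four nucleotides and omits any small-sample correction. The function name `column_information` and the count dictionary layout are hypothetical.

```python
import math

def column_information(counts):
    """Return (stack_height_bits, symbol_heights) for one motif position.

    counts: dict mapping 'A', 'C', 'G', 'T' to observed counts at this position.
    Assumes a uniform background and applies no small-sample correction.
    """
    total = sum(counts.values())
    freqs = {base: n / total for base, n in counts.items()}
    # Shannon entropy of the position, in bits; zero-frequency bases contribute nothing
    entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    # With four nucleotides the maximum entropy is 2 bits
    stack_height = 2.0 - entropy
    # Each symbol's height is its share of the stack
    symbol_heights = {base: f * stack_height for base, f in freqs.items()}
    return stack_height, symbol_heights
```

A fully conserved position gives a stack height of 2 bits, and a position at which all four nucleotides are equally frequent gives a height of 0.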
Characters in a consensus sequence are represented by IUPAC ambiguity codes. We assign a character to a motif position using rules adapted from Cavener, Nucleic Acids Res. 15, 1353-1361, 1987:
- A single nucleotide is shown if its frequency is greater than 50% and at least twice as high as the second most frequent nucleotide.
- A two-way degenerate code is shown if the corresponding two nucleotides together occur in more than 75% of the underlying sequences but each of them occurs in less than 50%.
- Because rules for three-way degenerate codes can be defined in many ways, we represent all other frequency distributions by the letter "n".
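The three rules above can be sketched as a small function. This is an illustrative reading of the rules as stated, not the site's implementation. The function name `consensus_char`, the frequency dictionary layout, and the choice to print single nucleotides and two-way codes in upper case are assumptions. The two-way letters are the standard IUPAC codes.

```python
# Standard IUPAC two-way ambiguity codes
IUPAC_TWO_WAY = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_char(freqs):
    """Assign a consensus character to one motif position.

    freqs: dict mapping 'A', 'C', 'G', 'T' to relative frequencies (summing to 1).
    """
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    (base1, f1), (base2, f2) = ranked[0], ranked[1]
    # Rule 1: one nucleotide above 50% and at least twice the runner-up
    if f1 > 0.5 and f1 >= 2 * f2:
        return base1
    # Rule 2: top two together above 75%, each below 50%
    if f1 + f2 > 0.75 and f1 < 0.5 and f2 < 0.5:
        return IUPAC_TWO_WAY[frozenset((base1, base2))]
    # Rule 3: everything else
    return "n"
```

Read literally, the rules send a position such as A = 55%, C = 40% to "n": A exceeds 50% but is not twice as frequent as C, and the two-way rule requires each nucleotide to be below 50%.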
Reporting a motif on both strands
Input sequence sets for motif discovery methods consist of corresponding sequence regions from a range of species. Currently, search regions for motif discovery are located largely 5' upstream of a gene's translation start site. Across a range of species, genes that are homologous to a target gene will be found on different genomic strands. Given this, any input set will typically contain sequences taken from different genomic strands.
The system's motif detection methods detect short nucleotide words that are a) well conserved across a set of input sequences and b) over-represented in that set, relative to a background sequence set. Currently, we configure methods to examine both strands of each sequence in an input set. A discovery method reports a motif strand relative to the sequence submitted in the input set. Knowing the genomic strand of an input sequence, we can readily translate a reported discovery strand into a genomic strand.
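The translation from a reported discovery strand to a genomic strand can be sketched as below. This assumes strands are encoded as +1 and -1, and that the reported strand is expressed relative to the input sequence as submitted. The function name `genomic_strand` is hypothetical.

```python
def genomic_strand(input_strand, reported_strand):
    """Translate a discovery-method strand into a genomic strand.

    input_strand:    genomic strand (+1 or -1) from which the input sequence was taken.
    reported_strand: strand (+1 or -1) reported by the discovery method,
                     relative to the submitted input sequence.
    """
    if input_strand not in (1, -1) or reported_strand not in (1, -1):
        raise ValueError("strands must be +1 or -1")
    # Same orientation keeps the genomic strand; opposite orientation flips it
    return input_strand * reported_strand
```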
However, there is a three-part complication that we refer to as a 'reporting uncertainty'. First, at the locations of a motif's conserved words across an input sequence set, reverse-complementary word pairs are conserved on both strands. Given this, a motif discovered across an input sequence set can be reported relative to either strand of a target genome. Second, an iterative method that is allowed to sample a region repeatedly can report a motif on different strands in different iterations. Finally, when motif discovery involves both input strands, we are not aware of an approach that lets us anticipate the strand on which a discovery method will report a motif.
cisRED's user interface design addresses the reporting uncertainty as follows. Up to v1.2e, the design assigned to a motif the genomic strand that corresponded to the strand reported by the discovery method. However, this approach failed to acknowledge the reporting uncertainty. For v1.2e_Sp2, we considered two approaches. The first, presenting a motif relative to the target genome's positive strand, would simplify the motif display. However, it would probably offer suboptimal support for annotating discovered motifs as 'known' vs. 'novel' (e.g. relative to a resource like TRANSFAC), because the motif orientation presented by such a resource for a TFBS may not correspond to the orientation of the discovered motif on the genomic positive strand.
Given this, our current design displays a discovered motif as a pair of reverse-complementary entities. That is, we display a motif relative to both genomic strands of the target genome. For each atomic motif, the user interface shows a pair of sequence logos, consensus sequences, PFMs and site sequence stacks. We anticipate that this will facilitate interpreting known/novel motif annotations, which we will implement in the near future.
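Deriving the second member of such a reverse-complementary pair can be sketched as follows. This is an illustration only. It assumes a PFM stored as a dictionary of per-position count lists, and the function names are hypothetical. The complement table uses the standard IUPAC pairings (R with Y, K with M; S, W and N are their own complements).

```python
# Complement table for nucleotides and IUPAC ambiguity codes, both cases
COMPLEMENT = str.maketrans("ACGTRYKMSWNacgtrykmswn",
                           "TGCAYRMKSWNtgcayrmkswn")

def revcomp_consensus(consensus):
    """Reverse-complement a consensus sequence written in IUPAC codes."""
    return consensus.translate(COMPLEMENT)[::-1]

def revcomp_pfm(pfm):
    """Reverse-complement a position frequency matrix.

    pfm: dict mapping 'A', 'C', 'G', 'T' to equal-length lists of counts,
         one entry per motif position.
    """
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Swap each row to its complementary base and reverse the position order
    return {pairs[base]: counts[::-1] for base, counts in pfm.items()}
```

Applying either function twice returns the original motif, which is a quick check that the pair shown for a motif is consistent.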