|Inroduction to dot-plots|
|Algorithms - Dot-plots|
|Written by Jan Schulz|
|Thursday, 15 May 2008 22:25|
This article is a collaborative work of Jan Schulz1, Florian Leese2 and Christoph Held1.
What is a dot-plot
Dot plots are a data analysis tool to scrutinise symbol sequences. Arrangement recurrences, similarities and inherent structures are visualised in two-dimensional plots. Repetitions of individual symbols, as well as motifs being similar appear as characteristic lines, rectangles, textures or combinations thereof. Through visualisation such structures can intuitively be distinguished from background noise and conceived by human eye. For two arbitrary sequences dot plot algorithms completely enumerate substrings of defined length. Each substring of the first sequence is aligned with each of the second. For each alignment a score for pairwise matching symbols among the two substrings is computed and visualised by colour mapping. The substring positions in the sequence determine the position in the dot plot and result in a concise representation of similarities among sequences.
In the field of genetics, the tools most commonly used today are alignment-based methods. However, alignment-based methods generally interpret incongruent character states in relation to surrounding identities, as the result of punctual deletion, insertion and alteration of character state. Consequently, they tend to perform poorly when the actual evolutionary process violates these implicit assumptions, e.g. when reversal or relocation of parts of a sequence took place. In addition, by way of their quest to present a single solution of positional homology, alignment algorithms are not designed to identify simple sequence repeats. To address this task dot plots are a powerful approach. Beside perfect motif repetition dot plots also reveal particular information on imperfect repetitions.
The dot plot technique dates back to the late 1960’s and early 1970’s (Fitch 1969, Gibbs & McIntyre 1970) and was developed in the field of genetics. Today, the most powerful application for dot plots is still the domain where they initially emerged from (e.g. Gibbs & McIntyre 1970, Maizel & Lenk 1981, Staden 1982, Brown et al. 1995, Dunham et al. 1999). Since the early approaches more sophisticated software tools have been designed (e.g. Staden 1982, Junier & Pagni 2000, Krumsiek et al. 2007) and continuing technological progress allows for increasingly faster and more complex analyses. Several state of the art programs have implemented dot plot modules for homology comparisons and pattern recurrence detection (e.g. STADEN, Geneious, DNAstar, BioEdit). Dot plot applications are particularly useful in the identification of interspersed repeats such as transposons and tandem-repeat motifs such as microsatellites (Leese et al. in press). Furthermore, loss or gain of whole motifs can easily be spotted in different types of domains, a trait useful in characterising the evolution of certain protein families (Beaussart et al. 2007). Dot plots are also employed in the investigation of properties of protein coding sequences by predicting secondary structures, like stem-loop formation or structural RNA domains (e.g. Guttell et al. 1993). In the field of DNA based watermarks adopted algorithms can support identification of patent infringements of genetically modified organisms (e.g. Heider & Barnekow 2007) during routine screening.
Although the main application of dot plot lays in genetics the method itself is context independent. Therefore it can be applied to any sequences where symbols represent items or tokenised objects. Thus, the method allows to investigate patterns in texts or chronological events and has a broad application field when one requires scrutinising sequences for such patterns (Bernstein et al. 1991). Dot plots have also been applied in software development, to identify redundant code and to ease maintenance of design patterns among releases (Helfman 1996). Historians and literary scientists can use dot plots to follow transcriptions and semantics over time. Comparing sequences of different origin they allow visual identification of source insertions, deletions or homologies, culminating in plagiarism detection. In highly volatile text systems, like websites, they have proven to support tracking of evolutionary modification processes (Bernstein et al. 1991). In translations they allow a more effective organisation of workflows and help institutions to reduce expenses (Helfman 1994, Helfman 1996).
As an initial example for dot plots one can imagine the same sequence written onto two strips of chequered paper. Every symbol of the sequence is written consecutively into one chequer, with its index number next to it. By overlaying a frame containing a window that allows viewing exactly one symbol of each strip at a time symbols are compared in pairs (Figure 1). Whenever symbols in the observing windows match, a bright dot is placed in a grid at the respective indices. The resulting rectangular graphical representation is a dot plot. It thus represents all possible comparisons of characters in either sequences and is colour-coded with two colours indicating a match or mismatch between any two characters. The resulting rectangular graphical representation is a dot-plot.
The dot plot result of comparing all symbols with each other can be represented by a matrix hereafter named M. The two sequences are furthermore referred to as T1 and T2. In this initial example is T1=T2. In the following we respect the formal convention that the first index refers to the row, the second the column; M(row, column). As the first sequence is normally presented on the horizontal axis processing of T1 is assigned to column and T2 to the row indices.
Detection of signal and noise in dot plots
Here, the sequence was compared against itself and results in a self-similarity dot-plot. The emerging dot plot shows a pronounced diagonal with a symmetric distribution of several points on both sides of it (Figure 1, dot plot chart). The diagonal marks the comparison of each symbol with itself and thus appears as a distinct and continuous line (Figure 1). The recurrent appearance of the motif ‘THE’ followed by a space is a second pattern creating a four dots long line. As this repetition has the same reading direction within the sequence, the line is parallel to the diagonal. The horizontal and vertical offset along the axes is equal to the respective distance in the sequence. In addition to recurrences of individual symbols whole motifs can be found as reverted copies within larger sequences and result in at least partially palindromic structures. The term palindrome defines motifs of individual symbols that read the same from both directions.
Single bright dots in the plot indicate positions where other symbols match. As these points are scattered on both sides of the diagonal and show no regular organisation, they can be assumed to represent random matches. Random matches are a function of sequence length and alphabet size. The stochastic probability of random pattern repetitions decreases with increasing alphabet sizes, while the total number increases with sequence length. In this example a relatively short sequence with a large alphabet of 27 symbols (characters ‘A’ to ‘Z’ and the space character) was used. This reduces the number of random matches, which appear as random, off-diagonal noise in the dot plot (Figure 1).
Investigating sequences of nucleic acids an alphabet size of four symbols and sequence lengths of several hundreds to thousands characters are common. This entails more background noise as compared to a dot plot of amino acid sequences. The probability of a random match between any two nucleic acids is 0.25 whereas it is 0.05 for amino acids with an alphabet size of 20. Using a window size of one symbol for the analysis of the arbitrary gene sequence EU127468.1 (Table 1) the plot has a high level of background noise and is of minor use, when interested in large scale patterns or structures (Figure 2, upper panel). This effect is even more pronounced, when applying dot plots to sequences of binary information. Consequently, filters are required to emphasise regions with a number of consecutively matching symbols. Therefore window size is enlarged to compare pairs of more than one consecutive symbols for each dot in the plot.
Identity dot plots compare substrings of defined length starting from every position on the respective strands. As alignments within the windows are only performed in reading direction the detection of reverted structures and palindromes is reduced with window sizes larger than one. Beside literal palindromes palindromic structures also play an important role in genetic sequences, where definition is slightly different. The DNA double helix is formed by two complementary paired strands. Biological reading direction follows the chemical enumeration of the phosphate linked ribose backbone from the 5’ to 3’ end and is oppositely oriented on the two strands. The base pairing between strands is determined by the two complementary base pairs adenine and thymine (A-T), as well as cytosine and guanine (G-C). Such sequences, or parts of it are palindromic when the first strand reads the same as its reverse complement in the respective reading direction. To overcome limitations of the identity dot plot, a parallel analysis needs to be included, to detect structures like the motif 5’-TATAGCTATA-3’, which is palindromic to its reverse complement. Thus, the detection of reverted motifs on the complement strand in particular for larger window sizes is of major interest. Therefore sequence are also compared with reverted copies or in case of a genetic sequence with its reverse complement of the opposite strand.
The elements of the result matrix M are ordinal numbers representing the number of pairwise matches for the respective substrings. These ordinals can be directly used as indices pointing at entries in a colours palette array. The size of the colour map is the maximum number of possible matches plus one colour for the background; equivalent to #colour = 1+w. By using a grey-scale palette areas appear gradually highlighted by their number of consecutively matching symbols (Figure 2, lower panel). Additionally, one colour can be assigned to more than one element of the colour map and thus to more than one ordinal. Black is often used for a range of low numbers, to fade out regions of low similarity or complexity (compare Figure 2, lower panel and Figure 4 upper panel). Assigning bright colours to more ranks than the highest these structures are emphasised. The plots for this paper were created with the software Serolis (Schulz 2008), where the here described techniques were implemented.
Remarks on the window size
The chosen window size is of eminental importance in the visualisation process as it determines the criteria by which imperfect matches are distinguished as geometric figures from the background. Larger values allow finer colour graduations and focus on larger structures of consecutively matching symbols. In palindromes of even length each symbol appears at least twice, while odd lengths may have a unique central character. Thus, window sizes of one symbol impede identification of such patterns. For dot plots including reverted analyses appropriate window sizes are generally larger than three.
General remark on the algorithms
Comparing two different sequences the dot plot matrix needs to be entirely calculated. Investigating the algorithms it is found that runtime is a function of l*k*w for source code 1 and l*k*2w for source code 2. It is easily seen that runtime increases with increasing w, as often required for already larger sequences. Compared to fast pattern matching algorithms, like Knuth-Morris-Pratt or Boyer Moore (e.g. Gusfield 1997 and references therein), dot plots have the disadvantage that symbols within the windows need to be evaluated completely, but have the major advantage that they allow distinguishing similarities from imperfections. Thus, optimisation of match evaluation in the sliding window is a limiting factor. Sonnhammer and Durbin (1996) introduced a method with vectors, initialised for pairwise matches whenever the sliding window moves over a matching position in the sequences. Each vector contributes to the result as long as its position lays within the window. The outermost vector is discarded, when the window shifts one position towards the next position to compare. Huang & Zhang (2004) introduced a method creating look-up tables and initial tokenising of possible symbol combinations to increase calculation speed for DNA/RNA dot plot applications. The initial table setup requires time paying off with larger sequences. Small alphabets of the four nucleotides agree with this method. However, tables grow exponentially when forming words of certain lengths including e.g. the complete set of DNA IUPAC symbols (IUPAC 1984).
To reduce memory consumption for larger sequences compressions are used to focus on large scale structures. Therefore the plot is split into rectangles of equal size. Within these rectangles scores are evaluated completely. In the compressed plot each rectangle is represented by a single dot giving a proxy value for the rectangle (Sonnhammer & Durbin 1996). The disadvantage is that calculations need to be performed twice when zooming into a plot.
To reduce stack load results of substring comparison can be directly passed to a visualisation routine and not returned as complete l´k matrix. When a sequence is compared with itself less than half of the result matrix needs to be calculated. In this case it can be made use of the fact that M(i, j) = M(j, i) and M(i, i) = w.
Comparing a sequence simultaneously with a couple of others it is possible to overlay results (Vihinen 1988). To identify common similarities further methods are available to analyse dot plots for motif and alignment recognition (Vingron & Argos 1991) and to classify sequences by multivariate analyses of dot plots (Landès et al. 1993).
For some applications it may also be interesting to create dot plots that display lowest instead of highest counts to visualise matches being different from zero. Additionally one might be interested in different ways of counting or evaluating sequence substrings, to weigh matches differently or to return a normalised substring similarity (q%= 1 / w * #matches) for subsequent numerical processing.
Interpreting dot plots
The upper left corner of a dot plot normally shows the comparison of the first substring set from the two sequences. Down- and rightward indices of alignments increase. Especially in older publications origins of sequences is sometimes found in the lower left corner, increasing up- and rightward. The operating direction of a software package can be easily identified by plotting a sequence against itself. In this case dots on the main diagonal represent substrings from the positions i=j. This is equal to compare a substring with itself and results in a pronounced main diagonal where the number of matches is equal to the windows size w (Figure 6a). The plot is then square and symmetric across the main diagonal. Patterns off the main diagonal indicate structural similarities among different parts of the sequences. Lines off the main diagonal indicate that the different regions on the sequences are have a high similarity, maybe due to repetition (Figure 6b). The location of these repetitions is equal to the axis intercepts in the plot (D1, and D2 in Figure 6b). When D1 and D2 intersect on the same axis, regions in the respective strand are present that bear self-similarities. Including the reverted or complementary strand in the analysis, sequence similarities of reverted motifs are indicated by lines orthogonal to the main diagonal (6c, 6d). In genetics many transposable elements are characterised by short duplicated regions or terminal inverted sequence repeats that can be identified by their typical structure in dot plots (Figure 6d). Tandem repeats, such as microsatellites (Figure 6e) and minisatellites (Figure 6f) (microsatellites = repeat unit 1-6 base pairs, minisatellites = repeat unit of 10-50 base pairs) are characterised by parallel lines forming a rectangular shape. If two different sequences are compared against each other (Figure 6g, 6h) gaps in the main diagonal indicate deviations. If the main diagonal shows a break (Figure 6h) this point indicates the location of a fragment in one strand that was deleted or inserted in the other one (so called ‘indel’). This is also a common feature when plotting cDNA sequences of a gene against the unspliced genomic sequence.
Analysing the identity dot plot (T1=T2) of the sequence EU127468.1 (Figure 2, lower panel) the continuous main diagonal is found. Additionally, four bright rectangular patterns can be seen. Two of them are square and located on the main diagonal approximately between positions 65-95 and 245-290 (Table 1, symbols printed in bold). They represent regions of high self similarity due to local microsatellite repetition of short (AC)x tandems. The rectangular patterns off the diagonal indicate similarity between the two repeat regions on the diagonal to which they are located perpendicular. This results from high similarity that is not only found around position T1[i]/T2[i] and T1[j]/T2[j] but also between T1[i]/T2[j] and T1[j]/T2[i] and makes up the four symmetric rectangles recognized. the lengths of the sides of the rectangles are equal to the outlines of the opposite squares on the diagonal and interconnected with these by broad dark streaks. They indicate the reduced probability of random matches between the tandem repeats and non-repeat regions.
Application of dot plot in genetic research
A concise application for dot plots is genetic probe design and comparsion of evolutionary altered sequences. Here, the advent of the polymerase chain reaction paved the way for a variety of applications that rely on identifying and amplifying specific genomic regions. Selected probes often fail to amplify the specified regions or amplify multiple genomic regions. While several probe-design programs (Kalendar et al. 2003, Rozen et al. 2000) aid the process of design and optimisation according to thermodynamic rules, amplification can fail as probes might hybridise to regions in the complementary strand unnoticed by the primer design program, as the motif is completely different. Dot plots including reverse complement analysis allow identification of such sites. For microsatellite analyses in population genetic or mapping studies the aim is to specifically amplify highly variable genomic regions using conserved flanking regions. Flanking regions are often much less conserved and an integral part of higher-order repeats. Dot plots allow the identification of these super-order structures and avoid locating probes within them (see Leese et al. in press).
If a priming site is duplicated or even duplicated further obscured due to inversions, successful and specific hybridization or amplification of single sites can be counteracted. High numbers of sequences from the enrichment procedure can be screened. While appropriate tools for the identification of tandem repeats exist (e.g. Tandem Repeats Finder, Benson 1999 and Phobos, Mayer, www.rub.de/spezzoo/cm) the detection of higher-order or highly degenerated repeats, which should be avoided as candidate markers, still pose a complicated issue. Additionally, appropriate primer sites should not be located within duplicated regions, counteracting amplification success or leading to the amplification of paralogous gene copies. Dot plots greatly facilitate testing for the presence of such sites and proved to be helpful tools for the de novo development of appropriate microsatellite markers (Leese et al., in press, Leese et al. 2008).
Comparative genomic and phylogenetic studies compare DNA regions with the same ancestry and analyse the amount of accumulated differences. With growing phylogenetic distance differences increase, so that inherited similarities are difficult to detect using standard alignment methods. This holds true in particular if rearrangements of sequence domains have occurred (e.g. Sonnhammer and Kahn 1994). Here, dot plots are a powerful tool to visualise similarities and deviations. Cytochrome c oxidase is a small protein in mitochondrial membranes. Comparing a partial DNA sequence from Homo sapiens (GenBank accession number EF568637) with that of the distantly related dinoflagellate Alexandrium tamarense (GenBank accession number EF036567.1) few similarities are found (Figure 7, upper panel). The resulting dot plot, including reverse complement analysis, shows a disrupted main diagonal as a result of several substitutions. Additionally several motifs exist, which are palindromic in the opposite sequence. However, when the translated amino acid sequences is compared a much higher conservation level is found(Figure 7, lower panel) due to the degenerated genetic code at DNA level.
|Last Updated on Friday, 18 March 2011 17:55|