Computing the reverse complement of a DNA strand
Objective
Base pairing in DNA strands is determined by the chemical properties of the nucleotides: adenine pairs with thymine, and guanine with cytosine. The sequence of a single strand therefore fully determines the sequence of its complementary strand. The two complementary strands of a DNA molecule are read in opposite directions and thus code for different protein sequences (palindromic sequences are an exception). Analysing DNA therefore often requires investigating the products of both strands, especially when it is unknown which strand is sense and which is antisense.
Algorithm
To create the reverse complement of a DNA sequence, the sequence has to be reversed so that the result is again written in the conventional order from the 5’ to the 3’ end. In addition, every nucleotide of the source strand is replaced by the complementary base it pairs with. For example, the reverse complement of 5’-ATGC-3’ is 5’-GCAT-3’. Besides the four nucleotides, the International Union of Pure and Applied Chemistry defines eleven additional symbols (IUPAC 1984) for ambiguous nucleobases. Each of them indicates the potential presence of one base out of a defined set. Such ambiguities may arise, for example, when a base cannot be determined distinctly during sequencing, and they have to be transformed with respect to the possible pairings. As upper- and lowercase letters may carry information about the reliability of the base at a given position, the case of each symbol should also be preserved.
| IUPAC nucleotide code | Codes for nucleic base | Translated to as reverse complement |
|---|---|---|
| A | Adenine (A) | T |
| C | Cytosine (C) | G |
| G | Guanine (G) | C |
| T | Thymine (T) | A |
| R | A or G | Y |
| Y | C or T | R |
| S | G or C | S |
| W | A or T | W |
| K | G or T | M |
| M | A or C | K |
| B | C or G or T (not A) | V (not T) |
| D | A or G or T (not C) | H (not G) |
| H | A or C or T (not G) | D (not C) |
| V | A or C or G (not T) | B (not A) |
| N | A or C or G or T (any) | N (any) |
| . | Gap | . |
| - | Gap | - |
| ? | Unknown | ? |
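The table can also be held as data instead of code. The following is a minimal sketch of that idea, not the approach used in the source code of this article: the helper name ComplementSymbol and the two constants are hypothetical and chosen for illustration only. Both strings list the symbols of the table in the same order, once in uppercase and once in lowercase, so the position of a symbol in the first string selects its complement in the second.

```pascal
Function ComplementSymbol (aSymbol : Char) : Char;
// Hypothetical helper (not part of the original source): returns the
// complement of a single symbol by looking it up in two parallel strings.
// Symbols that are not listed in the table are returned unchanged.
Const
  cSymbols     : String = 'ACGTRYSWKMBDHVN.-?acgtryswkmbdhvn';
  cComplements : String = 'TGCAYRSWMKVHDBN.-?tgcayrswmkvhdbn';
Var aIndex : Integer;
Begin
  // Pos is case sensitive and returns 0 when aSymbol is not in cSymbols
  aIndex := Pos (aSymbol, cSymbols);

  If aIndex > 0 Then
    Result := cComplements [aIndex]
  Else
    Result := aSymbol;
end;
```

With these assumptions, ComplementSymbol ('R') yields 'Y', ComplementSymbol ('k') yields 'm' and an unlisted character such as 'X' is passed through unchanged. Applying the helper twice returns the original symbol, which is a quick way to check that the two strings are consistent with each other.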
These specifications result in simple, self-explanatory code. The source DNA sequence is passed to the function as a string. If it contains characters, the function starts at the last position and moves backwards to the first. Beginning with an empty result string, it appends one character to the result for every character found in the source string. Whenever the character of the source string is an IUPAC nucleotide code, its complement from the table is appended.
In any other case the source character is appended unchanged and the function’s return value is overwritten with the current position. Although unknown characters are mostly spelling mistakes, they cannot be omitted: changing the number of bases would shift the reading frame for this and all following triplets during translation. Because the source sequence is processed in reverse, every error overwrites the position of any error found further to the right. Thus, when errors occur, the position of the first one in the source sequence, read from left to right or from the 5’ to the 3’ end, is returned. Any return value other than zero indicates that the result may be unreliable. When the passed source string is empty, the function returns -1.
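The return value logic described above can be looked at in isolation. The sketch below is an illustration under stated assumptions and not part of the original source: the function name FirstInvalidPosition and the constant cValidSymbols are hypothetical, and the function only reports the status code without building a reverse complement.

```pascal
Function FirstInvalidPosition (Const aSourceString : String) : Longint;
// Hypothetical helper (not part of the original source): returns -1 for an
// empty string, 0 when only symbols of the table occur and otherwise the
// 1-based position of the leftmost symbol that is not in the table.
Const
  cValidSymbols : String = 'ACGTRYSWKMBDHVN.-?acgtryswkmbdhvn';
Var aPos : Integer;
Begin
  If Length (aSourceString) = 0 Then
  Begin
    Result := -1;
    Exit;
  end;

  Result := 0;

  // walking from the rear end, every invalid symbol overwrites the result,
  // so the last assignment belongs to the leftmost invalid symbol
  For aPos := Length (aSourceString) downto 1 do
    If Pos (aSourceString [aPos], cValidSymbols) = 0 Then
      Result := aPos;
end;
```

For the string 'ACXGZT' the loop first meets 'Z' at position 5 and stores 5, then meets 'X' at position 3 and overwrites the stored value, so the function returns 3, the first error when reading from the 5’ to the 3’ end.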
Source code
Function ReverseComplement (Const aSourceString : String; Var aReverseComplement : String) : Longint;
// The function ReverseComplement computes the reverse complement of a DNA
// strand. It handles all IUPAC codes for nucleic acids and maintains
// upper- and lowercase coding. The function returns 0 when successful, -1
// when aSourceString is empty and the position of the first non-IUPAC
// symbol in aSourceString, if any is found.
// (c) Dr. Jan Schulz, 20. November 2007; www.code10.info
Var aPos : Integer;
Begin
  // initialise returned sequence
  aReverseComplement := '';

  // terminate when no characters are available to create a reverse complement
  If Length (aSourceString) = 0 Then
  Begin
    Result := -1;
    Exit;
  end;

  // we hope to find valid IUPAC DNA codes only
  Result := 0;

  // travelling along the source string from the rear end to the beginning
  For aPos := Length (aSourceString) downto 1 do
  Begin
    // try to find the complement of every char of the source string
    Case aSourceString [aPos] of
      // nucleic bases
      'A' : aReverseComplement := aReverseComplement + 'T';
      'T' : aReverseComplement := aReverseComplement + 'A';
      'G' : aReverseComplement := aReverseComplement + 'C';
      'C' : aReverseComplement := aReverseComplement + 'G';
      'a' : aReverseComplement := aReverseComplement + 't';
      't' : aReverseComplement := aReverseComplement + 'a';
      'g' : aReverseComplement := aReverseComplement + 'c';
      'c' : aReverseComplement := aReverseComplement + 'g';

      // one of two possible nucleic bases
      'R' : aReverseComplement := aReverseComplement + 'Y';
      'Y' : aReverseComplement := aReverseComplement + 'R';
      'S' : aReverseComplement := aReverseComplement + 'S';
      'W' : aReverseComplement := aReverseComplement + 'W';
      'K' : aReverseComplement := aReverseComplement + 'M';
      'M' : aReverseComplement := aReverseComplement + 'K';
      'r' : aReverseComplement := aReverseComplement + 'y';
      'y' : aReverseComplement := aReverseComplement + 'r';
      's' : aReverseComplement := aReverseComplement + 's';
      'w' : aReverseComplement := aReverseComplement + 'w';
      'k' : aReverseComplement := aReverseComplement + 'm';
      'm' : aReverseComplement := aReverseComplement + 'k';

      // one of three possible nucleic bases
      'B' : aReverseComplement := aReverseComplement + 'V';
      'D' : aReverseComplement := aReverseComplement + 'H';
      'H' : aReverseComplement := aReverseComplement + 'D';
      'V' : aReverseComplement := aReverseComplement + 'B';
      'b' : aReverseComplement := aReverseComplement + 'v';
      'd' : aReverseComplement := aReverseComplement + 'h';
      'h' : aReverseComplement := aReverseComplement + 'd';
      'v' : aReverseComplement := aReverseComplement + 'b';

      // special characters
      'N' : aReverseComplement := aReverseComplement + 'N';
      'n' : aReverseComplement := aReverseComplement + 'n';
      '.' : aReverseComplement := aReverseComplement + '.';
      '-' : aReverseComplement := aReverseComplement + '-';
      '?' : aReverseComplement := aReverseComplement + '?';
    Else
      // if an unknown symbol occurs: return position and attach unchanged
      Result := aPos;
      aReverseComplement := aReverseComplement + aSourceString [aPos];
    end;
  end;
end;