|Extract specified values from strings|
|Algorithms - Strings|
|Written by Jan Schulz|
|Tuesday, 12 February 2008 00:15|
Extract the nth number from a string
Several applications require the extraction of numbers with a certain index from strings of arbitrary length. Typical examples are data strings acquired via RS232 or USB interfaces from external devices. A data set of three lines may look like this:
When the position of the requested value can not be achieved by defined index positions for the first and last character the string needs to be parsed. The easiest way is to start at the first symbol of each line, search and count the delimiters and cut out the requested item. However, a number of scenarios are imaginable when this is insufficient. An example would be unknown, changing or heterogeneous delimiters due to data migration from different sources. Alternatively, if one is interested in the nth value in the string and not in the value behind the nth delimiter or wants to identify whether a string contains at least n numbers. Additionally, one might only be interested in positive values or integers. In this case decimal separators or signs can also be delimiters. Here, an algorithm is presented that parses a string and extracts the nth value, if present. The interpretation of minus signs and decimal separators can be adjusted.
To solve this task implementation of an adopted state machine is advised. This machine needs to successively evaluate symbols of a passed string in the normal reading direction. A grammar controls different processing steps, e.g. when a symbol belongs to a number or needs to be interpreted as delimiter. Also, it determines how to identify and count the position of found values and define how negative values or those with decimal separators are treated. These rules determine how to modify the internal states until a stop condition is fulfilled.
To set up a state machine for the identification and extraction of a given number from a string requires a universal set of instructions covering all symbol combinations. Therefore, the different symbols a number consists of need to be identified. Four different types of characters can be distinguished:
Any other symbol delimits numbers. The presence of the the plus sign is not mandatory for positive values. Thus, the list is reduced to three symbols and instruction classes to distinguish while parsing.
When processing decimal separators it is necessary to recognize regional preferences. While countries like Australia, China, Saudi-Arabia, United Kingdom or United States use the decimal point, others like Germany, the Scandinavian countries, Spain or wide parts of South America and Russia make use of the comma. Comma and point can only be used interchangeably as decimal separators when the respective other symbol is not used to separate groups of three digits (thousands) in larger numbers. Consequently, a string containing numbers with both separators can not be extracted correctly.
Four parameters are necessary for initialisation of the state machine. The first is the string the state machine is to extract a number from. The second is the command which number in the line shall be extracted. Finally, it needs to know how to handle negative values (minus signs) and decimal separators. These variables are defined here as aString, ValueNo, IncludeMinus and IncludeDecSep.
As prerequisite condition aString needs to have at least one symbol, otherwise the algorithm needs not to start. If symbols are present the algorithm starts at the first position of aString, successively evaluates symbols and attaches those being part of a grammatically correct number to a temporary string.
Further the state variables are required, defining processing rules. These are handled and modified internally during sequential processing. First is the result string (Result) that contains the requested value after a successful search. Second is an integer variable that points at the actually processed symbol in the string (StringPos). An integer counter (ActualValue) denotes the actual number of values found to the left of the position of StringPos. Any value extracted form aString is stored in a second variable (SubString). When the machine reaches a final state and the content of SubString is the requested number this content is returned by the function in Result. A Boolean state variable (isExtracted), set when the requested number was found, is used to initiate search termination. A second Boolean variable (wasDecSepFound) displays whether the value in SubString already bears a decimal separator.
A termination state is reached when isExtracted is set or StringPos reaches the last symbol of aString. After termination the content of SubString can be returned as result, if ActualNumber = ValueNo. In this case the nth number was found. It is unsufficient to test for isExtracted as it would be set after StringPos is greater or equal than the length of aString. Thus, the function would fail when the requested number stands at the end of aString. With these outer definitions the algorithmic backbone of the state machine can be planned (Figure 1).
Numbers must contain digits and can not just consist of minus sign or decimal separator alone. These are optional symbols in numbers. For this the counter ActualNumber is only increased whenever the found digit is the first digit in SubString. Four cases can be distinguished for the beginning of a new number. Any digit indicates the beginning of a new number when:
At this point it is not necessary to evaluate IncludeMinus and IncludeDecSep, because these symbols will never be included in SubString when the respective flag is not set. As ciphers can stand at every position and consecutive digits belong to the same number no further limitations disallow appending.
By definition minus signs can only stand at the first position of a number. Thus, any occurrence signalises that maybe so far stored digits in SubString finally represent a number. If ActualValue is equal to ValueNo then SubString contains a copy of the requested number (and) isExtracted will be set to state termination of search.
Whenever SubString does not contain the requested number its content can be discarded.
A decision has to be made whether minus signs are delimiters (IncludeMinus = FALSE) or are included by the grammar (IncludeMinus = TRUE). In the first case SubString remains empty at this point, while in the latter the minus sign is the first symbol of any potentially following number. In any case, with the beginning of a new number the state variable wasDecSepFound also needs to be reset.
Decimal separators are usually flanked by digits on both sides. Nevertheless, notation allows writing numbers without ciphers on one side of its decimal separator but never both. Numbers like:
are true values, although successive or leading zeroes were omitted in this notation. For this reason an increment of ActualNumber was linked to the finding of ciphers (see above in subsection Cipher).
First differentiation after finding a decimal separator in aString is whether these symbols represent delimiters or are evaluated as belonging to a number. When point and comma are delimiters they signalise that maybe existing digits in SubString are finalised. If then ActualValue = ValueNo a copy of the requested number is found in SubString like in the above handling of minus signs. Then isExtracted will be set to signalise termination of search. Otherwise the content of SubString is discarded and thus prepared for the next try.
When decimal separators are no delimiters it is important to control whether one of these is already present in SubString. Only one decimal separator per number is valid. When the state variable wasDecSepFound is not set the decimal separator is attached to SubString. Using the variable DecimalSeparator of the operating system ensures proper conversion from string to float later.
When the decimal separator is no delimiter and one of this symbols is already included in SubString again the option arises, whether ActualValue = ValueNo to terminate search. Otherwise Substring will be deleted, a decimal separator is used as first symbol therein and wasDecSepFound set.
Function GetNthValueInString (aString: String; ValueNo: Integer; IncludeMinus: Boolean; IncludeDecSep: Boolean): String;
Observing the variables while parsing the second line of the fictive example above shown in the Objective section above demonstrates internal functioning of the state machine. The state of the variables is displayed after the actual symbol is processed. This point is marked with an asterisk in Figure 1.
Example 1: A call of the function with:
results stepwise in:
After the fourth step a symbol that does not belong to a number causes evaluation of the boolean expression (ActualNumber = ValueNo). As the outcome is TRUE, isExtracted is set the content of SubString is not discarded and causes the state machine to halt. Thereafter, the same evaluation returns content of substring as result.
Example 2: Calling the function with the parameters:
successively results in:
When StringPos is as high as the number of symbols in aString the algorithm terminates the loop. As already mentioned isExtracted is not set to TRUE at this point. Consequently, the boolean expression (ActualNumber = ValueNo) decides whether the content of SubString is returned as result.
The described algorithm works correct under the terms described above. However, the critical reader may be interested in analysing the above explanations in more detail. He will find two sites where state variables are modified although not necessary. Additionally, the code may show some redundancies.
|Last Updated on Friday, 18 March 2011 18:26|