Extract specified values from strings

Algorithms - Strings

Written by Jan Schulz

Tuesday, 12 February 2008 00:15

Extract the n^th number from a string

Objective

Several applications require the extraction of numbers with a certain index from strings of arbitrary length. Typical examples are data strings acquired via RS232 or USB interfaces from external devices. A data set of three lines may look like this:

```
[1]: 0.2321 –1321 00101010 TRUE 0.212
```
```
[2]: 0.5 –1324 00101010 TRUE 0.213
```
```
[3]: 0.75 1325 00101010 FALSE 0.25
```

When the position of the requested value can not be achieved by defined index positions for the first and last character the string needs to be parsed. The easiest way is to start at the first symbol of each line, search and count the delimiters and cut out the requested item. However, a number of scenarios are imaginable when this is insufficient. An example would be unknown, changing or heterogeneous delimiters due to data migration from different sources. Alternatively, if one is interested in the n^th value in the string and not in the value behind the n^th delimiter or wants to identify whether a string contains at least n numbers. Additionally, one might only be interested in positive values or integers. In this case decimal separators or signs can also be delimiters. Here, an algorithm is presented that parses a string and extracts the n^th value, if present. The interpretation of minus signs and decimal separators can be adjusted.

Approach

To solve this task implementation of an adopted state machine is advised. This machine needs to successively evaluate symbols of a passed string in the normal reading direction. A grammar controls different processing steps, e.g. when a symbol belongs to a number or needs to be interpreted as delimiter. Also, it determines how to identify and count the position of found values and define how negative values or those with decimal separators are treated. These rules determine how to modify the internal states until a stop condition is fulfilled.

To set up a state machine for the identification and extraction of a given number from a string requires a universal set of instructions covering all symbol combinations. Therefore, the different symbols a number consists of need to be identified. Four different types of characters can be distinguished:

Ciphers
Decimal separators
Minus signs
Plus signs

Any other symbol delimits numbers. The presence of the the plus sign is not mandatory for positive values. Thus, the list is reduced to three symbols and instruction classes to distinguish while parsing.

When processing decimal separators it is necessary to recognize regional preferences. While countries like Australia, China, Saudi-Arabia, United Kingdom or United States use the decimal point, others like Germany, the Scandinavian countries, Spain or wide parts of South America and Russia make use of the comma. Comma and point can only be used interchangeably as decimal separators when the respective other symbol is not used to separate groups of three digits (thousands) in larger numbers. Consequently, a string containing numbers with both separators can not be extracted correctly.

Algorithm

Four parameters are necessary for initialisation of the state machine. The first is the string the state machine is to extract a number from. The second is the command which number in the line shall be extracted. Finally, it needs to know how to handle negative values (minus signs) and decimal separators. These variables are defined here as aString, ValueNo, IncludeMinus and IncludeDecSep.

As prerequisite condition aString needs to have at least one symbol, otherwise the algorithm needs not to start. If symbols are present the algorithm starts at the first position of aString, successively evaluates symbols and attaches those being part of a grammatically correct number to a temporary string.

Further the state variables are required, defining processing rules. These are handled and modified internally during sequential processing. First is the result string (Result) that contains the requested value after a successful search. Second is an integer variable that points at the actually processed symbol in the string (StringPos). An integer counter (ActualValue) denotes the actual number of values found to the left of the position of StringPos. Any value extracted form aString is stored in a second variable (SubString). When the machine reaches a final state and the content of SubString is the requested number this content is returned by the function in Result. A Boolean state variable (isExtracted), set when the requested number was found, is used to initiate search termination. A second Boolean variable (wasDecSepFound) displays whether the value in SubString already bears a decimal separator.

A termination state is reached when isExtracted is set or StringPos reaches the last symbol of aString. After termination the content of SubString can be returned as result, if ActualNumber = ValueNo. In this case the n^th number was found. It is unsufficient to test for isExtracted as it would be set after StringPos is greater or equal than the length of aString. Thus, the function would fail when the requested number stands at the end of aString. With these outer definitions the algorithmic backbone of the state machine can be planned (Figure 1).

Figure 1: Algorithmic backbone of a state machine for extraction of numbers from a string. The treatments of symbols that occur in numbers are performed in the sub-routines 1 to 3. Any other symbol does not belong to a number and is a delimiter a priori. The asterisk shows the position from where the states are displayed in the Example section.

Ciphers

Numbers must contain digits and can not just consist of minus sign or decimal separator alone. These are optional symbols in numbers. For this the counter ActualNumber is only increased whenever the found digit is the first digit in SubString. Four cases can be distinguished for the beginning of a new number. Any digit indicates the beginning of a new number when:

SubString is empty, or
SubString contains a minus, or
SubString contains a decimal separator, or
SubString contains a minus followed by a decimal separator.

Figure 2: Algorithmic treatment when a cipher was found.

At this point it is not necessary to evaluate IncludeMinus and IncludeDecSep, because these symbols will never be included in SubString when the respective flag is not set. As ciphers can stand at every position and consecutive digits belong to the same number no further limitations disallow appending.

Minus sign

By definition minus signs can only stand at the first position of a number. Thus, any occurrence signalises that maybe so far stored digits in SubString finally represent a number. If ActualValue is equal to ValueNo then SubString contains a copy of the requested number (and) isExtracted will be set to state termination of search.

Figure 3: Algorithmic treatment when a minus sign was found.

Whenever SubString does not contain the requested number its content can be discarded.

A decision has to be made whether minus signs are delimiters (IncludeMinus = FALSE) or are included by the grammar (IncludeMinus = TRUE). In the first case SubString remains empty at this point, while in the latter the minus sign is the first symbol of any potentially following number. In any case, with the beginning of a new number the state variable wasDecSepFound also needs to be reset.

Decimal separator

Decimal separators are usually flanked by digits on both sides. Nevertheless, notation allows writing numbers without ciphers on one side of its decimal separator but never both. Numbers like:

-213,
213,
,213
-,213

are true values, although successive or leading zeroes were omitted in this notation. For this reason an increment of ActualNumber was linked to the finding of ciphers (see above in subsection Cipher).

Figure 4: Algorithmic treatment when a decimal separator was found

First differentiation after finding a decimal separator in aString is whether these symbols represent delimiters or are evaluated as belonging to a number. When point and comma are delimiters they signalise that maybe existing digits in SubString are finalised. If then ActualValue = ValueNo a copy of the requested number is found in SubString like in the above handling of minus signs. Then isExtracted will be set to signalise termination of search. Otherwise the content of SubString is discarded and thus prepared for the next try.

When decimal separators are no delimiters it is important to control whether one of these is already present in SubString. Only one decimal separator per number is valid. When the state variable wasDecSepFound is not set the decimal separator is attached to SubString. Using the variable DecimalSeparator of the operating system ensures proper conversion from string to float later.

When the decimal separator is no delimiter and one of this symbols is already included in SubString again the option arises, whether ActualValue = ValueNo to terminate search. Otherwise Substring will be deleted, a decimal separator is used as first symbol therein and wasDecSepFound set.

Source code

Function GetNthValueInString (aString: String; ValueNo: Integer; IncludeMinus: Boolean; IncludeDecSep: Boolean): String;
// This function extracts the n'th value given in the variable ValueNo out
// of a string. With the boolean variables one can specify whether negative
// values or decimal separators are evaluated; first reappearance of a decimal
// separator or minus sign within a number reinitialises this char as beginning
// of the next number
// (c) Dr. Jan Schulz, 2007; www.code10.info
Var StringPos      : Integer;
    aChar          : Char;
    ActualNumber   : Integer;
    SubString      : String;
    isExtracted    : Boolean;
    wasDecSepFound : Boolean;
Begin
  // initialise variables
  Result         := '';    // return value when requested value is not found
  ActualNumber   := 0;     // begin with no value found
  StringPos      := 0;     // start at position left of first char
  SubString      := '';    // start with an empty substring variable
  isExtracted    := False; // requested value is not found so far
  wasDecSepFound := False; // no DecimalSeparator is found

  // are characters in the string?
  If Length (aString) > 0 THen
  Begin
    Repeat
      // move to next character position in the string
      Inc (StringPos);
      // and get it it
      aChar := aString[StringPos];

      Case aChar of
        '0'..'9' : Begin
                     // was a new number found?
                     If (SubString = '') Or
                        (SubString = '-') Or
                        (SubString = DecimalSeparator) Or
                        (SubString = ('-' + DecimalSeparator)) THen Inc (ActualNumber);
                     // attach cipher to substring
                     SubString := SubString + aChar;
                   end;

        '-'      : Begin
                     // is the already found number requested one?
                     If ActualNumber = ValueNo THen
                     Begin
                       isExtracted := True;
                     end
                     Else
                     // otherwise evaluate minus sign
                     Begin
                       //do we care for minus signs?
                       If IncludeMinus THen
                       Begin
                         // minus sign is first symbol for SubString
                         SubString      := '-';
                         wasDecSepFound := False;
                       end
                       Else
                       // otherwise beginning of a new number
                       Begin
                         // reset states and SubString
                         SubString      := '';
                         wasDecSepFound := False;
                       end;
                     end;
                   end;

        '.', ',' : Begin
                     // do we care for decimal separators?
                     If IncludeDecSep THen
                     Begin
                       // did we already find a decimal separator?
                       If wasDecSepFound THen
                       Begin
                         // terminate when the requested number was found
                         If ActualNumber = ValueNo THen
                         Begin
                           isExtracted := True;
                         end
                         Else
                         // otherwise interpret char as first of a new number
                         Begin
                           // attach decimal separator and do not reset the wasDecSepFound
                           wasDecSepFound := True;
                           SubString      := DecimalSeparator;
                         end;
                       end
                       Else
                       // first decimal separator in number is attached and presence notified
                       Begin
                         wasDecSepFound := True;
                         SubString      := SubString + DecimalSeparator;
                       end;
                     end
                     Else
                     // decimal separators are delimiters
                     Begin
                       // is the already found number the requested one?
                       If ActualNumber = ValueNo THen
                       Begin
                         isExtracted := True;
                       end
                       Else
                       // otherwise reset substring for next try
                       Begin
                         SubString      := '';
                         wasDecSepFound := False;
                       end;
                     end;
                   end;

        // all other chars are delimiters
        Else
          // is the already found number the requested one?
          If ActualNumber = ValueNo THen
          Begin
            isExtracted := True;
          end
          Else
          //otherwise reset substring for next try
          Begin
            SubString      := '';
            wasDecSepFound := False;
          end;
        end;

    Until isExtracted Or (StringPos >= Length (aString));
  end;

  If ActualNumber = ValueNo THen Result := SubString;
end;

Examples

Observing the variables while parsing the second line of the fictive example above shown in the Objective section above demonstrates internal functioning of the state machine. The state of the variables is displayed after the actual symbol is processed. This point is marked with an asterisk in Figure 1.

Example 1: A call of the function with:

aString = ’0.5 –1324 00101010 TRUE 0.213’
ValueNo = 1
IncludeMinus = TRUE
IncludeDecSep = TRUE

results stepwise in:

*aChar*	*StringPos*	*ActualNumber*	*isExtracted*	*wasDecSepFound*	*SubString*
0	1	1	FALSE	FALSE	0
.	2	1	FALSE	TRUE	0.
5	3	1	FALSE	TRUE	0.5
	4	1	TRUE	TRUE	0.5

After the fourth step a symbol that does not belong to a number causes evaluation of the boolean expression (ActualNumber = ValueNo). As the outcome is TRUE, isExtracted is set the content of SubString is not discarded and causes the state machine to halt. Thereafter, the same evaluation returns content of substring as result.

Example 2: Calling the function with the parameters:

aString = ’0.5 –1324 00101010 TRUE 0.213’
ValueNo = 4
IncludeMinus = True
IncludeDecSep = True

successively results in:

*aChar*	*StringPos*	*ActualNumber*	*isExtracted*	*wasDecSepFound*	*SubString*
0	1	1	FALSE	False	0
.	2	1	FALSE	TRUE	0.
5	3	1	FALSE	TRUE	0.5
	4	1	FALSE	False
-	5	1	FALSE	False	-
1	6	2	FALSE	False	-1
3	7	2	FALSE	False	-13
2	8	2	FALSE	False	-132
4	9	2	FALSE	False	-1324
	10	2	FALSE	False
0	11	3	FALSE	False	0
0	12	3	FALSE	False	00
1	13	3	FALSE	False	001
0	14	3	FALSE	False	0010
1	15	3	FALSE	False	00101
0	16	3	FALSE	False	001010
1	17	3	FALSE	False	0010101
0	18	3	FALSE	False	00101010
	19	3	FALSE	False
T	20	3	FALSE	False
R	21	3	FALSE	False
U	22	3	FALSE	False
E	23	3	FALSE	False
	24	3	FALSE	False
0	25	4	FALSE	False	0
.	26	4	FALSE	True	0.
2	27	4	FALSE	True	0.2
1	28	4	FALSE	True	0.21
3	29	4	FALSE	True	0.213

When StringPos is as high as the number of symbols in aString the algorithm terminates the loop. As already mentioned isExtracted is not set to TRUE at this point. Consequently, the boolean expression (ActualNumber = ValueNo) decides whether the content of SubString is returned as result.

Homework

The described algorithm works correct under the terms described above. However, the critical reader may be interested in analysing the above explanations in more detail. He will find two sites where state variables are modified although not necessary. Additionally, the code may show some redundancies.

Last Updated on Friday, 18 March 2011 18:26

Extract the nth number from a string