Article
10/02/2015

October 2015

Volume 30 Number 10

C# - A Split-and-Merge Expression Parser in C#

I published a new algorithm for parsing a mathematical expression in C++ in the May and July 2015 issues of CVu Journal (see items 1 and 2 in “References”). It took two articles because one astute reader, Silas S. Brown, found a bug in the first algorithm implementation, so I had to make some modifications to it. Thanks to that reader, the algorithm became more mature. I’ve also fixed a few smaller bugs since then. Now I’m going to provide an implementation of the corrected algorithm in C#.

It’s not too likely you’ll ever have to write code to parse a mathematical expression, but the techniques used in the algorithm can be applied to other scenarios, as well, such as parsing non-standard strings. Using this algorithm you can also easily define new functions that do whatever you wish (for example, make a Web request to order a pizza). With small adjustments, you can also create your own C# compiler for your new custom scripting language. Moreover, you just might find the algorithm interesting in its own right.

The Edsger Dijkstra algorithm, published more than 50 years ago in 1961, is often used for parsing mathematical expressions (see item 3 in “References”). But it’s good to have an alternative that, though it has the same time complexity, is, in my opinion, easier to implement and to extend.

Note that I’m going to use the “virtual constructor” idiom for the function factory implementation. This idiom was introduced in C++ by James Coplien (see item 4 in “References”). I hope you’ll find its use interesting in the C# world, as well.

The Split-and-Merge Algorithm

The demo program in Figure 1 illustrates the split-and-merge algorithm for parsing a mathematical expression.

Figure 1 A Demo Run of the Split-and-Merge Algorithm

The algorithm consists of two steps. In the first step the string containing the expression is split into a list of Cell objects, where each Cell is defined as follows:

class Cell
{
  internal Cell(double value, char action)
  {
    Value = value;
    Action = action;
  }
  internal double Value  { get; set; }
  internal char   Action { get; set; }
}

The action is a single character that can be any of the mathematical operators: ‘*’ (multiplication), ‘/’ (division), ‘+’ (addition), ‘-’ (subtraction) or ‘^’ (power), or a special character denoting the end of an expression, which I hardcoded as ‘).’ The last element in the list of cells to be merged will always have the special action ‘),’ that is, no action, but you can use any other symbol or a parenthesis instead.

In the first step, the expression is split into tokens that are then converted into cells. All tokens are separated by one of the mathematical expressions or a parenthesis. A token can be either a real number or a string with the name of a function. The ParserFunction class defined later takes care of all of the functions in the string to be parsed, or for parsing a string to a number. It may also call the whole string parsing algorithm, recursively. If there are no functions and no parentheses in the string to parse, there will be no recursion.

In the second step, all the cells are merged together. Let’s see the second step first because it’s a bit more straightforward than the first one.

Merging a List of Cells

The list of cells is merged one by one according to the priorities of the actions; that is, the mathematical operators. These priorities are calculated as follows:

static int GetPriority(char action)
{
  switch (action) {
    case '^': return 4;
    case '*':
    case '/': return 3;
    case '+':
    case '-': return 2;
  }
  return 0;
}

Two cells can be merged if and only if the priority of the action of the cell on the left isn’t lower than the priority of the action of the cell next to it:

static bool CanMergeCells(Cell leftCell, Cell rightCell)
{
  return GetPriority(leftCell.Action) >= GetPriority(rightCell.Action);
}

Merging cells means applying the action of the left cell to the values of the left cell and the right cell. The new cell will have the same action as the right cell, as you can see in Figure 2.

Figure 2 Merge Cells Method

static void MergeCells(Cell leftCell, Cell rightCell)
{
  switch (leftCell.Action)
  {
    case '^': leftCell.Value = Math.Pow(leftCell.Value, rightCell.Value);
      break;
    case '*': leftCell.Value *= rightCell.Value;
      break;
    case '/': leftCell.Value /= rightCell.Value;
      break;
    case '+': leftCell.Value += rightCell.Value;
      break;
    case '-': leftCell.Value -= rightCell.Value;
      break;
  }
  leftCell.Action = rightCell.Action;
}

For example, merging Cell(8, ‘-’) and Cell(5, ‘+’) will lead to a new Cell(8 – 5, ‘+’) = Cell (3, ‘+’).

But what happens if two cells can’t be merged because the priority of the left cell is lower than the right cell? What happens then is a temporary move to the next, right cell, in order to try to merge it with the cell next to it, and so on, recursively. As soon as the right cell has been merged with the cell next to it, I return to the original, left cell, and try to remerge it with the newly created right cell, as illustrated in Figure 3.

Figure 3 Merge a List of Cells

static double Merge(Cell current, ref int index, List<Cell> listToMerge,
                    bool mergeOneOnly = false)
{
  while (index < listToMerge.Count)
  {
    Cell next = listToMerge[index++];
    while (!CanMergeCells(current, next))
    {
      Merge(next, ref index, listToMerge, true /* mergeOneOnly */);
    }
    MergeCells(current, next);
    if (mergeOneOnly)
    {
      return current.Value;
    }
  }
  return current.Value;
}

Note that, from the outside, this method is called with the mergeOneOnly parameter set to false, so it won’t return before completing the whole merge. In contrast, when the merge method is called recursively (when the left and the right cells can’t be merged because of their priorities), mergeOneOnly will be set to true because I want to return to where I was as soon as I complete an actual merge in the MergeCells method.

Also note that the value returned from the Merge method is the actual result of the expression.

Splitting an Expression into a List of Cells

The first part of the algorithm splits an expression into a list of cells. Mathematical operator precedence isn’t taken into account in this step. First, the expression is split into a list of tokens. All tokens are separated by any mathematical operator or by an open or close parenthesis. The parentheses may, but don’t have to, have an associated function; for example, “1- sin(1-2)” has an associated function, but “1- (1-2)” doesn’t.

First, let’s look at what happens when there are no functions or parentheses, just an expression containing real numbers and mathematical operators between them. In this case, I just create cells consisting of a real number and a consequent action. For example, splitting “3-2*4” leads to a list consisting of three cells:

Split (“3-2*4”) = { Cell(3, ‘-’), Cell(2, ‘*‘), Cell(3, ‘)’) }

The last cell will always have a special END_ARG action, which I define as:

const char END_ARG = ')';

It can be changed to anything else, but in that case the corresponding opening parenthesis START_ARG must be taken into account, as well, which I define as:

const char START_ARG = '(';

If one of the tokens is a function or an expression in parentheses, the whole split-and-merge algorithm is applied to it using recursion. For example, if the expression is “(3-1)-1,” the whole algorithm in the parentheses is applied to the expression first:

Split(“(3-1)-1”) = Split( “[SplitAndMerge(“3-1”)] - 1”) = { Cell(2, ‘-’), Cell(1, ‘)’) }

The function that performs the splitting is LoadAndCalculate, as shown in Figure 4.

Figure 4 LoadAndCalculate Method

public static double LoadAndCalculate(string data, ref int from, char to = END_LINE)
{
  if (from >= data.Length || data[from] == to)
  {
    throw new ArgumentException("Loaded invalid data: " + data);
  }
  List<Cell> listToMerge = new List<Cell>(16);
  StringBuilder item = new StringBuilder();
  do // Main processing cycle of the first part.
  {
    char ch = data[from++];
    if (StillCollecting(item.ToString(), ch, to))
    { // The char still belongs to the previous operand.
      item.Append(ch);
      if (from < data.Length && data[from] != to)
      {
        continue;
      }
    }
    // I am done getting the next token. The getValue() call below may
    // recursively call loadAndCalculate(). This will happen if extracted
    // item is a function or if the next item is starting with a START_ARG '(.'
    ParserFunction func = new ParserFunction(data, ref from, item.ToString(), ch);
    double value = func.GetValue(data, ref from);
    char action = ValidAction(ch) ? ch
                                  : UpdateAction(data, ref from, ch, to);
    listToMerge.Add(new Cell(value, action));
    item.Clear();
  } while (from < data.Length && data[from] != to);
  if (from < data.Length && (data[from] == END_ARG || data[from] == to))
  { // This happens when called recursively: move one char forward.
  from++;
  }
  Cell baseCell = listToMerge[0];
  int index = 1;
  return Merge(baseCell, ref index, listToMerge);
}

The LoadAndCalculate method adds all of the cells to the listToMerge list and then calls the second part of the parsing algorithm, the merge function. The StringBuilder item will hold the current token, adding characters to it one by one as soon as they’re read from the expression string.

The StillCollecting method checks if the characters for the current token are still being collected. This isn’t the case if the current character is END_ARG or any other special “to” character (such as a comma if the parsing arguments are separated by a comma; I’ll provide an example of this using the power function later). Also, the token isn’t being collected anymore if the current character is a valid action or a START_ARG:

static bool StillCollecting(string item, char ch, char to)
{
  char stopCollecting = (to == END_ARG || to == END_LINE) ?
                         END_ARG : to;
  return (item.Length == 0 && (ch == '-' || ch == END_ARG)) ||
        !(ValidAction(ch) || ch == START_ARG || ch == stopCollecting);
}
static bool ValidAction(char ch)
{
  return ch == '*' || ch == '/' || ch == '+' || ch == '-' || ch == '^';
}

I know that I’m done collecting the current token as soon as I get a mathematical operator described in the ValidAction method, or parentheses defined by the START_ARG or END_ARG constants. There’s also a special case involving a “-” token, which is used to denote a number starting with a negative sign.

At the end of this splitting step, all of the subexpressions in parentheses and all of the function calls are eliminated via the recursive calls to the whole algorithm evaluation. But the resulting actions of these recursive calls will always have the END_ARG action, which won’t be correct in the global expression scope if the calculated expression isn’t at the end of the expression to be evaluated. This is why the action needs to be updated after each recursive call, like so:

char action = ValidAction(ch) ? ch
                                : UpdateAction(data, ref from, ch);

The code for the updateAction method is in Figure 5.

Figure 5 Update Action Method

static char UpdateAction(string item, ref int from, char ch, char to)
{
  if (from >= item.Length || item[from] == END_ARG || item[from] == to)
  {
    return END_ARG;
  }
  int index = from;
  char res = ch;
  while (!ValidAction(res) && index < item.Length)
  { // Look to the next character in string until finding a valid action.
    res = item[index++];
  }
  from = ValidAction(res) ? index
                          : index > from ? index - 1
                                         : from;
  return res;
}

The actual parsing of the extracted token will be in the following code of Figure 4:

ParserFunction func = new ParserFunction(data, ref from, item.ToString(), ch);
double value = func.GetValue(data, ref from);

If the extracted token isn’t a function, this code will try to convert it to a double. Otherwise, an appropriate, previously registered function will be called, which can in turn recursively call the LoadAndCalculate method.

User-Defined and Standard Functions

I decided to implement the function factory using the virtual constructor idiom that was first published by James Coplien (see item 4 in “References”). In C#, this is often implemented using a factory method (see item 5 in “References”) that uses an extra factory class to produce the needed object. But Coplien’s older design pattern doesn’t need an extra factory façade class and instead just constructs a new object on the fly using the implementation member m_impl that’s derived from the same class:

private ParserFunction m_impl;

The special internal constructor initializes this member with the appropriate class. The actual class of the created implementation object m_impl depends on the input parameters, as shown in Figure 6.

Figure 6 Virtual Constructor

internal ParserFunction(string data, ref int from, string item, char ch)
{
  if (item.Length == 0 && ch == Parser.START_ARG)
  {
    // There is no function, just an expression in parentheses.
    m_impl = s_idFunction;
    return;
  }
  if (m_functions.TryGetValue(item, out m_impl))
  {
    // Function exists and is registered (e.g. pi, exp, etc.).
    return;
  }
  // Function not found, will try to parse this as a number.
  s_strtodFunction.Item = item;
  m_impl = s_strtodFunction;
}

A dictionary is used to hold all of the parser functions. This dictionary maps the string name of the function (such as “sin”) to the actual object implementing this function:

private static Dictionary<string, ParserFunction> m_functions =
      new Dictionary<string, ParserFunction>();

Users of the parser can add as many functions as they wish by calling the following method on the base ParserFunction class:

public static void AddFunction(string name, ParserFunction function)
{
  m_functions[name] = function;
}

The GetValue method is called on the created ParserFunction, but the real work is done in the implementation function, which will override the evaluate method of the base class:

public double GetValue(string data, ref int from)
{
  return m_impl.Evaluate(data, ref from);
}
protected virtual double Evaluate(string data, ref int from)
{
  // The real implementation will be in the derived classes.
  return 0;
}

The function implementation classes, deriving from the ParserFunction class, won’t be using the internal constructor in Figure 6. Instead, they’ll use the following constructor of the base class:

public ParserFunction()
{
  m_impl = this;
}

Two special “standard” functions are used in the ParserFunction constructor in Figure 6:

private static IdentityFunction s_idFunction = new IdentityFunction();
private static StrtodFunction s_strtodFunction = new StrtodFunction();

The first is the identity function; it will be called to parse any expression in parentheses that doesn’t have an associated function:

class IdentityFunction : ParserFunction
{
  protected override double Evaluate(string data, ref int from)
  {
    return Parser.LoadAndCalculate(data, ref from, Parser.END_ARG);
  }
}

The second function is a “catchall” that’s called when no function is found that corresponds to the last extracted token. This will happen when the extracted token is neither a real number nor an implemented function. In the latter case, an exception will be thrown:

class StrtodFunction : ParserFunction
{
  protected override double Evaluate(string data, ref int from)
  {
    double num;
    if (!Double.TryParse(Item, out num)) {
      throw new ArgumentException("Could not parse token [" + Item + "]");
    }
    return num;
  }
  public string Item { private get; set; }
}

All other functions can be implemented by the user; the simplest is a pi function:

class PiFunction : ParserFunction
{
  protected override double Evaluate(string data, ref int from)
  {
    return 3.141592653589793;
  }
}

A more typical implementation is an exp function:

class ExpFunction : ParserFunction
{
  protected override double Evaluate(string data, ref int from)
  {
    double arg = Parser.LoadAndCalculate(data, ref from, Parser.END_ARG);
    return Math.Exp(arg);
  }
}

Earlier I said I’d provide an example using a power function, which requires two arguments separated by a comma. This is how to write a function requiring multiple arguments separated by a custom separator:

class PowFunction : ParserFunction
{
  protected override double Evaluate(string data, ref int from)
  {
    double arg1 = Parser.LoadAndCalculate(data, ref from, ',');
    double arg2 = Parser.LoadAndCalculate(data, ref from, Parser.END_ARG);
    return Math.Pow(arg1, arg2);
  }
}

Any number of functions can be added to the algorithm from the user code, like so:

ParserFunction.AddFunction("pi",  new PiFunction());
ParserFunction.AddFunction("exp", new ExpFunction());
ParserFunction.AddFunction("pow", new PowFunction());

Wrapping Up

The split-and-merge algorithm presented here has O(n) complexity if n is the number of characters in the expression string. This is so because each token will be read only once during the splitting step and then, in the worst case, there will be at most 2(m - 1) – 1 comparisons in the merging step, where m is the number of cells created in the first step.

So the algorithm has the same time complexity as the Dijkstra algorithm (see item 3 in “References”). It might have a slight disadvantage compared to the Dijkstra algorithm because it uses recursion. On the other hand, I believe the split-and-merge algorithm is easier to implement, precisely because of the recursion, and easier also to extend with the custom syntax, functions, and operators.

References

V. Kaplan, “Split and Merge Algorithm for Parsing Mathematical Expressions,” CVu, 27-2, May 2015, bit.ly/1Jb470l
V. Kaplan, “Split and Merge Revisited,” CVu, 27-3, July 2015, bit.ly/1UYHmE9
E. Dijkstra, Shunting-yard algorithm, bit.ly/1fEvvLI
J. Coplien, “Advanced C++ Programming Styles and Idioms” (p. 140), Addison-Wesley, 1992
E. Gamma, R. Helm, R. Johnson and J. Vlissides, “Design Patterns: Elements of Reusable Object-Oriented Software,” Addison-Wesley Professional Computing Series, 1995

Vassili Kaplan is a former Microsoft Lync developer. He is passionate about programming in C# and C++. He currently lives in Zurich, Switzerland, and works as a freelancer for various banks. You can reach him at iLanguage.ch.

Thanks to the following Microsoft technical expert for reviewing this article: James McCaffrey