January 2014

Volume 29 Number 1

Windows with C++ : Using Regular Expressions with Modern C++

Kenny Kerr | January 2014

Kenny Kerr“C++ is a language for developing and using elegant and efficient abstractions.”—Bjarne Stroustrup

This quote from the creator of C++ really sums up what I love about the language. I get to develop elegant solutions to my problems by combining the language features and programming styles that I deem most suitable to the task.

C++11 introduced a long list of features that are in themselves quite exciting, but if all you see is a list of isolated features, then you’re missing out. The combination of these features makes C++ into the powerhouse that many have grown to appreciate. I’m going to illustrate this point by showing you how to use regular expressions with modern C++. The C++11 standard introduced a powerful regular expression library, but if you use it in isolation—using a traditional C++ programming style—you might find it somewhat tiresome. Unfortunately, this is the way that most of the C++11 libraries tend to be introduced. However, there is some merit in such an approach. If you were looking for a concise example of using some new library, it would be rather overwhelming to be forced into comprehending a slew of new language features at the same time. Still, the combination of C++ language and library features really turns C++ into a productive programming language.

To keep the examples focused on C++ and not on regular expressions, I’m necessarily going to use very simplistic patterns. You might wonder why I’m using regular expressions for such trivial problems, but it helps avoid getting lost in the mechanics of expression processing. Here’s a simple example: I’d like to match strings of names where the names might be formatted “Kenny Kerr” or “Kerr, Kenny.” I need to identify the first name and family name and then print them out in some consistent manner. First up is the target string:

char const s[] = "Kerr, Kenny";

To keep things simple ,I’ll stick to char strings and I’ll avoid using the standard library’s basic_string class except to illustrate the results of certain matches. There’s nothing wrong with basic_string, but I find that most of the work I do with regular expressions tends to be targeted at memory-mapped files. Copying the contents of these files into string objects would only serve to slow down my applications. The standard library’s regular expression support is indifferent and perfectly happy to process sequences of characters without concern for how they’re managed.

The next thing I’ll need is a match object:

auto m = cmatch {};

This is really a collection of matches. The cmatch is a match_results class template that’s been specialized for char strings. At this point, the match “collection” is empty:

ASSERT(m.empty());

I’ll also need a pair of strings to receive the results:

string name, family;

I can now call the regex_match function:

if (regex_match(s, m, regex { R"((\w+) (\w+))" }))
{
}

This function attempts to match the pattern against the entire character sequence. This is in contrast to the regex_search function that’s quite happy to search for a match at any point within the string. I’m just creating the regex object “inline” for brevity, but this is not without cost. If you were going to match this regular expression repeatedly, you might be better off creating the regex object once and then holding on to it for the life of your application. The preceding pattern matches the names using the “Kenny Kerr” format. Assuming it’s a match, I can just copy out the substrings:

name   = m[1].str();
family = m[2].str();

The subscript operator returns the specified sub_match object. An index of zero represents the match as a whole, while subsequent indexes pinpoint any groups identified in the regular expression. Neither the match_results nor the sub_match object will create or allocate a substring. Instead, they delineate the range of characters with a pointer or iterator to the beginning and end of the match or submatch, producing the usual half-open range favored by the standard library. In this case, I’m explicitly calling the str method on each sub_match to create a copy of each submatch as string objects.

That handles the first possible format. For the second I need another call to regex_match with the alternative pattern (technically, you could match both formats with a single expression, but that’s beside the point):

else if (regex_match(s, m, regex { R"((\w+), (\w+))" }))
{
  name   = m[2].str();
  family = m[1].str();
}

This pattern matches the names using the “Kerr, Kenny” format. Notice that I’ve had to reverse the indices, as the first group represented in this regular expression identifies the family name while the second identifies the first name. That’s about it for the regex_match function. Figure 1provides the complete listing for reference.

Figure 1 The regex_match Reference Example

char const s[] = "Kerr, Kenny";
auto m = cmatch {};
string name, family;
if (regex_match(s, m, regex { R"((\w+) (\w+))" }))
{
  name   = m[1].str();
  family = m[2].str();
}
else if (regex_match(s, m, regex { R"((\w+), (\w+))" }))
{
  name   = m[2].str();
  family = m[1].str();
}
else
{
  printf("No match\n");
}

I don’t know about you, but the code in Figure 1 looks tedious to me. While the regular expression library is certainly powerful and flexible, it’s not particularly elegant. I need to know about match_results and sub_match objects. I need to remember how this “collection” is indexed and how to extract the results. I could avoid making the copies, but it quickly becomes onerous.

I’ve already used a number of new C++ language features that you may or may not have come across before, but nothing should be overly startling. Now I want to show how you can use variadic templates to really spice up your regular expression usage. Rather than diving right in with more language features, I’m going to start by showing you a simple abstraction to simplify text processing so I can keep this practical and elegant.

First, I’ll define a simple type to represent a sequence of characters that aren’t necessarily null-terminated. Here’s the strip class:

struct strip
{
  char const * first;
  char const * last;
    strip(char const * const begin,
          char const * const end) :
      first { begin },
      last  { end }
    {}
    strip() : strip { nullptr, nullptr } {}
};

There are undoubtedly numerous such classes that I might reuse, but I find it helps to avoid too many dependencies when producing simple abstractions.

The strip class doesn’t do much, but I’ll augment it with a set of nonmember functions. I’ll start with a pair of functions to define the range generically:

auto begin(strip const & s) -> char const *
{
  return s.first;
}
auto end(strip const & s) -> char const *
{
  return s.last;
}

Although not strictly necessary to this example, I find this technique provides a worthy measure of consistency with the standard library’s containers and algorithms. I’ll get back to the begin and end functions in a moment. Up next is the make_strip helper function:

template <unsigned Count>
auto make_strip(char const (&text)[Count]) -> strip
{
  return strip { text, text + Count - 1 };
}

This function comes in handy when attempting to create a strip from a string literal. For example, I can initialize a strip as follows:

auto s = make_strip("Kerr, Kenny");

Next, it’s often useful to determine the length or size of the strip:

auto size(strip const & s) -> unsigned
{
  return end(s) - begin(s);
}

Here you can see I’m simply reusing the begin and end functions to avoid a dependency on the strip’s members. I could protect the members of the strip class. On the other hand, it’s often helpful to be able to manipulate them directly from within an algorithm. Still, if I don’t need to take a hard dependency, I won’t.

Obviously, it’s simple enough to create a standard string from a strip:

auto to_string(strip const & s) -> string
{
  return string { begin(s), end(s) };
}

This might come in handy if some of the results outlive the original character sequences. That rounds out the basic strip handling. I can initialize a strip and determine its size—and thanks to the begin and end functions, I can use a range-for statement to iterate over its characters:

auto s = make_strip("Kenny Kerr");
for (auto c : s)
{
  printf("%c\n", c);
}

When I first wrote the strip class, I was hoping I could call its members “begin” and “end” instead of  “first” and “last.” The trouble is that the compiler, when confronted with a range-for statement, first attempts to find suitable members that may be called as functions. If the target range or sequence doesn’t include any members called begin and end, then the compiler looks for a suitable pair in the enclosing scope. The trouble is that if the compiler finds members called begin and end but they aren’t suitable, it won’t attempt to look any further. This might seem shortsighted, but C++ has complex name-lookup rules, and anything else would make it even more confusing and inconsistent.

The strip class is a simple little construct, but it doesn’t do much in itself. I’ll now combine it with the regular expression library to produce an elegant abstraction. I want to hide the mechanics of the match object, the tedious part of expression processing. This is where variadic templates come in. The key to understanding variadic templates is realizing you can separate the first argument from the rest. This typically results in compile-time recursion. I can define a variadic template to unpack a match object into subsequent arguments:

template <typename... Args>
auto unpack(cmatch const & m,
            Args & ... args) -> void
{
  unpack<sizeof...(Args)>(m, args...);
}

The “typename...” indicates that Args is a template parameter pack. The corresponding “...” in the type of args indicates that args is a function parameter pack. The “sizeof...” expression determines the number of elements in the parameter pack. The final “...” following args tells the compiler to expand the parameter pack into its sequence of elements.

The type of each argument may be different, but in this case, each will be a non-const strip reference. I’m using a variadic template so an unknown number of arguments can be supported. So far the unpack function doesn’t appear to be recursive. It forwards its arguments to another unpack function with an additional template argument:

template <unsigned Total, typename... Args>
auto unpack(cmatch const & m,
            strip & s,
            Args & ... args) -> void
{
  auto const & v = m[Total - sizeof...(Args)];
  s = { v.first, v.second };
  unpack<Total>(m, args...);
}

However, this unpack function separates the first argument following the match object from the rest. That’s compile-time recursion in action. Assuming the args parameter pack isn’t empty, it calls itself with the rest of the arguments. Eventually the sequence of arguments becomes empty and a third unpack function is needed to deal with this conclusion:

template <unsigned>
auto unpack(cmatch const &) -> void {}

This function doesn’t do anything. It merely acknowledges the fact that the parameter pack may be empty. The previous unpack functions hold the key to unpacking the match object. The first unpack function captured the original number of elements in the parameter pack. This is necessary because each recursive call will effectively produce a new parameter pack with a diminishing size. Notice how I’m subtracting the size of the parameter pack from the original total. Given this total or stable size, I can index into the match collection to retrieve the individual submatches and copy their respective bounds into the variadic arguments.

That takes care of unpacking the match object. Although not required, I still find it helpful to hide the match object itself if it isn’t needed directly—for example, if it’s only needed to access the match prefix and suffix. I’ll wrap up the whole thing to provide a simpler match abstraction:

template <typename... Args>
auto match(strip const & s,
           regex const & r,
           Args & ... args) -> bool
{
  auto m = cmatch {};
  if (regex_match(begin(s), end(s), m, r))
  {
    unpack<sizeof...(Args)>(m, args...);
  }
    return !m.empty();
}

This function is also a variadic template but isn’t in itself recursive. It merely forwards its arguments to the original unpack function for processing. It also takes care of providing a local match object and defining the search sequence in terms of strip’s begin and end helper functions. An almost identical function can be written to accommodate regex_search instead of regex_match. I can now rewrite the example from Figure 1 far more simply:

auto const s = make_strip("Kerr, Kenny");
strip name, family;
if (match(s, regex { R"((\w+) (\w+))"  }, name,   family) ||
    match(s, regex { R"((\w+), (\w+))" }, family, name))
{
  printf("Match!\n");
}

How about iteration? The unpack functions also come in handy for handling the match results of an iterative search. Imagine a string with the canonical “Hello world” in a variety of languages:

auto const s =
  make_strip("Hello world/Hola mundo/Hallo wereld/Ciao mondo");

I can match each one with the following regular expression:

auto const r = regex { R"((\w+) (\w+))" };

The regular expression library provides the regex_iterator to iterate through the matches, but using iterators directly can become tedious. One option is to write a for_each function that calls a predicate for each match:

template <typename F>
auto for_each(strip const & s,
              regex const & r,
              F callback) -> void
{
  for (auto i = cregex_iterator { begin(s), end(s), r };
       i != cregex_iterator {};
       ++i)
  {
    callback(*i);
  }
}

I could then call this function with a lambda expression to unpack each match:

for_each(s, r, [] (cmatch const & m)
{
  strip hello, world;
  unpack(m, hello, world);
});

This certainly works, but I always find it frustrating that I can’t easily break out of this type of loop construct. The range-for statement provides a more convenient alternative. I’ll begin by defining a simple iterator range that the compiler will recognize to implement the range-for loop:

template <typename T>
struct iterator_range
{
  T first, last;
  auto begin() const -> T { return first; }
  auto end() const -> T { return last; }
};

I can now write a simpler for_each function that just returns an iterator_range:

auto for_each(strip const & s,
              regex const & r) -> iterator_range<cregex_iterator>
{
  return
  {
    cregex_iterator { begin(s), end(s), r },
    cregex_iterator {}
  };
}

The compiler will take care of producing the iteration and I can simply write a range-for statement with minimal syntactic overhead, breaking early if I so choose:

for (auto const & m : for_each(s, r))
{
  strip hello, world;
  unpack(m, hello, world);
  printf("'%.*s' '%.*s'\n",
         size(hello), begin(hello),
         size(world), begin(world));
}

The console presents the expected results:

'Hello' 'world'
'Hola' 'mundo'
'Hallo' 'wereld'
'Ciao' 'mondo'

C++11 and beyond provide an opportunity to revitalize C++ software development with a modern style of programming that lets you produce elegant and efficient abstractions. The regular expression grammar can humble even the most seasoned of developers. Why not spend a few minutes developing a more elegant abstraction? At least the C++ portion of your task will be a pleasure!


Kenny Kerr is a computer programmer based in Canada, an author for Pluralsight and a Microsoft MVP. He blogs at kennykerr.ca and you can follow him on Twitter at twitter.com/kennykerr.