Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

Regular Expressions


Pete is a consultant specializing in library design and implementation. He works primarily for Dinkumware Ltd. He has been a member of the C++ Standards Committee since its inception, and is Project Editor for the C++ Standard. He is writing a book about the newly approved Technical Report on C++ Library Extensions; the book will be published this summer by Addison-Wesley. Pete can be contacted at [email protected].


After 20 years in this profession, it's easy to get cynical. I've seen it all. Most of what's considered new and exciting today is recycled from the past, and it's only new and exciting because programmers are so caught up in themselves that they don't know enough about history to recognize that many ideas only have a promising past.

Niklaus Wirth's less-is-more approach to language design has only caught on in small circles of fanatics. [1] Having written a multiple-precision integer package for Java (today's bastard son of minimalism) I can tell you: Unsigned integral types aren't just a frill.

The program-as-data notion from Lisp has metastasized in C++ in the form of template metaprogramming, based on the discovery that, with suitable trickery, template specialization can be used to express fundamental notions from number theory [2] at compile time. The first demonstration I saw of this was Erwin Unruh's simple program that generated prime numbers by creating a compile-time error when one number isn't an exact divisor of another. [3] This naturally lead to a proliferation of code that was so complex that you had to put a drip pan under your computer to catch the pieces as the compiler melted down. [4] It's gotten better, as compiler writers respond to the stress that this sort of programming imposes, but template metaprogramming today still requires a fairly deep study of compiler error messages that often occur far from the place where the actual coding error, if any, occurred. The book I'm writing about TR1 [5] has programming exercises at the ends of most of the chapters, and in many of those exercise sets, the first exercise is to write code with specific kinds of errors and study the error messages that result.

When I was writing the code and documentation for Dinkumware's implementation of TR1's regular expressions, I viewed regular expression libraries as yet another fad. Sure, regular expressions are useful to search through source files to find the places where some member function is called, or to replace one thing with another, so it's handy to have a regular expression interpreter built in to your text editor. And I use sed scripts to extract platform-specific versions of generic source code. But I didn't see their broader legitimate uses until a few weeks ago, when I started reading Jeffrey Friedl's book, Mastering Regular Expressions. [6] If your use of regular expressions is like mine—an occasional search or search and replace—you may find this book valuable. Many of the problems that I would have solved with an ad hoc parser can be solved more quickly with a regular expression or two. Of course, that development speed comes at a cost: A regular expressions library is not simple to write, and it involves a fair amount of code. But if the code overhead isn't prohibitive and you already have a regular expressions library, go for it.

There are quite a few regular expression libraries available for use with C++. In this column, I look at the one that's probably going to become the de facto standard—the TR1 Regular Expressions Library. [7] TR1, technically, is the "Technical Report on C++ Standard Library Extensions," recently approved for standardization by the ISO in a ballot of national bodies. There's still a year or so of administrative work to be done before the TR is officially published, but while you're waiting, you can get a copy at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf. Keep in mind that technical reports do not change the C++ Standard. Some of what's in TR1 will become part of Standard C++ in the next revision, but much of it will not. However, there are several implementations of TR1 becoming available from various places, [8] so whether its regular expressions library becomes standard or not, it's a handy tool to have available.

A Simple Programming Problem

There are many ways to split a piece of text into individual words, in order to count the number of times each word occurs. You can write an ad hoc parser that steps through the characters looking for word boundaries; you can use the much-maligned, but useful, C function strtok; you can use the string member functions find_first_of and find_first_not_of; and you can write a regular expression. There are many others, too, but that's enough for now—the code samples for this column are already too long. Before looking at the code for these approaches, though, you need to look at how to write and use regular expressions with the TR1 library.

Regular Expression Basics

If you're going to use a regular expression, you first have to write it. Probably the most common way to write regular expressions with the TR1 library is as ordinary C-style strings, although the library lets you write them in several different forms. The one thing to keep in mind when writing a regular expression is that both C++ and the regular expression machinery give special meaning to a backslash character. When you put backslashes in string literals, the C++ compiler gets the first shot at them, so if you want to write a regular expression that uses a backslash, you have to write two backslashes in the string literal. If you're writing a regular expression that should contain a backslash followed by the letter "b," write it like this: "\\b." If you write "\b" instead, you've got a completely different regular expression. The following regular expression matches any one of the characters listed between the square brackets.

const char *expr = "[ ,.\\t\\n;:]";

Once you've written a regular expression, you need to tell the library about it. You do this with an object whose type is an instance of the template basic_regex. Its type argument names the type of character that you'll be using. [9] Because most programs traffic in char or wchar_t character types, the library provides two specializations for those types, named regex and wregex, respectively. You create an object of one of these types by passing a suitable regular expression to its constructor:

regex rgx(expr);

When you search for text that matches a regular expression, you usually need to know where the matching text was found. To do that, you use an object whose type is an instance of the template match_results. Because match_results holds locations in text, its template argument is an iterator type that can point into your character sequence. The most common forms of character sequences are null-terminated character arrays and C++ strings. There are four specializations of match_results, one for each of the char and wchar_t variants of these two types. The specialization cmatch is a synonym for match_results<const char*>; it holds pointers into an array of char. Similarly, wcmatch is a synonym for match_results<const wchar_t*>. The other two, smatch and wsmatch, are synonyms for match_results<string::const_iterator> and match_results<wstring::const_iterator>, respectively. They point into basic_string objects. In this example, we're going to search through an array of char, so we need an object of type cmatch to hold the results.

cmatch match;

Next, you need the text that you're going to search. This can be passed to the search algorithms as a pointer to the first character in a null-terminated array, as a reference to a basic_string object or as a pair of bidirectional iterators that delineate the text to be searched.

const char *tgt = "This is a test.";

Now you're ready to do the search:

if (regex_search(tgt, match, rgx))
cout << "Match found after
'" << match.prefix() << "'\n";
else
cout << "Not found.";

These code snippets are assembled, along with the necessary boilerplate, in Listing One.

#include <regex>
#include <iostream>
using std::tr1::regex; using std::tr1::cmatch;
using std::tr1::regex_search;
using std::cout;

int main()
  { // demonstrate regular expression search
  const char *expr = "[ ,.\\t\\n;:]";
  regex rgx(expr);
  cmatch match;
  const char *tgt = "This is a test.";
  if (regex_search(tgt, match, rgx))
    cout << "Match found after `" << match.prefix() << "`\n";
  else
    cout << "Not found.";
  return 0;
  }
Listing One: A basic regex search.

Solving the Real Problem

Now that we've looked at a sketch of how to use regular expressions, let's look at the problem I described earlier—counting the number of occurrences of each word in some chunk of text. The sample code implements each of the four approaches that I mentioned earlier, including two examples that use regular expressions. Listing Two is the boilerplate for this program. It takes some liberties with good style, to keep the presentation short. The details of the searches are in the rest of the listings.

#include <regex>
#include <algorithm>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
using std::tr1::regex; using std::tr1::cmatch;
using std::tr1::regex_constants::match_continuous;
using std::tr1::sregex_token_iterator;
using std::map;
using std::cout; using std::basic_ostream;
using std::setw; using std::ostream_iterator;
using std::string; using std::copy;

static char text[] =
"The quality of mercy is not strain'd,\n"
"It droppeth as the gentle rain from heaven\n"
"Upon the place beneath: it is twice bless'd;\n"
"It blesseth him that gives and him that takes:\n"
"'Tis mightiest in the mightiest; it becomes\n"
"The throned monarch better than his crown;\n"
"His sceptre shows the force of temporal power,\n"
"The attribute to awe and majesty,\n"
"Wherein doth sit the dread and fear of kings\n";
// William Shakespeare, from "The Merchant of Venice"

    // word separators, as plain text and as regular expression
static char separators[] = " ,.\t\n;:";
static char seps_rgx[] = "[ ,.\\t\\n;:]+";
static char words_rgx[] = "([^ ,.\\t\\n;:]+)([ ,.\\t\\n;:]+|$)";

    // the important types, and the database
typedef map<string, int> counter;
typedef counter::value_type pairs;
static counter word_count;

namespace std { // add inserter to namespace std
template <class Elem, class Alloc>
basic_ostream<Elem, Alloc>& operator<<(
  basic_ostream<Elem, Alloc>& out, const pairs& val)
  {  // insert pair<string, int> into stream
  return out << setw(10) << val.first << ": " << val.second;
  }
}
    // the search functions
//static void use_ad_hoc_parser();
//static void use_strtok();
//static void use_string();
static void use_regex();
//static void use_regex_iter();

static void show_results(const char *title)
  { // show contents of word_count, then clear
  cout << title << " --------------------\n";
  copy(word_count.begin(), word_count.end(),
    ostream_iterator<pairs>(cout, "\n"));
  word_count.clear();
  }
int main()
  { // demonstrate various counting techniques
  use_ad_hoc_parser();
  show_results("ad hoc parser");
  use_strtok();
  show_results("strtok");
  use_string();
  show_results("string");
  use_regex();
  show_results("regular expression");
  use_regex_iter();
  show_results("regular expression iterator");
  return 0;
  }
Listing Two: Counting occurences.

The search using an ad hoc parser is in Listing Three. It advances through the text one character at a time, using a Boolean variable to keep track of whether it's currently in a word. If it's not in a word and the current character is not a word separator, then it must be at the beginning of a word, so it notes the current position and sets the Boolean variable to True. If it's in a word and the current character is a word separator, then it's reached the end of the word. It adds one to the word's frequency count [10] and sets the Boolean variable to False. In all cases, it advances to the next character position.

static void use_ad_hoc_parser()
  { // count word frequencies with ad hoc parser
  const char *txt = text;
  // skip leading whitespace:
  while (*txt && strchr(separators, *txt))
    ++txt;
  bool inword = false;
  string word;
  while (*txt)
    { // classify current character
    bool at_sep = strchr(separators, *txt);
    const char *start;
    if (!inword && !at_sep)
      { // at start of word
      inword = true;
      start = txt;
      }
    else if (inword && at_sep)
      { // at end of word
      inword = false;
      word.assign(start, txt);
      ++word_count[word];
      }
    ++txt;
    }
  }
Listing Three: Search using an ad hoc parser.

The search using strtok is in Listing Four. Because strtok modifies the string that it's searching, the function first makes a copy of the target text. Then it calls strtok to look for the first word. It then enters a loop, which terminates when strtok doesn't find another word. The body of the loop creates a string object from the matched text, and adds one to the word's frequency count. Finally, it calls strtok again, to find the next word.

static void use_strtok()
  { // count word frequencies with strtok
  string word;
  char cpy[sizeof(text)/sizeof(*text)];
  strcpy(cpy, text);
  const char *start = strtok(cpy, separators);
  while (start)
    { // at start of word
    word.assign(start);
    ++word_count[word];
    start = strtok(0, separators);
    }
  }
Listing Four: Search using strtok.

The search using basic_string member functions is in Listing Five. It creates a string object that holds a copy of the target text. [11] Then it calls basic_string::find_first_not_of to find the first character in the string object that isn't a separator, and makes a note of its position. It then enters a loop, which terminates when the current position is at the end of the string. The body of the loop calls basic_string::find_first_of, starting at the current position, to find the next separator character. It calls basic_string::substr to create a new string object that contains the characters from the current position to the next separator, and adds one to that word's frequency count. Then it calls basic_string::find_first_not_of, beginning at the position of the separator, and sets the current position to the location of the next character that isn't a separator.

static void use_string()
  { // count word frequencies with string member functions
  string cpy(text);
  string::size_type pos = cpy.find_first_not_of(separators);
  while (pos != string::npos)
    { // at start of word
    string::size_type end = cpy.find_first_of(separators, pos);
    ++word_count[cpy.substr(pos, end == string::npos ? end : (end - pos))];
    pos = cpy.find_first_not_of(separators, end);
    }
  }
Listing Five: Search using basic_string member functions.

The first search using regular expressions is in Listing Six. It creates two pointers, the first holding the current position in the target text, and the second pointing past the end of the target text. It creates a regular expression object that matches a sequence of one or more separator characters, and uses that object to skip past any initial separator characters. It then changes the regular expression object to hold an expression that matches a sequence of characters that are not separators, followed by one or more separators or by the end of the target text. [12] It then enters a loop, which terminates when the attempted match fails; for instance, when it has reached the end of the target text. The call to regex_search looks for text that matches the regular expression.

static void use_regex()
  { // count word frequencies with regular expression
  const char *begin = text;
  const char *end = text + strlen(text);
  cmatch match;
  // skip leading white space:
  regex rgx(seps_rgx);
  if (regex_search(begin, end, match, rgx, match_continuous))
    begin = match[0].second;
  // start search
  rgx = words_rgx;
  while (regex_search(begin, end, match, rgx, match_continuous))
    { // found a word
    ++word_count[match[1].str()];
    begin = match[0].second;
    }
  }
Listing Six: The first search using regular expressions.

When it succeeds, match[1] holds iterators that point to the beginning and the end of the text that matched the first part of the regular expression; that is, the leading sequence of characters that are not separators. The code adds one to the frequency count for that word, and sets the current position to the character that follows the full match; that is, the first character after the separators.

The second search using regular expressions is in Listing Seven. It's somewhat shorter than Listing Six, and the apparent logic is simpler. It first creates a regular expression iterator object of type sregex_token_iterator. The first two arguments are two iterators that point to the beginning and end of the target text. The next argument is the regular expression to search for. The final argument, with the value -1, says to return text sequences that are separated by sequences that match the regular expression. So later on, the expression *words is a string object that holds the most recently found word. The code adds one to the frequency count for that word. The expression words++ moves the iterator to the next token and the loop repeats. The iterator end marks the end of the sequence of tokens; when words is equal to end the loop terminates.

static void use_regex_iter()
  { // count word frequencies with regular expression iterator
  regex word_sep(seps_rgx);
  sregex_token_iterator words(
    text, text + strlen(text), word_sep, -1);
  sregex_token_iterator end;
  while (words != end)
    ++word_count[*words++];
  }
Listing Seven: The second search using regular expressions.

Conclusion

As you can see from the various search functions, using more powerful tools generally means you have to write less code. That's good, if the overhead imposed by the tools isn't too high. If you're writing for embedded systems with tight memory requirements, you may be better off with an ad hoc parser or maybe strtok. For systems that aren't challenged for memory, regular expressions can help you get your product to market more quickly.

Notes

[1] I confess I was one, using Pascal in ways that would have horrified the purists. Then I discovered C, and never looked back.

[2] Thus leading, of course, to the speculation that template metaprogramming as a number theoretic metasystem is powerful enough to be snared by Kurt Goedel's incompleteness theorem.

[3] The trick is that when a constant, b, exactly divides another constant, a, the expression a%b is a compile-time constant with the value 0. Thus, void *ptr = a%b; is okay when b exactly divides a, but is an error when it doesn't.

[4] In fairness, I'm undecided on the utility of template metaprogramming. I suspect it's a fad, and people will move to less complicated solutions as the excitement wears off. I think it's usually overkill, brought about by trying to solve too general a problem. In most applications there really are only a few important data types, and writing template specializations for those types and only those types may be all that's needed. But I could be wrong.

[5] The C++ Standard Library Extensions: A Tutorial and Reference, to be published by Addison-Wesley this summer.

[6] From O'Reilly & Associates, like so many other truly useful programming books. My thanks to Marc Paterno who, despite being a Yankees fan, had the wisdom to recommend this book to me.

[7] Credit for the original design and implementation of the library goes to John Maddock.

[8] http://www.boost.org/ has an updated version of John Maddock's implementation of the library. Dinkumware is in the final stages of putting together its TR1 implementation, which includes, of course, the regular expressions library.

[9] There's a second type argument that lets you customize the handling of the regular expression. It has a default type, which will almost always be what you need, so for now, think of basic_regex as taking only one type argument.

[10] Don't forget, if the key isn't already present in a map, operator[] adds it, with its associated value set to the value of the aggregate initializer for that type. Here, the associated value is of type int, so new key/value pairs are initialized with a count of 0. Adding 1 to it just works.

[11] This isn't necessary to the algorithm, but in this program the original text is in the form of an array of char, so it has to be copied into a string to use the string member functions.

[12] Phew. That's dense. Which is why we have the version in Listing Seven.

DDJ


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.