xkcd rockar

Introduction

Regular expressions is somewhat of an oxymoron. They are anything but regular, they are strange beasts that you can write wonderful things in that you won't understand fully yourself after you've written them. I blame (as most things) perl for most of the evolution of regular expressions, they ones contained in perl are very powerful, but with great power comes an awful lot of responsibility spaghetti. There is even a term perl regular expressions naming the particularly useful language set supported by perl, which given that perl is a text processing language foremost comes as no surprise. For the more python oriented programmer, python of course also offers a regular expressions library supporting much of the perl stuff, as well as more goodies.

One of the things that C++ lacks from many other languages is out of the box regular expressions. There are a couple of libraries out there that do provide regular expressions support, but they are all not part of the standard library (yet). Here is a brief look at what's out there and what I choose for my own code.

What are we looking for?

So I set out to look for a library that could serve as the backend to whatever regular expression interface I decided to write on my own. Now the ideal library for me should be a C API that you can compile separately and link in statically with your application. Simple and well documented with no (or no significant) external dependencies. Much like the popular compression library zlib. The actual work should be done in the library and my little frontend will merely be a frontend that handles pesky things like error checking, convenient hooks into std::string and memory management so that it will be easy to write tools code that can do regular expressions. I imagined my own interface to be a very light C++:ish wrapper around some opaque state. One other thing about the library. I require it to actually work as advertised and not leak memory!

The candidates

After doing some web-surfing research I came up with a couple of candidates:

Let's take a look at them one by one and see what they have to offer.

John Maddock's boost regular expressions

If you do a search for regular expressions on google you will probably find fairly high up a library written by John Maddock. This has later been incorporated into boost as one of the string libraries. My take on boost is that it represents much that is bad with C++. It's usually too cute/clever, implemented in templates and incur horrible build times. Trying to build boost itself (oh, you thought it was a header only library? Wrong! It only pretends to be. You get the worst of both worlds.) is an exercise in futility. At least some gray hairs has come out of it. As a side note, check out how you take a perfectly fast build system like jam and make it slower than scons, tada - you have boost jam. The funny thing is that none of the decisions they made for boost jam are particularly bad, just that they missed the most important one, speed. If it's slow, it doesn't matter.

Anyhow, the boost regular expression library is pretty much bogged down by the downsides of boost itself, the library pretty much does follow the paradigm and after a very brief moment of craziness I considered pulling boost down, but then I regain sanity. It's a shame since although I don't particularly fancy the actual API to this library, it does support both PERL regular expressions and the POSIX versions.

John Maddock's pre-boost regular expressions

There is actually a pre-boost version of the library on John Maddock's homepage which was quite interesting. It was supposed to do most of the boost version, but without any dependencies on boost itself. Whoho. After pulling this down and massaging it somewhat to build, it turns out that it fails under the criteria "do not leak memory". I could not get the library to not leak memory like crazy, which quickly made it a no-go. And this was trying with the simplified C api that was provided.

Stepping through the code gave me a headache. It's a classical case why type traits, templates and inheritance is a bad thing to mix. Looking for the multitude of allocations the library does, for this simple piece of code:

regcomp( state, "Hello", 0);
regfree(state);

Results in 27 (!) memory leaks.


Memory leaks detected (27)!
c:\aurora\shared\core\new.cpp(29):     24 bytes at 0x00356880
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00356a00
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00356d00
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00356e80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00357000
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00357180
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00357380
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00357500
c:\aurora\shared\core\new.cpp(44):     60 bytes at 0x00357680
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357800
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357980
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357b00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357c80
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357e00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357f80
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358100
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358280
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358400
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358580
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358700
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358880
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358a00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358b80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00358d00
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00358e80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00359000
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00359900

The library has bindings to the standard C++ library allocator interfaces. And they frankly suck. Now, I'm not too crazy about the C interface with realloc, most of the times I prefer the malloc/free interfaces since they are so simple and you will have a hard time to mess them up. Implementing hooks in a third party library is pretty simple as long as you don't stray from the malloc interface. For reasons concerning the language and the standard libraries, the C++ committee decided upon the horrible (non-statful) allocator interface that's part of the C++ standard. It's definitely not pretty, and every time I'm forced to use that to hook up std::map or such I kind of cringe (and die) a little inside. Why anyone would voluntarily expose themselves to this interface is beyond me, when they have the malloc interface to choose from.

To cut a long story short, after a while I just gave up trying to patch the library from it's leaks. If it could not get the init/shutdown sequence right, I don't want to spend more time on getting it right. So I started to look at more alternatives below.

C++ TR1

For those geeks that have lived in the real world for the past couple of years, they might not know of a little thing called TR1 for the C++ language (Meyer's has an article about it). Some of the things in TR1 are headed for C++0x, which is due any time now (hey, they have only two more years to go!). Some compilers already have support for TR1 (MSVC, partial gcc). If your compiler already supports this, then it might be an option to investigate. Most of these things are just imports of a kind from Boost though, so they do suffer in some cases of over engineering in my opinion. But they do sidestep the whole issue of building and maintaining boost. You don't get away from the added compile times though.

Henry Spencer's regular expressions library

One other interesting option was Henry Spencer's regular expression library, which can be found here. The older version is much easier to build yourself if you are on a windows machine, so that's the one I tried out. The problem with the old version of the library is that it implements neither the perl regular expressions nor the POSIX ones. Instead it implements the older API from regexp (notice the last 'p') . The newer version of the library, 3.8, does implement the POSIX header regex but looking at the makefile (which pulls in a bourne shell script as well) for the library made my head hurt. Clearly, on a UNIX machine this might work, but on a windows machine I'm just lost. The older version did compile fairly easily and worked reasonably, but it didn't implement the "right" regular expressions so it kind of fell flat on it's face.

Microsoft Research's GRETA

During my research I found that Microsoft's made what looks like a very good implementation of regular expressions called GRETA. Unfortunately the only download link I could find for this library was one encumbered by the Microsoft Research License. Although the main page claims that there is a commercial version of the library, there is no actual working link to that magical version since it points at the gotdotnet site, which got shutdown.

ATL regular expressions

If you're willing to encumber yourself with dependencies on ATL (basically MFC light on templates) you can check out the CAtlRegExp class that handles regular expressions. I have not looked into this too much since it wasn't an option for me though.

Perl Compatible Regular Expressions

For last I've saved the library I went with, PCRE. The library was reasonably easy to compile, there is a step by step guide provided for non unix systems, as well as providing a reasonably clean C API with documentation. There is a C++ API to the api as well, but I largely avoided this since I prefer the C api for simplicity. After hooking up the memory redirection routines correctly I was off. No memory leaks detected yet in my unit tests. Although the step by step guide looks daunting for the non unix systems, it's pretty straight forward. You can also find the already patched source and visual studio build files here for the library.

In closing

It was surprisingly hard to find something that suited my tastes for a regular expressions library. For a brief moment I was tempted to try to write something myself while at a particular low point after all the memory leaks I encountered, but then I came to my senses. In the end I did spend too much time on this anyways, trying out different libraries and coming to a point where I actually could write a couple of unit tests for them took some time. The only thing that tarnishes this a bit is the fact that over the horizon is the C++0x with it's boost like regular expression library... but that's a couple of years in the future. By then perhaps C++ has died out and we're all comfortably programming in LISP.


Resources

Comments