Finding a regular expression library - Aurora

Finding a regular expression library

| | Comments (2)

Introduction

Regular expressions is somewhat of an oxymoron. They are anything but regular, they are strange beasts that you can write wonderful things in that you won't understand fully yourself after you've written them. I blame (as most things) perl for most of the evolution of regular expressions, they ones contained in perl are very powerful, but with great power comes an awful lot of responsibility spaghetti. There is even a term perl regular expressions naming the particularly useful language set supported by perl, which given that perl is a text processing language foremost comes as no surprise. For the more python oriented programmer, python of course also offers a regular expressions library supporting much of the perl stuff, as well as more goodies.

One of the things that C++ lacks from many other languages is out of the box regular expressions. There are a couple of libraries out there that do provide regular expressions support, but they are all not part of the standard library (yet). Here is a brief look at what's out there and what I choose for my own code.

What are we looking for?

So I set out to look for a library that could serve as the backend to whatever regular expression interface I decided to write on my own. Now the ideal library for me should be a C API that you can compile separately and link in statically with your application. Simple and well documented with no (or no significant) external dependencies. Much like the popular compression library zlib. The actual work should be done in the library and my little frontend will merely be a frontend that handles pesky things like error checking, convenient hooks into std::string and memory management so that it will be easy to write tools code that can do regular expressions. I imagined my own interface to be a very light C++:ish wrapper around some opaque state. One other thing about the library. I require it to actually work as advertised and not leak memory!

The candidates

After doing some web-surfing research I came up with a couple of candidates:

  • John Maddock's boost regular expressions
  • John Maddock's pre-boost regular expressions
  • C++ TR1
  • Henry Spencer's regular expressions library
  • Microsoft Research's GRETA
  • ATL regular expressions
  • Perl Compatible Regular Expressions

Let's take a look at them one by one and see what they have to offer.

John Maddock's boost regular expressions

If you do a search for regular expressions on google you will probably find fairly high up a library written by John Maddock. This has later been incorporated into boost as one of the string libraries. My take on boost is that it represents much that is bad with C++. It's usually too cute/clever, implemented in templates and incur horrible build times. Trying to build boost itself (oh, you thought it was a header only library? Wrong! It only pretends to be. You get the worst of both worlds.) is an exercise in futility. At least some gray hairs has come out of it. As a side note, check out how you take a perfectly fast build system like jam and make it slower than scons, tada - you have boost jam. The funny thing is that none of the decisions they made for boost jam are particularly bad, just that they missed the most important one, speed. If it's slow, it doesn't matter.

Anyhow, the boost regular expression library is pretty much bogged down by the downsides of boost itself, the library pretty much does follow the paradigm and after a very brief moment of craziness I considered pulling boost down, but then I regain sanity. It's a shame since although I don't particularly fancy the actual API to this library, it does support both PERL regular expressions and the POSIX versions.

John Maddock's pre-boost regular expressions

There is actually a pre-boost version of the library on John Maddock's homepage which was quite interesting. It was supposed to do most of the boost version, but without any dependencies on boost itself. Whoho. After pulling this down and massaging it somewhat to build, it turns out that it fails under the criteria "do not leak memory". I could not get the library to not leak memory like crazy, which quickly made it a no-go. And this was trying with the simplified C api that was provided.

Stepping through the code gave me a headache. It's a classical case why type traits, templates and inheritance is a bad thing to mix. Looking for the multitude of allocations the library does, for this simple piece of code:

regcomp( state, "Hello", 0);
regfree(state);

Results in 27 (!) memory leaks.


Memory leaks detected (27)!
c:\aurora\shared\core\new.cpp(29):     24 bytes at 0x00356880
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00356a00
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00356d00
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00356e80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00357000
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00357180
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00357380
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00357500
c:\aurora\shared\core\new.cpp(44):     60 bytes at 0x00357680
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357800
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357980
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357b00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357c80
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357e00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00357f80
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358100
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358280
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358400
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358580
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358700
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358880
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358a00
c:\aurora\shared\core\new.cpp(44):      1 bytes at 0x00358b80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00358d00
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00358e80
c:\aurora\shared\core\new.cpp(29):      4 bytes at 0x00359000
c:\aurora\shared\core\new.cpp(44):      4 bytes at 0x00359900

The library has bindings to the standard C++ library allocator interfaces. And they frankly suck. Now, I'm not too crazy about the C interface with realloc, most of the times I prefer the malloc/free interfaces since they are so simple and you will have a hard time to mess them up. Implementing hooks in a third party library is pretty simple as long as you don't stray from the malloc interface. For reasons concerning the language and the standard libraries, the C++ committee decided upon the horrible (non-statful) allocator interface that's part of the C++ standard. It's definitely not pretty, and every time I'm forced to use that to hook up std::map or such I kind of cringe (and die) a little inside. Why anyone would voluntarily expose themselves to this interface is beyond me, when they have the malloc interface to choose from.

To cut a long story short, after a while I just gave up trying to patch the library from it's leaks. If it could not get the init/shutdown sequence right, I don't want to spend more time on getting it right. So I started to look at more alternatives below.

C++ TR1

For those geeks that have lived in the real world for the past couple of years, they might not know of a little thing called TR1 for the C++ language (Meyer's has an article about it). Some of the things in TR1 are headed for C++0x, which is due any time now (hey, they have only two more years to go!). Some compilers already have support for TR1 (MSVC, partial gcc). If your compiler already supports this, then it might be an option to investigate. Most of these things are just imports of a kind from Boost though, so they do suffer in some cases of over engineering in my opinion. But they do sidestep the whole issue of building and maintaining boost. You don't get away from the added compile times though.

Henry Spencer's regular expressions library

One other interesting option was Henry Spencer's regular expression library, which can be found here. The older version is much easier to build yourself if you are on a windows machine, so that's the one I tried out. The problem with the old version of the library is that it implements neither the perl regular expressions nor the POSIX ones. Instead it implements the older API from regexp (notice the last 'p') . The newer version of the library, 3.8, does implement the POSIX header regex but looking at the makefile (which pulls in a bourne shell script as well) for the library made my head hurt. Clearly, on a UNIX machine this might work, but on a windows machine I'm just lost. The older version did compile fairly easily and worked reasonably, but it didn't implement the "right" regular expressions so it kind of fell flat on it's face.

Microsoft Research's GRETA

During my research I found that Microsoft's made what looks like a very good implementation of regular expressions called GRETA. Unfortunately the only download link I could find for this library was one encumbered by the Microsoft Research License. Although the main page claims that there is a commercial version of the library, there is no actual working link to that magical version since it points at the gotdotnet site, which got shutdown.

ATL regular expressions

If you're willing to encumber yourself with dependencies on ATL (basically MFC light on templates) you can check out the CAtlRegExp class that handles regular expressions. I have not looked into this too much since it wasn't an option for me though.

Perl Compatible Regular Expressions

For last I've saved the library I went with, PCRE. The library was reasonably easy to compile, there is a step by step guide provided for non unix systems, as well as providing a reasonably clean C API with documentation. There is a C++ API to the api as well, but I largely avoided this since I prefer the C api for simplicity. After hooking up the memory redirection routines correctly I was off. No memory leaks detected yet in my unit tests. Although the step by step guide looks daunting for the non unix systems, it's pretty straight forward. You can also find the already patched source and visual studio build files here for the library.

In closing

It was surprisingly hard to find something that suited my tastes for a regular expressions library. For a brief moment I was tempted to try to write something myself while at a particular low point after all the memory leaks I encountered, but then I came to my senses. In the end I did spend too much time on this anyways, trying out different libraries and coming to a point where I actually could write a couple of unit tests for them took some time. The only thing that tarnishes this a bit is the fact that over the horizon is the C++0x with it's boost like regular expression library... but that's a couple of years in the future. By then perhaps C++ has died out and we're all comfortably programming in LISP.


Resources

  • The regexp proposal to the C++ standard.

2 Comments

C++ allocators can be stateful. I have used this in the past to create a wrapper facade over the Win32 Heap*() functions to allow a custom heap to be used with STL containers.

Jim Tilander Author Profile Page said:


Hi Chris,

Yes, that was a slightly unclear statement from me. Certainly the standard library allocators can have state, in fact if they couldn't then we could not allocate memory from them. So there is implied state from the memory system. In you example, if you wrap the Win32 Heap routines, there is implied state inside the Heap* functions. Eg:

	template 
	class Allocator
	{
	public:
		// ... lots of stuff omitted.
		
		pointer allocate(size_type n, const void * = 0)
		{
			return (pointer)HeapAlloc(GetProcessHeap(),0, n * sizeof(T));
		}

void deallocate(void* p, size_type)
{
HeapFree( GetProcessHead(), p );
}
};

The problem is that the state that the allocator has is a static state in a sense that it's shared between all instances of the Allocator, even all for all types (!) for T.

If we for some example should try to be clever (and this is usually where I blow my foot off :) we could attempt to store off all the memory allocations from certain containers in a specific heap. Now this heap can not be stored in an instance variable of the class. Not if you want to use the allocator with the standard library (in fact, if you only want to use it with your own classes, then you're fine, but why would you want to use a convoluted STL-like allocator in the first place?).

So say that we did the following:

	template 
	class Allocator
	{
		HANDLE mHeap;
	public:
		Allocator()
			: mHeap(HeapCreate(0, 1024*1024,0))
		{
		}
		
		~Allocator()
		{
			HeapDestroy(mHeap);
		}

pointer allocate(size_type n, const void * = 0)
{
return (pointer)HeapAlloc(mHeap,0, n * sizeof(T));
}

void deallocate(void* p, size_type)
{
HeapFree( mHeap, p );
}
};


This could work for non standard containers/client of the allocator class (although, hey I just hardcoded lots of stuff here, one could move it to the template arguments or into some kind of policy ... but this way lies madness).

But the point (longwinded as it might be) I'm trying to make is that for the STL containers, this is not acceptable. Allocators that is going to be used by the library can not have instance state.

	20.1.5.4

Implementations of containers described in this International Standard are permitted to assume that their
Allocator template parameter meets the following two additional requirements beyond those in Table 32.

— All instances of a given allocator type are required to be interchangeable and always compare equal to
each other.

The above coupled with the rule that allocator equality implies that memory allocated from one allocator can be deallocated from the other generally prevents you to have instance state in the allocator. Scott Meyers explains all of this in his book Effective STL (item 10 I think) in detail, far better than I can do here. The pathological case that he brings up is the list splicing and then deallocation of the nodes of the newly spliced list.

So yes, most probably your Win32 Heap* wrapper works just fine since it never stores instance state in the allocators. Also note that if one should break this rule, nothing bad might actually happen on your platform. It most certainly would compile. But on certain STL implementations you will probably run into some very strange runtime errors / corruptions... most of will be very hard to track down...

(I'm sorry if I made any errors in the above code, I never even bothered with trying to compile it even. Web coding! :)

Leave a comment