Stop Avoiding Regular Expressions Damn It

If you develop in Perl, PHP, Python, Ruby, JavaScript (or pretty much any other language with its roots in Unix) and don’t know regular expressions, you are missing a critical piece. And if you’re intentionally avoiding regular expressions, it’s like you’ve torn pages out of YOUR manual, but everyone else’s manual is complete.

XKCD: Regular Expressions

Regular expressions are powerful.

I’m sure you’ve heard this claim before. But you’ve also heard lots of negativity against regular expressions on Stack Overflow and elsewhere. There are two main reasons for this:

  1. Inappropriate use of regular expressions
  2. Lack of understanding and fear of regular expressions

Probably the most common example of an inappropriate use of a single regular expression is to validate an email address. It turns out that most of us have been disallowing valid email addresses with regular expressions all along and that some code is required to properly validate an email address. And even then it’s probably wrong.

Using a regular expression when a simple string function could be used is another example of inappropriate use. For example, say you want to validate an email address by just checking for the @ symbol (which is what I do now, see this). Using a regular expression, preg_match( '/@/', $email ), is overkill. There are no expressions there. We’re just searching for a character. It’s better practice to use false !== strpos( $email, '@' ) instead, comparing strictly against false because strpos() returns 0 when the @ is the first character.
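To see that the two checks agree, here is a minimal sketch. The function names has_at_sign_regex() and has_at_sign_strpos() are hypothetical helpers made up for this illustration, not part of PHP.

```php
<?php
// Hypothetical helpers for illustration only.
function has_at_sign_regex( $email ) {
    // preg_match() returns 1 on a match, 0 on no match.
    return 1 === preg_match( '/@/', $email );
}

function has_at_sign_strpos( $email ) {
    // strpos() returns the offset (possibly 0) or false, so compare strictly.
    return false !== strpos( $email, '@' );
}

foreach ( array( 'brad@example.com', '@example.com', 'example.com' ) as $email ) {
    // Both helpers should give the same answer for every input.
    var_dump( has_at_sign_regex( $email ) === has_at_sign_strpos( $email ) );
}
```

On PHP 8 and later, str_contains( $email, '@' ) expresses the same check directly.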

I say “better practice” rather than “better performance” because in this example we’re operating on a tiny email string, so the performance difference is insignificant. In this case, it doesn’t really matter which we use. However, if we were searching a very large string instead of an email address and doing it several times within a loop, there could be a significant improvement in performance. Maybe we’re searching several strings in HTML page sources for example.
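If you would rather measure than guess, a rough timing sketch like the one below can help. The haystack size and iteration count are arbitrary assumptions, and the gap you see may be large or small depending on your PHP version and PCRE settings.

```php
<?php
// Rough timing sketch; the sizes below are arbitrary assumptions.
$haystack   = str_repeat( 'x', 1000000 ) . '@';
$iterations = 200;

$start = microtime( true );
for ( $i = 0; $i < $iterations; $i++ ) {
    $found_strpos = false !== strpos( $haystack, '@' );
}
$strpos_seconds = microtime( true ) - $start;

$start = microtime( true );
for ( $i = 0; $i < $iterations; $i++ ) {
    $found_regex = 1 === preg_match( '/@/', $haystack );
}
$regex_seconds = microtime( true ) - $start;

// Print both timings so they can be compared on your own machine.
printf( "strpos: %.4fs, preg_match: %.4fs\n", $strpos_seconds, $regex_seconds );
```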

If you’re just matching a string constant, don’t use regular expressions.

Unfortunately some developers have misunderstood this to mean that regular expressions are terrible on performance and that they should go out of their way to avoid regular expressions. Some have even adopted it as a reason not to learn them. This is dead wrong.

Let’s add a couple of little requirements to our example. Let’s say we want to check if the @ symbol is the first character in the string and if it is, replace it with a # symbol. Maybe we’re converting tweet replies to hash tags for some reason. Using string manipulations might look something like this:

if ( 0 === strpos( $string, '@' ) ) {
    $string = '#' . substr( $string, 1 );
}

But this same thing can be achieved with a regular expression more concisely:

$string = preg_replace( '/^@/', '#', $string );
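A quick way to convince yourself the two versions behave the same is to wrap each in a function and compare the results. The function names here are hypothetical, chosen only for this sketch.

```php
<?php
// Hypothetical wrappers around the two snippets above, so they can be compared.
function reply_to_hashtag_strpos( $string ) {
    if ( 0 === strpos( $string, '@' ) ) {
        $string = '#' . substr( $string, 1 );
    }
    return $string;
}

function reply_to_hashtag_regex( $string ) {
    // ^ anchors the match to the start of the string.
    return preg_replace( '/^@/', '#', $string );
}

foreach ( array( '@bradt', 'hi @bradt', '' ) as $input ) {
    // Each comparison should report that both versions agree.
    var_dump( reply_to_hashtag_strpos( $input ) === reply_to_hashtag_regex( $input ) );
}
```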

If you haven’t learned regular expressions, you probably feel that this is less readable, but realize that this is more readable to those who have a grip on regular expressions.

Now let’s add another requirement to our example: the last character also has to be an @, and only then do we replace both @ symbols with a #. Our string manipulation code gets more complex and less readable:

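$length = strlen( $string );
// Require at least two characters so a lone "@" is left alone, as the regular expression below does.
if ( $length > 1 && 0 === strpos( $string, '@' ) && $length - 1 === strrpos( $string, '@' ) ) {
    // Take everything between the first and last character, hence $length - 2.
    $string = '#' . substr( $string, 1, $length - 2 ) . '#';
}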

Our regular expression changes only slightly:

$string = preg_replace( '/^@(.*)@$/', '#$1#', $string );
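Here is a small sketch of how that pattern behaves on a few inputs. The wrapper name wrap_in_hashes() is hypothetical, made up for this illustration.

```php
<?php
// Hypothetical wrapper around the preg_replace() call above.
function wrap_in_hashes( $string ) {
    // ^@ and @$ pin the two @ symbols to the ends of the string; (.*) captures
    // whatever sits between them so that $1 can put it back in the replacement.
    return preg_replace( '/^@(.*)@$/', '#$1#', $string );
}

var_dump( wrap_in_hashes( '@bradt@' ) ); // string(7) "#bradt#"
var_dump( wrap_in_hashes( '@bradt' ) );  // string(6) "@bradt"
var_dump( wrap_in_hashes( '@@' ) );      // string(2) "##"
var_dump( wrap_in_hashes( '@' ) );       // string(1) "@"
```

One thing to keep in mind: by default the dot does not match a newline, so a multi-line string would need the s modifier for this pattern to match.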

These are trivial examples, but hopefully they illustrate how the code can take a turn for the worse if you insist on sticking with string manipulation instead of taking up the challenge of writing a regular expression.

So, the next time you find yourself going out of your way to avoid using a regular expression, just Google for the solution, learn how it was done, tweak it, play with it, break it, struggle with it, and repeat. This is the best way to learn.

  • http://twitter.com/josefusbarnabas Joe Barnes

    +1 I love regexes and their declarative goodness. There are some problems I’ve solved that would have been horrendous without them.

  • ed

    “or pretty much any other language with its roots in Unix”
    Well, why just them? I code in C# and I don’t think it would have had any better ability for text parsing than its Perl regular expressions.

  • http://bradt.ca/ Brad Touesnard

    Good point. I didn’t mean to limit it to Unix-based languages, but just to state that regular expressions are a big component of those languages. I do recall VB6 having pretty terrible regular expressions implementation though. :)

  • mastfish

    Regexes have their uses, but the biggest issue is their unreadability.
    Trying to debug a regex that someone put in 2 years ago that doesn’t work quite correctly? A complete nightmare.

    Almost always I’d prefer something more verbose and readable.

  • http://bradt.ca/ Brad Touesnard

    I agree, some regexes are really difficult to understand, but that doesn’t mean that they were wrong to use them. If a regex is complex it should most definitely be documented why it is so complex and exactly what it accomplishes. Check out “Documenting the Why” http://ianlotinsky.wordpress.com/2013/03/06/document-the-why/

  • Paddy McCarthy

    I agree and would add that there are times when you need to use a regexp but the regexp is complex. That is the time to use verbose regexps in which white-space is ignored, (use \s instead); and comments can be inserted on a line after a hash, # character. There is also the ability to name groups so you can refer to what you capture as, for example, sort_code rather than an opaque matching group number (that changes if you insert another group before it in your regexp).

    These extra features are in Python and Perl, (not sure but probably in Ruby too).

  • Uniqorn

    The question mark in the last regular expression isn’t necessary.

  • http://bradt.ca/ Brad Touesnard

    Good catch! Updated.

  • hue948274986796

    As a result, the brackets aren’t either.

  • http://bradt.ca/ Brad Touesnard

    Actually, the parentheses are needed to match the subpattern, set the backreference, and use it in the replace.

  • hue948274986796

    Forgot we were talking about a replace here, my bad.

  • dsdaru

    Stopped reading after the first example, apparently the author does not know PHP and regular expressions. ;-)

    I just leave it here: if ($s[0].substr($s, -1) == '@@') { $s[0]=$s[strlen($s)-1]='#'; }

  • http://www.facebook.com/troels.villy.larsen Troels Larsen

    People have an (understandable) tendency to shy away from regexes. They are completely unreadable to the untrained eye. The issue is that depending on what type of software you write, you don’t use regexes regularly enough to actually remember the syntax. This is definitely my issue with them. All other code – even in languages (Ook! and Branfcuk aside) I’ve never used before can be read without consulting the manual. Regexes cannot. They simply make no sense without a manual.

    I have yet to encounter a problem that wasn’t easily solved by simple string manipulation. At my work, we have a standing rule: if you add a regex to any function, you now have code ownership of that function.

  • http://bradt.ca/ Brad Touesnard

    Ever hear of the Bus Factor? Probably not a good idea for only one person to understand part of your system.

  • Eddy

    preg_match() always results in buggy code and is a fairly significant sign that possibly insecure code is ahead (if insecure code hasn’t been run into already). preg_match() is on my “never use this PHP function” list and constantly raises red flags every single time I see it and review such code. preg_match() is a security vulnerability magnet.

    preg_replace(), on the other hand, is fine to use. I’ve found far fewer problems with use of that function and even use it myself. I suppose it helps that the use-cases for preg_replace() are very different from those of preg_match().

    For HTML parsing, there are several libraries out there. Simple HTML DOM, for example, is well-tested and works VERY well. It is way more resilient over time than regular expressions. Again, preg_match() results in buggy, easily broken code. For quick-n-dirty solutions that are run exactly one time and then deleted, preg_match() may or may not be easier than a HTML parsing library. But, for everything else, a real HTML parsing library is absolutely essential.

  • stop_regex_race

    RegEx is for robots.

    Totally not understandable, not readable, not logic in any way. Complete visual misusing mathematical operators. RegEx makes brain a shit. The man who invented it should be set on fire and thrown from the helicopter.
