Stop Avoiding Regular Expressions Damn It

If you develop in Perl, PHP, Python, Ruby, JavaScript (or pretty much any other language with its roots in Unix) and don’t know regular expressions, you are missing a critical piece. And if you’re intentionally avoiding regular expressions, it’s like you’ve torn pages out of YOUR manual, but everyone else’s manual is complete.

XKCD: Regular Expressions

Regular expressions are powerful.

I’m sure you’ve heard this claim before. But you’ve also heard lots of negativity against regular expressions on Stack Overflow and elsewhere. There are two main reasons for this:

  1. Inappropriate use of regular expressions
  2. Lack of understanding and fear of regular expressions

Probably the most common example of an inappropriate use of a single regular expression is to validate an email address. It turns out that most of us have been disallowing valid email addresses with regular expressions all along and that some code is required to properly validate an email address. And even then it’s probably wrong.
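To make the problem concrete, here is a minimal sketch. The pattern and the helper name naive_email_check() are invented for this illustration (they are not taken from any particular library), but they resemble the kind of one-liner that tends to get copied around:

```php
<?php
// Hypothetical "typical" email pattern, made up for this illustration.
// It allows only letters, digits, dots and underscores before the @,
// and a top-level domain of 2 to 4 letters.
function naive_email_check( $email ) {
    return 1 === preg_match( '/^[a-z0-9._]+@[a-z0-9.]+\.[a-z]{2,4}$/i', $email );
}

var_dump( naive_email_check( 'plain@example.com' ) );      // bool(true)
var_dump( naive_email_check( 'user+tag@example.com' ) );   // bool(false): "+" is legal but rejected
var_dump( naive_email_check( 'someone@example.museum' ) ); // bool(false): TLD is longer than 4 letters
```

Both rejected addresses are syntactically valid, which is exactly how a tidy-looking single pattern ends up turning real users away.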

Using a regular expression when a simple string function would do is another example of inappropriate use. For example, say you want to validate an email address by just checking for the @ symbol (which is what I do now). Using a regular expression such as preg_match( '/@/', $email ) is overkill. There is no real expression there; we’re just searching for a character. It’s better practice to use false !== strpos( $email, '@' ) instead. The comparison against false matters because strpos() returns 0, which is falsy, when the @ is the first character.
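A minimal sketch of the plain string check follows. The helper name has_at_sign() is hypothetical and only there to keep the example tidy; the one detail worth noticing is the explicit comparison against false.

```php
<?php
// Hypothetical helper: true if $email contains an @ anywhere.
function has_at_sign( $email ) {
    // strpos() returns the position (possibly 0) or false when nothing is found,
    // so compare against false explicitly instead of relying on truthiness.
    return false !== strpos( $email, '@' );
}

var_dump( has_at_sign( 'user@example.com' ) ); // bool(true)
var_dump( has_at_sign( '@example.com' ) );     // bool(true), even though the position is 0
var_dump( has_at_sign( 'example.com' ) );      // bool(false)

// The regular expression version gives the same answer with more machinery.
var_dump( 1 === preg_match( '/@/', 'user@example.com' ) ); // bool(true)
```

On PHP 8 and later, str_contains( $email, '@' ) expresses the same check without the false comparison.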

I say “better practice” rather than “better performance” because in this example we’re operating on a tiny email string, so the performance difference is insignificant. In this case, it doesn’t really matter which we use. However, if we were searching a very large string instead of an email address and doing it several times within a loop, there could be a significant improvement in performance. Maybe we’re searching for several strings in HTML page sources, for example.
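If you would rather measure than guess, a rough timing sketch like the one below is enough to compare the two on your own machine. The haystack contents, its size and the number of runs are arbitrary assumptions, and the numbers will vary by PHP version and hardware, so treat the output as a hint rather than a benchmark.

```php
<?php
// Build a large haystack with the @ at the very end (worst case for a scan).
// The size and repeat count are arbitrary choices for this sketch.
$haystack = str_repeat( '<p>lorem ipsum dolor sit amet</p>', 100000 ) . '@';
$runs     = 200;

// Time the plain string search.
$start        = microtime( true );
$found_strpos = 0;
for ( $i = 0; $i < $runs; $i++ ) {
    if ( false !== strpos( $haystack, '@' ) ) {
        $found_strpos++;
    }
}
$strpos_seconds = microtime( true ) - $start;

// Time the regular expression search for the same constant.
$start       = microtime( true );
$found_regex = 0;
for ( $i = 0; $i < $runs; $i++ ) {
    if ( 1 === preg_match( '/@/', $haystack ) ) {
        $found_regex++;
    }
}
$regex_seconds = microtime( true ) - $start;

printf( "strpos:     %.4f s (%d hits)\n", $strpos_seconds, $found_strpos );
printf( "preg_match: %.4f s (%d hits)\n", $regex_seconds, $found_regex );
```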

If you’re just matching a string constant, don’t use regular expressions.

Unfortunately some developers have misunderstood this to mean that regular expressions are terrible on performance and that they should go out of their way to avoid regular expressions. Some have even adopted it as a reason not to learn them. This is dead wrong.

Let’s add a couple of little requirements to our example. Let’s say we want to check if the @ symbol is the first character in the string and, if it is, replace it with a # symbol. Maybe we’re converting tweet replies to hashtags for some reason. Using string manipulation, it might look something like this:

if ( 0 === strpos( $string, '@' ) ) {
    $string = '#' . substr( $string, 1 );
}

But this same thing can be achieved with a regular expression more concisely:

$string = preg_replace( '/^@/', '#', $string );
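As a quick sanity check that the two approaches really do the same thing, here is a small sketch that runs both on a few sample inputs. The function names are hypothetical wrappers added for the comparison.

```php
<?php
// Hypothetical helpers wrapping the two approaches from the text.
function reply_to_hashtag_strpos( $string ) {
    if ( 0 === strpos( $string, '@' ) ) {
        $string = '#' . substr( $string, 1 );
    }
    return $string;
}

function reply_to_hashtag_regex( $string ) {
    // ^ anchors the match to the start of the string,
    // so only a leading @ is replaced.
    return preg_replace( '/^@/', '#', $string );
}

foreach ( array( '@someone', 'hi @someone', '@' ) as $input ) {
    // Both versions should agree on every input.
    var_dump( reply_to_hashtag_strpos( $input ) === reply_to_hashtag_regex( $input ) ); // bool(true)
}

echo reply_to_hashtag_regex( '@someone' ), "\n";    // #someone
echo reply_to_hashtag_regex( 'hi @someone' ), "\n"; // hi @someone
```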

If you haven’t learned regular expressions, you probably feel that this is less readable, but realize that this is more readable to those who have a grip on regular expressions.

Now let’s add another requirement to our example: the last character also has to be an @, and only then do we replace both @ symbols with a #. Our string manipulation code gets more complex and less readable:

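$length = strlen( $string );
// Need at least two characters so the leading and trailing @ are distinct.
if ( $length >= 2 && 0 === strpos( $string, '@' ) && $length - 1 === strrpos( $string, '@' ) ) {
    // Keep the middle ($length - 2 characters) and swap both @ for #.
    $string = '#' . substr( $string, 1, $length - 2 ) . '#';
}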

Our regular expression changes only slightly:

$string = preg_replace( '/^@(.*)@$/', '#$1#', $string );
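For anyone still getting comfortable with the syntax, here is the same pattern taken apart piece by piece, with a few sample inputs. The wrapper name wrap_to_hashes() is hypothetical.

```php
<?php
// Hypothetical helper around the regular expression from the text.
function wrap_to_hashes( $string ) {
    // ^@    a literal @ at the very start
    // (.*)  capture everything in between (possibly nothing)
    // @$    a literal @ at the very end
    // $1 in the replacement refers to the captured middle part.
    return preg_replace( '/^@(.*)@$/', '#$1#', $string );
}

echo wrap_to_hashes( '@topic@' ), "\n"; // #topic#
echo wrap_to_hashes( '@topic' ), "\n";  // @topic (no trailing @, left alone)
echo wrap_to_hashes( 'topic@' ), "\n";  // topic@ (no leading @, left alone)
echo wrap_to_hashes( '@a@b@' ), "\n";   // #a@b# (the inner @ is untouched)
```

One caveat if your input can span multiple lines: by default the dot does not match a newline and $ also matches just before a trailing newline, so you may want the s modifier or the \z anchor in that case.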

These are trivial examples, but hopefully they illustrate how the code can take a turn for the worse if you insist on sticking with string manipulation instead of taking up the challenge of writing a regular expression.

So, the next time you find yourself going out of your way to avoid using a regular expression, just Google for the solution, learn how it was done, tweak it, play with it, break it, struggle with it, and repeat. This is the best way to learn.