Perl Regex Tip: Zero-width positive look-aheads are non-capturing.

I used to write regexes all the livelong day in a language called WebQL. During that time I must’ve written somewhere between 5,000 and 10,000 lines of regular expressions to extract data from the various data feeds the crawlers pulled down every day. It’s all fun and games until an angry customer calls wondering why they missed one field of data out of the hundreds of thousands you extracted in the past 5 minutes. Hopefully this will help someone avoid that situation.

The Problem: Match pattern X until I see pattern X again.

In regex-ese the correct solution is (Fig 1):

/((?:some_pattern_goes_here).*?)(?=some_pattern_goes_here)/

My common pitfall is writing this as (Fig 2):

/((?:some_pattern_goes_here).*?)(?:some_pattern_goes_here)/

Now I’m going to bore you with some WebQL examples. WebQL is a language that puts SQL-like syntax around Perl 5 compatible regexes. It lets you easily pull data from sources like web pages, files, and web services, and then process the results like rows from a SQL select statement.

For a bit of compare and contrast I’ll show you SQL versus WebQL.

SQL looks like:

select 
    REGEX_EXTRACTION_STORED_PROCEDURE(data_to_extract)
from (select '1234 bar  junk junk junk junk junk 5678 bar junk 
         junk junk junk' as data_to_extract)

Here the REGEX… portion stands for a function that executes a regular expression and returns its capturing groups.

WebQL looks like:

select
    ITEM1
from pattern '(\d+\s+bar.*?)\d+\s+bar'
within inline '1234 bar  junk junk junk junk junk 5678 bar junk 
junk junk junk'

If you run this with the pattern written in the Fig 1 style and then in the Fig 2 style, you get different results. Fig 1 returns the expected 2 matches:

    1234 bar junk junk junk junk junk
    5678 bar junk junk junk junk junk

Fig 2 returns only 1 match:

    1234 bar junk junk junk junk junk
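The difference can be reproduced outside WebQL. The following is a minimal sketch in Python, on the assumption that Python's `re` module treats these constructs the same way Perl 5 does. The sample input is hypothetical and contains three records: in this sketch the look-ahead after the second record still needs a following record to find, so a third one is included.

```python
import re

# Hypothetical sample input with three records (not taken from the post).
data = ("1234 bar junk junk junk junk junk "
        "5678 bar junk junk junk junk junk "
        "9012 bar junk junk junk junk junk")

# Fig 1 form: the trailing pattern is only asserted, never consumed.
fig1 = re.compile(r"(\d+\s+bar.*?)(?=\d+\s+bar)")
# Fig 2 form: the trailing pattern is consumed as part of the match.
fig2 = re.compile(r"(\d+\s+bar.*?)(?:\d+\s+bar)")

# Collect the first capturing group of every match, trimming whitespace.
fig1_items = [m.group(1).strip() for m in fig1.finditer(data)]
fig2_items = [m.group(1).strip() for m in fig2.finditer(data)]

print(fig1_items)  # two records: the 1234 one and the 5678 one
print(fig2_items)  # only the 1234 record
```

In the Fig 2 form the "5678 bar" prefix is swallowed by the first match, so the second record can never be matched from its start.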

The key is the zero-width positive look-ahead assertion: (?=)

(?=pattern)

A zero-width positive look-ahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&. http://www.perldoc.com/perl5.6/pod/perlre.html
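The perldoc example can be tried in Python as a rough sketch, treating `m.group(0)` as the counterpart of Perl's `$&` (an assumption for illustration; the sample text is made up).

```python
import re

text = "alpha\tbeta"  # hypothetical input: a word, a tab, another word

# With the look-ahead, the tab must be present but is not part of the match.
lookahead_match = re.search(r"\w+(?=\t)", text)
whole_match = lookahead_match.group(0)   # "alpha", no tab
match_end = lookahead_match.end()        # 5, the tab is still unconsumed

# Without the look-ahead, the tab is consumed and included in the match.
consumed_match = re.search(r"\w+\t", text).group(0)  # "alpha\t"

print(repr(whole_match), match_end, repr(consumed_match))
```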

And, what is this $& they speak of?

$&

The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: like & in some editors.) This variable is read-only and dynamically scoped to the current BLOCK. http://www.perldoc.com/perl5.6/pod/perlvar.html

That means that if you match the trailing pattern with a group that consumes text, a non-capturing (?:) for example, the text it matches becomes part of $&, and the next match starts after it.

The expression in Fig 2 leaves the end of $& at #MATCH#, and the next match attempt starts there. So it’s obvious that you can’t match and capture “5678 bar junk . . .”:

'1234 bar junk junk junk junk junk 5678 bar#MATCH# junk junk junk junk '

Fig 1 leaves the end of $& at #MATCH#, so the next match starts before the second record and you capture everything you were hoping for:

'1234 bar junk junk junk junk junk#MATCH# 5678 bar junk junk junk junk '
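To see where each form leaves the match position, here is a small Python sketch that inserts the #MATCH# marker at the end of the first match. It assumes Python's `m.end()` corresponds to the end of Perl's `$&`, and the `mark` helper is hypothetical. Note that with the lazy `.*?` the Fig 1 marker lands just after the space preceding "5678", since that space is consumed before the look-ahead succeeds.

```python
import re

data = "1234 bar junk junk junk junk junk 5678 bar junk junk junk junk "

fig1 = re.search(r"(\d+\s+bar.*?)(?=\d+\s+bar)", data)
fig2 = re.search(r"(\d+\s+bar.*?)(?:\d+\s+bar)", data)

def mark(s, pos):
    """Hypothetical helper: show where the next match attempt would begin."""
    return s[:pos] + "#MATCH#" + s[pos:]

fig1_marked = mark(data, fig1.end())
fig2_marked = mark(data, fig2.end())

print(fig1_marked)  # marker sits before "5678 bar"
print(fig2_marked)  # marker sits after "5678 bar"
```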

In summary, zero-width positive and negative look-aheads are good. :)
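The post names negative look-aheads without showing one. As a hedged sketch (this pattern is not from the original, and Python's `re` is assumed to behave like Perl 5 here), a negative look-ahead can express the same "match until I see the pattern again" idea by consuming one character at a time for as long as the next record does not start at that position. One side effect is that the last record is matched too, since nothing has to follow it.

```python
import re

data = "1234 bar junk junk junk junk junk 5678 bar junk junk junk junk "

# Match a record header, then keep consuming characters for as long as
# the next record header does NOT begin at the current position.
tempered = re.compile(r"\d+\s+bar(?:(?!\d+\s+bar).)*")

records = [m.group(0).strip() for m in tempered.finditer(data)]
print(records)
```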

I honestly don’t know who else is using WebQL, but they should remember that it has more abilities than are shown in its manual because under the hood it uses Perl 5 REs. Everyone working seriously with regular expressions should spend a week studying Perl REs.
