Archive for the ‘Code’ Category

Perl Regex Tip: Zero-width positive look-aheads are non-capturing.

Thursday, January 3rd, 2008

I used to write regexes all the livelong day in a language called WebQL. During that time I must’ve written somewhere between 5,000 and 10,000 lines of regular expressions to extract data from the various data feeds the crawlers would pull down every day. It’s all fun and games until an angry customer calls wondering why they missed one field of data out of the hundreds of thousands you extracted in the past 5 minutes. Hopefully this will help someone avoid that situation.

The Problem: Match pattern X until I see pattern X again.

In regex-ese the correct solution is (Fig 1):

/((?:some_pattern_goes_here).*?)(?=some_pattern_goes_here)/

My common pitfall is writing this as (Fig 2):

/((?:some_pattern_goes_here).*?)(?:some_pattern_goes_here)/

Now I’m going to bore you with some WebQL examples. WebQL is a language that puts SQL-like syntax around Perl 5 compatible regexes and lets you easily pull data from sources like web pages, files, and web services, then process the results like rows from a SQL select statement.

For a bit of compare and contrast I’ll show you SQL versus WebQL.

SQL looks like:

select 
    REGEX_EXTRACTION_STORED_PROCEDURE(data_to_extract)
from (select '1234 bar  junk junk junk junk junk 5678 bar junk 
         junk junk junk' as data_to_extract)

Where that REGEX… portion runs a function to execute a regular expression and return the capturing groups.

WebQL looks like:

select
    ITEM1
from pattern '(\d+\s+bar.*?)\d+\s+bar'
within inline '1234 bar  junk junk junk junk junk 5678 bar junk 
junk junk junk'

If you run the patterns from Fig 1 and Fig 2 against a feed of such records, you get different results. Fig 1 returns the expected 2 matches:

    1234 bar junk junk junk junk junk
    5678 bar junk junk junk junk junk

Fig 2. returns us only 1 match:

    1234 bar junk junk junk junk junk

The key is the zero-width positive look-ahead assertion: (?=)

(?=pattern)

A zero-width positive look-ahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&. http://www.perldoc.com/perl5.6/pod/perlre.html

And, what is this $& they speak of?

$&

The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: like & in some editors.) This variable is read-only and dynamically scoped to the current BLOCK. http://www.perldoc.com/perl5.6/pod/perlvar.html

That means if you use an expression that consumes text, a non-capturing group (?:) for example, whatever it matches becomes part of $&, and the next match starts where $& ends.

The expression in Fig 2 consumes the second pattern, so $& ends at #MATCH# and the next match attempt starts there. So it’s obvious from this that you can’t match and capture “5678 bar junk . . .”

'1234 bar junk junk junk junk junk 5678 bar#MATCH# junk junk junk junk '

Fig 1 ends $& at #MATCH#, before the second pattern, so the next match attempt starts at “5678 bar” and you will capture all you were hoping for:

'1234 bar junk junk junk junk junk#MATCH# 5678 bar junk junk junk junk '
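The difference is easy to reproduce outside WebQL. Below is a minimal sketch in Python, whose re module accepts the same (?:) and (?=) syntax; the three-record sample string and the names record, lookahead, consuming and with_end are assumptions made up for this illustration, not anything from WebQL.

```python
import re

# Hypothetical sample feed: three records, each starting with "<digits> bar".
data = "1234 bar junk junk 5678 bar junk junk 9012 bar junk"

# The "pattern X" that both starts a record and marks the start of the next one.
record = r"\d+\s+bar"

# Fig 1 style: the terminator is a look-ahead, so it is checked but not consumed.
lookahead = re.compile(rf"((?:{record}).*?)(?={record})", re.S)

# Fig 2 style: the terminator is a non-capturing group, which still consumes text.
consuming = re.compile(rf"((?:{record}).*?)(?:{record})", re.S)

# Variant of Fig 1 that also accepts end of input as a terminator.
with_end = re.compile(rf"((?:{record}).*?)(?={record}|\Z)", re.S)

lookahead_matches = lookahead.findall(data)
consuming_matches = consuming.findall(data)
with_end_matches = with_end.findall(data)

print(lookahead_matches)  # ['1234 bar junk junk ', '5678 bar junk junk ']
print(consuming_matches)  # ['1234 bar junk junk ']
print(with_end_matches)   # ['1234 bar junk junk ', '5678 bar junk junk ', '9012 bar junk']
```

Note that even the look-ahead version only returns a record when another record follows it, which is why the last one shows up only in the variant that also accepts the end of the input.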

In summary, zero-width positive and negative look-aheads are good. :)

I honestly don’t know who else is using WebQL, but they should remember that it has more abilities than are shown in its manual, because under the hood it has Perl 5 REs. Everyone working seriously with regular expressions should spend a week studying Perl REs.

Shard your data or you have failed at life.

Tuesday, December 18th, 2007

Protip: if you are starting a business whose success hinges on scalability of a data store, you had best figure out how to shard across N machines before you launch. Using a single instance of MySQL for the whole thing is a strong indicator that you have failed at life.

http://www.uncov.com/2007/12/17/winer-scoble-fail-in-tandem

This is why I believe that JRuby serving data via ActiveHibernate with some Hibernate Shards sounds so interesting. If you’re interested in doing this, you should get in touch with me: wes at brokenbuild dot com. Use the subject line “Just shard it!”
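For anyone who hasn’t seen it, the core of sharding across N machines is just a stable mapping from a key to one of N stores. Here is a minimal sketch in Python, assuming simple hash-modulo routing; the shard host names and the functions shard_for and host_for are hypothetical, and this is not how Hibernate Shards is implemented.

```python
import hashlib

# Hypothetical shard hosts, purely for illustration.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a stable hash: Python's built-in hash() for strings varies between processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def host_for(user_id: str) -> str:
    """Pick the host that owns a given user's rows."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


print(host_for("user-42"))
```

The catch with plain modulo routing is that changing the number of shards remaps most keys, which is one reason to work out the scheme before launch rather than after.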

SCM and Alternative Input Methods

Friday, November 23rd, 2007

We’re close to a day when we’ll have the Minority Report interface for managing our branches and merging.

We’ve gone from the wonderful 2D and whiteboard drawings/DOT graphs to some interesting 3D visualizations coupled with alternative interfaces:

Not that I think most people need 3D to get meaningful work accomplished, but I think it would make the lives of developers managing interesting branching and merging scenarios more interesting… and who doesn’t need an excuse to finally buy a Wii?

Everyone welcome JB Brown to the blog party.

Sunday, June 3rd, 2007

My friend JB Brown has started blogging with a first post about committing a patch to the Code Review Add-In for Team Foundation Server. I hope to see more great posts from him in the coming weeks.

Code duplication: Methods made out of ticky tacky.

Monday, March 12th, 2007

I was reading Ted Carnahan on ticky tacky and it reminded me of code duplication. Ever start feeling like you’re in a pre-fab sub-division when you’re looking at your code? You might just have some code duplication to take care of.

The phrase ticky tacky was coined by Malvina Reynolds in the song Little Boxes, made popular again by the show Weeds.

Ticky tacky: Sleazy or shoddy material used especially in the construction of look-alike tract houses

http://www.m-w.com/dictionary/ticky-tacky

Next time you’re refactoring your tests to reduce test code duplication hum along to this Elvis Costello cover of Little Boxes with some altered lyrics:

Little methods in the editor,

Little methods made of ticky tacky

Little methods in your classes,

Little methods all the same,

There’s a green one and a pink one,

And a blue one and a yellow one,

And they’re all made out of ticky tacky

And they all look just the same.

Do it: use ridiculously long test method names

Thursday, March 1st, 2007

At work I’m probably known by a couple of people for writing ridiculously long test names, so this post by Patrick Lightbody made me smile.

Unfortunately, Dan’s examples of test methods were rather weak - often the agiledox crowd uses really simplistic examples. So I thought I’d give some examples of some of my methods from HostedQA:

  • parsingHostHandlesHttpAndHttpsAsWellAsOptionalPorts
  • findingAnAccountByNameWorksAfterCreatingAnAccountOfTheSameName
  • creatingNewAppConfigReturnsSameIdAsTheIdSetOnTheAppConfigReference
  • findingAppConfigsForAProjectReturnsThemInAlphabeticalOrderByName

And my personal favorites (broken up with a space because they are so long):

  • updatingAppConfigWithoutResourcesThat UsedToBeAssociatedWithItCausesThoseReferencesToBeDeleted
  • deletingAPopulatedProjectCausesAllChild EntitiesToAlsoBeDeletedAndTheProjectIsCompletelyDeleted

Kind of annoyingly long methods? Absolutely. But not only, as Dan points out, is it easier to know what went wrong when a failure occurs, it is also almost brainless to implement these tests (and thereby design your application in the most direct manner possible).
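To show why the long name pays off when a test fails, here is a minimal sketch using Python’s unittest. The parse_host helper is a hypothetical stand-in written for this example; it is not code from HostedQA, and only the test name is borrowed from the first bullet above.

```python
import unittest
from urllib.parse import urlsplit


def parse_host(url):
    """Hypothetical helper: return (scheme, hostname, port) for an http(s) URL."""
    parts = urlsplit(url)
    default_ports = {"http": 80, "https": 443}
    # Fall back to the scheme's default port when the URL does not name one.
    return parts.scheme, parts.hostname, parts.port or default_ports[parts.scheme]


class ParseHostTest(unittest.TestCase):
    def test_parsing_host_handles_http_and_https_as_well_as_optional_ports(self):
        self.assertEqual(parse_host("http://example.com/a"), ("http", "example.com", 80))
        self.assertEqual(parse_host("https://example.com"), ("https", "example.com", 443))
        self.assertEqual(parse_host("http://example.com:8080/"), ("http", "example.com", 8080))


if __name__ == "__main__":
    unittest.main()
```

If this test breaks, the failure report already says which behavior regressed before you open the file.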

Maven2, Cargo and deploying to Jetty 6 with Commons Logging

Wednesday, February 21st, 2007

Java and its classloaders, always a fun time!

Ever been using Maven2, Cargo and deploying to Jetty 6 with Commons Logging? Probably not, but if you were, I bet you’d run into this:

org.apache.commons.logging.LogConfigurationException: Invalid class loader hierarchy.

This is a Type-II class loader error. I know, what does that mean?

Type-II: Assignment incompatibility of two classes loaded by distinct class loaders, even in case where the two classes are bit-wise identical.

Taxonomy of class loader problems encountered when using Jakarta Commons Logging

Sweet, huh? Here is the relevant section from the taxonomy. Just so you know, JCL means Jakarta Commons Logging, TCCL means Thread Context Class Loader and CFCL means Child First Class Loader.

JCL in child-first class loader trees

In child-first class loader trees, JCL suffers from problems of both Type-I and Type-II. We will start this section with an example reproducing a Type-II problem.

http://www.qos.ch/logging/classloader.jsp
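If the Type-II definition is hard to picture, here is a rough analogy in Python rather than Java: the same class source is “loaded” twice into separate namespaces, standing in for two class loaders that each define their own copy of a class. The names SOURCE, load_class, LogFromParent and LogFromChild are made up for this sketch, and it only mimics the symptom; it is not how JCL or the JVM works internally.

```python
# The same class source, defined twice in separate namespaces. Each call to
# load_class plays the role of a distinct class loader defining the class.
SOURCE = "class Log:\n    pass\n"


def load_class(source, name):
    """Execute the source in a fresh namespace and return the class it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


LogFromParent = load_class(SOURCE, "Log")
LogFromChild = load_class(SOURCE, "Log")

instance = LogFromChild()

# Identical source, yet two distinct classes that are not assignment compatible.
same_class = LogFromParent is LogFromChild
compatible = isinstance(instance, LogFromParent)

print(same_class, compatible)  # False False
```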

Need a quick fix? The quick fix is to use SLF4J as you can see recommended in this thread on the Cargo mailing list.

The Simple Logging Facade for Java or (SLF4J) is intended to serve as a simple facade for various logging APIs allowing to the end-user to plug in the desired implementation at deployment time. SLF4J also allows for a gradual migration path away from Jakarta Commons Logging (JCL).

http://www.slf4j.org

One last thing. If you’re running Jetty 6 and attempting to use Cargo you should know this:

Jetty in cargo at the moment is an embedded container, that is, the jetty classes have to be on the classpath at runtime. So, you can’t use the installer to install a version of jetty and then invoke a container on it.

Same thread from the Cargo mailing list as above.

Napoleon Dynamite speaks on the Abstract Factory Pattern

Tuesday, February 13th, 2007

Oh, that Napoleon Dynamite, such the software developer…

Don: Hey, Napoleon. What did you do yesterday?

Napoleon Dynamite: I told you! I spent it coding up sweet classes!

Don: Design Patterns, did you use any?

Napoleon Dynamite: Yes, like 50 of ‘em! Clients kept trying to instantiate my concrete classes, what the heck would you do in a situation like that?

Don: What kind of pattern did you use?

Napoleon Dynamite: A freakin’ abstract factory, what do you think?

Napoleon Dynamite
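For anyone who hasn’t met the pattern Napoleon is so worked up about, here is a minimal sketch in Python. Every class and function name below (Button, Checkbox, WidgetFactory, build_form and so on) is invented for the illustration; the point is only that the client asks a factory for a family of related objects and never names a concrete class.

```python
from abc import ABC, abstractmethod


class Button(ABC):
    @abstractmethod
    def render(self) -> str:
        """Return a text rendering of the button."""


class Checkbox(ABC):
    @abstractmethod
    def render(self) -> str:
        """Return a text rendering of the checkbox."""


class PlainButton(Button):
    def render(self) -> str:
        return "[ OK ]"


class PlainCheckbox(Checkbox):
    def render(self) -> str:
        return "[x]"


class FancyButton(Button):
    def render(self) -> str:
        return "(( OK ))"


class FancyCheckbox(Checkbox):
    def render(self) -> str:
        return "<x>"


class WidgetFactory(ABC):
    """Abstract factory: one creation method per product in the family."""

    @abstractmethod
    def create_button(self) -> Button:
        """Create the button for this family."""

    @abstractmethod
    def create_checkbox(self) -> Checkbox:
        """Create the checkbox for this family."""


class PlainFactory(WidgetFactory):
    def create_button(self) -> Button:
        return PlainButton()

    def create_checkbox(self) -> Checkbox:
        return PlainCheckbox()


class FancyFactory(WidgetFactory):
    def create_button(self) -> Button:
        return FancyButton()

    def create_checkbox(self) -> Checkbox:
        return FancyCheckbox()


def build_form(factory: WidgetFactory) -> str:
    # Client code names only the abstract types, never a concrete class.
    return f"{factory.create_checkbox().render()} {factory.create_button().render()}"


print(build_form(PlainFactory()))  # [x] [ OK ]
print(build_form(FancyFactory()))  # <x> (( OK ))
```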

Test Driven Development: Fake it till you make it.

Friday, February 9th, 2007

xkcd.com: Random Number

Need more info? Can Test-Driven Development and Programming By Intention play together?
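The “fake it” step can be sketched in a few lines of Python. The functions add_faked and add are hypothetical names for the same function at two points in time: first hard-coded to satisfy a single test, then generalized once a second test makes the fake untenable.

```python
def add_faked(a, b):
    # Step 1 ("fake it"): hard-code the one answer the first test expects.
    return 4


def add(a, b):
    # Step 2 ("make it"): a second test forces the real implementation.
    return a + b


# The first test passes against the fake.
assert add_faked(2, 2) == 4

# A second example exposes the fake, which is the cue to generalize.
assert add_faked(1, 2) != 3
assert add(1, 2) == 3
assert add(2, 2) == 4
```

The fake is only a problem if you stop there; the next failing test is what turns it into real code.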

Testivus - Testing for the rest of us.

Wednesday, February 7th, 2007

I am a believer in Testivus.

Developers need to take more responsibility for testing their code. But the majority of developers are not willing, nor ready, nor able to jump on the bandwagon of the more extreme and demanding developer testing movements such as Test Driven Development. Testivus is a proposed developer testing movement “for the rest of us”.

Developer Testing - Testing for the rest of us.

Below is the first draft of the Testivus Manifestivus.

The Testivus Manifestivus (First Draft)

-Less testing dogma, more testing karma

-Any tests are better than no tests

-Testing beats debugging

-Test first, during, or after – whatever works best for you

-If a method, technique, or tool, gives you more or better tests use it