Lowercase z FTW: luluisme

Jul 08, 2009 12:39

So, I do a lot of regular expression work in my job. And I ran into an interesting quandary recently.

I was presented with the following request: find a way to add a bit of text to the end of a file.
I think... lalala... that's pretty easy:
regex: (?s)(.)$
which should mean: turn on dot-all mode, and match one of any character including newline just before the end of the string.
replacement: $1\nFoobar
which means put that character we matched back and follow it with a newline and my text.

But instead the newline and Foobar text got added twice. What the dilly?

Now $ can mean newline or end of string, but it wasn't matching on any other newlines. Puzzling.
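
For anyone who wants to see the double add happen, here is a minimal sketch. It assumes Python's re module as a stand-in, since the tool I was using isn't named here; any flavor where $ behaves the same way should show the same thing. Python writes the backreference as \1 rather than $1, and the test string is just abc plus a trailing newline.

```python
import re

# Stand-in for the file contents: three characters and a trailing newline.
text = "abc\n"

# Same pattern as above; Python's re writes the backreference as \1, not $1.
doubled = re.sub(r"(?s)(.)$", r"\1\nFoobar", text)

print(repr(doubled))  # 'abc\nFoobar\n\nFoobar'
```

Foobar shows up twice, once after the c and once after the final newline.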

I poked around online for some other string ending boundary options to try and ran into \z and \Z.
From Mastering Regular Expressions by Jeffrey Friedl:
\Z Always matches like normal $
\z Always matches only at end of string

So, I tried the lower case z option, and it worked.

Here's what I think happened:
An octal dump of a test file shows the characters:
$ 0000000 a b c \n
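I think that $ was matching in two places: just before the newline at the end of the string, and at the very end of the string itself. So the regex matched the c first (with $ sitting just before the final newline), then matched the newline (with $ at the true end), and each match got its own replacement.
Whereas \z, which only matches at the very end of the string, was therefore only matching once.

That would also square with it not matching the other newlines: unless multi-line mode is on, $ only gets to match before a newline when that newline is the last character in the string. The working regex was: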
(?s)(.)\z
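
To check where each anchor actually matches, here is a rough sketch, again assuming Python's re module as a stand-in for whatever tool you use. One wrinkle: Python spells the strict end-of-string anchor as an uppercase \Z, and that behaves like the lowercase \z in this post, not like the \Z in Friedl's table. The helper name match_spans is made up for the example.

```python
import re

text = "abc\n"

def match_spans(pattern, s):
    """Return (start, end) for every non-overlapping match of pattern in s."""
    return [m.span() for m in re.finditer(pattern, s)]

# $ matches before the final newline AND at the very end: two matches.
dollar_spans = match_spans(r"(?s)(.)$", text)

# Python's \Z matches only at the very end (the \z of this post): one match.
strict_spans = match_spans(r"(?s)(.)\Z", text)

# The working replacement, with the strict anchor.
fixed = re.sub(r"(?s)(.)\Z", r"\1\nFoobar", text)

print(dollar_spans)  # [(2, 3), (3, 4)]
print(strict_spans)  # [(3, 4)]
print(repr(fixed))   # 'abc\n\nFoobar'
```

Since the matched newline is put back and then followed by another one, Foobar lands after a blank line. That falls out of the replacement as written.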
