research!rsc

Thoughts and links about programming, by

Irregular expression matching with the .NET stack
Posted on Tuesday, May 10, 2011.

Python introduced a named capturing form in its regular expressions: (?P<x>...) is like (...) but you refer to it as \k'x' instead of \1. Perl and .NET picked it up without the Pythonic P. They also let you say (?'x'...) instead of (?<x>...). (Perl: there's more than one way to do it!)

You can use these during the match just like any backreference; the value of \k'x' is the most recent text matched by (?<x>...). So (?<d>[0-9])+\k'd' matches any digit string in which the last two digits are the same.

.NET introduces a variant on the named capture, (?<-x>...), which, during the match, deletes the last captured substring for x, exposing the one that was there before. This implies that the named captures are stored on an internal per-name stack and it lets you do nested matching: (?<d>[0-9])+(\k'd'(?<-d>))+ matches the longest prefix of an even-length palindrome of digits, so 12345543 in 1234554399.

There's a Perl addition that appears to have snuck into .NET undocumented, which is that you can say (?(cond)then|else) or just (?(cond)then) as a conditional expression: if cond is true, then it tries to match then, else else (or the empty string). The condition 1 means that there was a match for \1. In Perl the condition <name> means that there was a match for the named back-reference \k'name'. In .NET it appears that the syntax is just name, by closer analogy with the numbered backreferences.

If you want full palindromes, you can insert a condition that there must be no matches left for d on the stack, by using d as the condition and (?!) - which cannot match anything - as the true branch. The final thunderclap, then, is:

(?<d>[0-9])+(\k'd'(?<-d>))+(?(d)(?!))

I haven't tried this, since I don't have easy access to .NET.

This post expands on the technique to match well-formed XML.

(Comments originally posted via Blogger.)

  • Wolfgang Kluge (November 15, 2011 2:25 PM) You can always use the index or the name in the .NET implementation. It's even valid to define a group with an index >0.

    (?<4>.)
    (?<5-4>a)
    \k<5>
    (?(4)b|c)

    I don't like it, but it's possible...