Irregular expression matching with the .NET stack
Posted on Tuesday, May 10, 2011.
Python introduced a named capturing form in its regular expressions:
(?P<x>...) is like
(...) but you refer to it as
\k'x' instead of
\1. Perl and .NET picked it up without the Pythonic P. They also let you say
(?'x'...) instead of
(?<x>...). (Perl: there's more than one way to do it!)
You can use these during the match just like any backreference; the value of
\k'x' is the most recent text matched by
(?<d>[0-9])+\k'd' matches any digit string in which the last two digits are the same.
.NET introduces a variant on the named capture,
(?<-x>...), which, during the match, deletes the last captured substring for x, exposing
the one that was there before. This implies that the named captures are stored on an internal per-name stack and it lets you do nested matching:
(?<d>[0-9])+(\k'd'(?<-d>))+ matches the longest prefix of an even-length palindrome of digits, so
There's a Perl addition that appears to have snuck into .NET undocumented, which is that you can say
(?(cond)then|else) or just
(?(cond)then) as a conditional expression: if
cond is true, then it tries to match
else (or the empty string). The condition
1 means that there was a match for
\1. In Perl the condition
<name> means that there was a match for the named back-reference
\k'name'. In .NET it appears that the syntax is just
name, by closer analogy with the numbered backreferences.
If you want full palindromes, you can insert a condition that there must be no matches left for
d on the stack, by using
d as the condition and
(?!) - which cannot match anything - as the true branch. The final thunderclap, then, is:
I haven't tried this, since I don't have easy access to .NET.
This post expands on the technique to match well-formed XML.