Glob Matching Can Be Simple And Fast Too
Posted on Monday, April 24, 2017.
Here’s a straightforward benchmark.
Time how long it takes to run ls
(a*
)nb
in a directory with a single file named a
100,
compared to running ls
|
grep
(a.*
)nb
.
Superscripts denote string repetition and parentheses are for grouping only,
so that when n is 3, we’re running ls
a*a*a*b
in a directory containing the single file aaa
...aaa
(100 a
’s),
compared against ls
|
grep
a.*a.*a.*b
in the same directory.
For n = 8, ls
takes 7.19 minutes while ls
|
grep
runs in 1.56 milliseconds,
making it 276,538X faster.
If you’ve read my 2007 article “Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...),”
those graphs may look familiar.
Clearly the ls
command is using an exponential pattern-matching algorithm,
while the ls
|
grep
command is using a nearly linear one.
Shells
In fact it’s the shell that evaluates the glob pattern in the first command, not ls
,
so let’s repeat the experiment with a variety of shells.
All the tests were run on an HP Z440 workstation with 3.5 GHz Intel Xeon E5-1650 v3 processors
running Ubuntu 14.04.
a* )nb
against
a 100 in shells
|
Basically, the same thing is going on here as in my regular expression article: backtracking was an obvious implementation strategy, so most implementations use it, causing the performance cliffs.
The exception seems to be the original Berkeley csh, which runs in linear time (more precisely, time linear in n).
Looking at the source code, it doesn’t attempt to perform glob expansion itself.
Instead it calls the C library implementation glob(3),
which runs in linear time, at least on this Linux system.
So maybe we should look at programming language implementations too.
Programming Languages
Most programming languages provide some kind of glob expansion,
like C’s glob
.
Let’s repeat the experiment in a variety of different programming languages:
a* )nb
against
a 100 in programming languages
|
Perhaps the most interesting fact evident in the graph
is that GNU glibc, the C library used on Linux systems,
has a linear-time glob
implementation, but
BSD libc, the C library used on BSD and macOS systems,
has an exponential-time implementation.
PHP is not shown in the graph, because its
glob function simply invokes
the host C library’s glob(3),
so that it runs in linear time on Linux
and in exponential time on non-Linux systems.
(I have not tested what happens on Windows.)
All the languages shown in the graph, however,
implement glob matching without using the host C library,
so the results should not vary by host operating system.
FTP Servers
It’s not just shells and programming languages that implement glob matching. Although not mandated by the FTP RFC, most FTP servers allow glob patterns as the argument to commands like LIST and STAT. Let’s repeat the experiment with a variety of FTP server implementations:
a* )nb
against
a 100 in FTP servers
|
On Linux, Pure-FTPd would probably have run for a hundred or more seconds for n = 7, but instead it died and hung up the connection after 17 seconds. All the tests were run on the same Linux system as before, except tnftpd, which was run on a mid-2015 MacBook Pro with a 2.8 GHz Intel Core i7 processor running macOS 10.12.4. On that Mac system, tnftpd (which can only be enabled using the command-line, so most people don’t run it) has no such time limit: even if a client times out and hangs up (as the command-line ftp client does after 30 seconds), the server side still consumes CPU until the full pattern match finishes.
The netkit ftpd runs quickly on Linux because it relies on the host
C library’s glob
function.
If run on BSD, the netkit ftpd would take exponential time.
ProFTPD ships a copy of the glibc glob
,
so it should run quickly even on BSD systems.
Ironically, Pure-FTPd and tnftpd take exponential time on Linux
because they ship a copy of the BSD glob
function.
Presumably they do this to avoid assuming that the host C library is bug-free,
but, at least in this one case,
the host C library is better than the one they ship.
Obviously not many servers have a file named a
100,
but the pattern can be adapted to shorter, less unusual names.
This gives a denial of service attack against some anonymous FTP servers.
It appears that tnftpd is derived from the standard FreeBSD ftpd,
so BSD systems may be affected as well.
FTP servers and C libraries have a long history of problems with glob patterns, including buffer and heap overflows causing crashes or even remote code execution. But here let’s focus on CPU exhaustion issues around pattern matching.
In 2001, CVE-2001-1501 was issued for a vulnerability in ProFTPD 1.2.1, because it could run for a very long time finding and recording the very many matches for a pattern like:
ls */../*/../*/../*/../*/../*/../*/../*/../*/../*
In response, most glob
implementations added a GLOB_LIMIT
flag that can be used to limit the number of matches returned, controlling both the CPU and the memory usage for such a pattern.
Unfortunately, a pattern like that can cause a lot of effort without finding any matches. In 2010, CVE-2010-2632 was issued for a variant with no matches:
ls */../*/../*/../*/../*/../*/../*/../*/../*blablahaha
In response, most glob
implementations expanded GLOB_LIMIT
to count directory read and file stat operations in addition to matches.
In 2015, CVE-2015-5917 was issued for the same vulnerability in OS X 10.10.5’s FTP server (reported in 2013), which had its own copy of the glob
implementation and had not been patched in 2010.
There have also been similar problems around FTP servers implementing brace expansion,
such as CVE-2011-0418 (details).
Unfortunately, none of these protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:
ftp> dir ************************************************ ************************************************ ************************************************ **.*
The apparent “fix” used in some implementations is
to collapse multiple adjacent stars into a single star.
That solves one test case, but not the general problem.
In particular it doesn’t help a*a*a*a*a*a*a*a*b
.
The next section discusses ways to implement glob patterns efficiently, but if you have an anonymous FTP server accepting glob patterns, there are two more fundamental questions to ask: Do you really need to run an anonymous FTP server anymore? Does it really need to accept glob patterns?
At the very least, most FTP servers should probably reject
glob patterns with more than, say, 3 stars.
Note that glob patterns are only provided as a convenience for
command-line FTP users.
Graphical FTP clients typically use the
MLST
and MLSD
commands,
which have a machine-readable output format and are defined not to
interpret their arguments as glob patterns.
Implementation Details
How do these implementations work? What’s going on here?
The obvious implementation of glob pattern matching against a single path element is to walk the pattern and the name together, matching letters or wildcards in the pattern to letters in the name. If the walk reaches the end of the pattern at the same time as the end of the name, they match.
Here’s a basic outline of that algorithm, in Go:
func match(pattern, name string) bool { px := 0 nx := 0 for px < len(pattern) || nx < len(name) { if px < len(pattern) { c := pattern[px] switch c { default: // ordinary character if nx < len(name) && name[nx] == c { px++ nx++ continue } case '?': // single-character wildcard if nx < len(name) { px++ nx++ continue } case '*': // zero-or-more-character wildcard ... } } // Mismatch. return false } // Matched all of pattern to all of name. Success. return true }
In this code, px
is the index into the pattern
and nx
the index into the name.
The loop runs until both pattern and name are exhausted,
meaning px
==
len(pattern)
and nx
==
len(name)
.
If it does happen that both pattern and name are exhausted at the same time,
then we have a match (the final return
true
).
Inside the loop, the code must make progress (advance px
or nx
or both) and continue
on each iteration.
If not, the control flow ends up at the bottom of the loop body
(marked //
Mismatch.
) and reports no match.
The only difficulty is in the implementation of the variable-length wildcard pattern (case
'*'
):
how much of the name should that match?
In general, the code must try all possibilities from matching nothing to matching the entire remainder of the string.
The obvious way to do that is with recursion:
func match(pattern, name string) bool {
px := 0
nx := 0
for px < len(pattern) || nx < len(name) {
if px < len(pattern) {
c := pattern[px]
switch c {
default: // ordinary character
if nx < len(name) && name[nx] == c {
px++
nx++
continue
}
case '?': // single-character wildcard
if nx < len(name) {
px++
nx++
continue
}
case '*': // zero-or-more-character wildcard
// Try to match at nx, nx+1, and so on.
for ; nx <= len(name); nx++ {
if match(pattern[px+1:], name[nx:]) {
return true
}
}
}
}
// Mismatch.
return false
}
// Matched all of pattern to all of name. Success.
return true
}
If there are e stars that can each potentially end their matches at n positions in the string,
that gives ne possibilities to explore, leading to the exponential run times observed earlier.
However, most of these possibilities are not worth exploring.
Because *
is the only variable-sized
wildcard in the pattern syntax and therefore the only
source of possible backtracking, there’s an even easier implementation:
don’t backtrack to an earlier star.
Consider the pattern a*bx*cy*d
.
If we end the first star at the first bx
, we have the rest of the name to find the cy
and then the d
.
Using any later bx
can only remove choices for cy
and d
; it cannot lead to a successful match
that using the first bx
missed.
So we should implement
a*bx*cy*d
without any second-guessing,
as
“find a leading a
, then find the earliest bx
after that, then find the earliest cy
after that, then find a trailing d
, or else give up.”
We can implement this algorithm by replacing the handling of the star wildcard in our Go program:
func match(pattern, name string) bool {
px := 0
nx := 0
nextPx := 0
nextNx := 0
for px < len(pattern) || nx < len(name) {
if px < len(pattern) {
c := pattern[px]
switch c {
default: // ordinary character
if nx < len(name) && name[nx] == c {
px++
nx++
continue
}
case '?': // single-character wildcard
if nx < len(name) {
px++
nx++
continue
}
case '*': // zero-or-more-character wildcard
// Try to match at nx.
// If that doesn't work out,
// restart at nx+1 next.
nextPx = px
nextNx = nx + 1
px++
continue
}
}
// Mismatch. Maybe restart.
if 0 < nextNx && nextNx <= len(name) {
px = nextPx
nx = nextNx
continue
}
return false
}
// Matched all of pattern to all of name. Success.
return true
}
Each time the code encounters a star, it implements the repeated trials needed
for that star by recording in nextPx
, nextNx
where to restart the search
if the current match fails.
Each subsequent star encountered overwrites the restart information
for the previous star, in effect locking in the choice made for the
previous star.
If you’d like to experiment, I’ve posted both of these programs and a test harness.
An alternate implementation of the algorithm would be to split the pattern on stars and then consider each of the star-separated subpatterns in turn, special-casing the first subpattern and the last, which must be anchored at the start and end of the name, respectively. Go’s src/path/filepath/match.go is an example of this implementation.
Go’s filepath.Match
(used by filepath.Glob
),
glibc’s glob
, and vsftpd’s pattern matcher
all use this algorithm.
A more straightforward approach is to translate the glob pattern to
a regular expression and then invoke a linear-time regular expression match.
Plan 9’s ftpd does this, as does the ls
|
grep
example above.
(Python also does this, but then it runs the regular expression
with an exponential-time matching engine.)
I have not looked at the other linear-time implementations to see what they do,
but I expect they all use one of these two approaches.
Additional Reading
This post is an elaboration of an informal 2012 Google+ post
showing that most shells used exponential-time glob expansion.
At the time, Tom Duff, the author of Plan 9’s rc shell, commented that,
“I can confirm that rc gets it wrong.
My excuse, feeble as it is, is that doing it that way meant that the code took 10 minutes to write,
but it took 20 years for someone to notice the problem.
(That’s 10 ‘programmer minutes’, i.e. less than a day.)”
I agree that’s a reasonable decision for a shell.
In contrast, a language library routine, not to mention a network server,
today needs to be robust against worst-case inputs that might be controlled
by remote attackers,
but nearly all of the code in question predates that kind of concern.
I didn’t realize the connection to FTP servers until I started doing
additional research for this post and came across a reference to
CVE-2010-2632 in FreeBSD’s glob
implementation.
Dave Presotto, the author of Plan 9’s ftpd, avoided rc’s glob implementation not because of the algorithmic complexity but to make sure that long strings were handled safely (using a Plan 9 library for handling long strings in C). He wrote this comment at the top of Plan 9’s /sys/src/cmd/ip/glob.c, which converts glob expressions into regular expressions:
/* * I wrote this glob so that there would be no limit * on element or path size. The one in rc is probably * better, certainly faster. - presotto */
As it turns out, this is untrue: the one in rc is certainly slower, at least in terms of worst-case asymptotics.
If you liked this post, you may also like to browse
Nelson Elhage’s Accidentally Quadratic collection.
(Of course, here the glob pattern matching is accidentally exponential.)
Updates, as of April 28, 2017.
NetBSD has an updated glob pending for the next release.
Perl has an updated glob pending for the next release.
Pure-FTPd 1.0.46 has added a check that patterns must not have more than three stars.
Thanks to js2 on Hacker News for pointing out that Python translates globs to regular expressions.