I learned from this article about ripgrep
that regular expression engines that implement backtracking can be very slow in some cases, but I don't really understand why. Could someone explain in simple terms why the following python snippet, given as an example in the article linked, is very slow?
>>> import re
>>> re.search('(a*)*c', 'a' * 30)
Basically, the issue is with the doubled repetition of the a
in the pattern. The part a*
allows for any number of a
's, while the surrounding (·)*
also allows any number of the contained pattern.
This allows for a huge number of possible ways to match the pattern against a string of a
's. Ignoring the b
for now, a string like aaaaa
(five a
's) could be matched as (aaaaa)
, (aaaa)(a)
, (aaa)(aa)
, (aaa)(a)(a)
, (aa)(aaa)
, (aa)(aa)(a)
... There's an exponential number of ways to match the string.
With the b
at the end, a backtracking engine will try one way of matching the a
's, realizes it doesn't find the b
, goes back one step, tries another way, realizes it can't find the b
, ... and takes a long time to exhaust all possible arrangements, after which it fails.
There are much better texts on this subject online than I could ever write. Go read these:
Runaway Regular Expressions: Catastrophic Backtracking by Jan Goyvaerts describes the issue and some ways to prevent it.
Regular Expression Matching Can Be Simple And Fast (but...) by Russ Cox also describes the issue, as well as implementing regexes as finite automata, without using backtracking and therefore immune to this problem. It also has pictures.
In practice, if you can, avoid patterns that allow for multiple ways to match a string. The example here, (a*)*c
, is obviously silly, since it's equivalent to a*c
which doesn't have the nested repetition.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments