Why is regular expression matching so slow

Rastapopoulos

I learned from this article about ripgrep that regular expression engines that implement backtracking can be very slow in some cases, but I don't really understand why. Could someone explain in simple terms why the following python snippet, given as an example in the article linked, is very slow?

>>> import re
>>> re.search('(a*)*c', 'a' * 30)
ilkkachu

Basically, the issue is with the doubled repetition of the a in the pattern. The part a* allows for any number of a's, while the surrounding (·)* also allows any number of the contained pattern.

This allows for a huge number of possible ways to match the pattern against a string of a's. Ignoring the b for now, a string like aaaaa (five a's) could be matched as (aaaaa), (aaaa)(a), (aaa)(aa), (aaa)(a)(a), (aa)(aaa), (aa)(aa)(a) ... There's an exponential number of ways to match the string.

With the b at the end, a backtracking engine will try one way of matching the a's, realizes it doesn't find the b, goes back one step, tries another way, realizes it can't find the b, ... and takes a long time to exhaust all possible arrangements, after which it fails.


There are much better texts on this subject online than I could ever write. Go read these:


In practice, if you can, avoid patterns that allow for multiple ways to match a string. The example here, (a*)*c, is obviously silly, since it's equivalent to a*c which doesn't have the nested repetition.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related