Why is regular expression matching so slow

Rastapopoulos

I learned from this article about ripgrep that regular expression engines that implement backtracking can be very slow in some cases, but I don't really understand why. Could someone explain in simple terms why the following Python snippet, given as an example in the linked article, is very slow?

>>> import re
>>> re.search('(a*)*c', 'a' * 30)

ilkkachu

Basically, the issue is with the doubled repetition of the a in the pattern. The part a* allows for any number of a's, while the surrounding (·)* also allows any number of the contained pattern.

This allows for a huge number of possible ways to match the pattern against a string of a's. Ignoring the c for now, a string like aaaaa (five a's) could be matched as (aaaaa), (aaaa)(a), (aaa)(aa), (aaa)(a)(a), (aa)(aaa), (aa)(aa)(a) ... The number of ways to match the string is exponential in its length: a run of n a's can be split into non-empty groups in 2^(n-1) ways.
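To make that count concrete, here is a small sketch that enumerates the ways a run of a's can be cut into non-empty groups. The helper name splits is made up for this illustration, and it only models the groupings, not what any particular regex engine does internally. Longer first groups are tried first, which mirrors the greedy order listed above.

```python
def splits(s):
    """Yield every way to cut s into a sequence of non-empty pieces,
    trying the longest first piece first (like a greedy a*)."""
    if not s:
        yield []
        return
    for i in range(len(s), 0, -1):
        for rest in splits(s[i:]):
            yield [s[:i]] + rest

# Print the groupings for five a's, then the count for a few lengths.
for pieces in splits('a' * 5):
    print(''.join('(' + p + ')' for p in pieces))

for n in range(1, 11):
    print(n, sum(1 for _ in splits('a' * n)))
```

The count doubles with every extra a, so 30 a's give 2^29 groupings, which is over 500 million.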

With the c at the end, a backtracking engine tries one way of matching the a's, realizes it can't find the c, goes back one step, tries another way, realizes it still can't find the c, ... and takes a long time to exhaust all possible arrangements, after which it fails.
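One way to see this growth without waiting for the 30-character case is to time the same search on shorter inputs. This is only a rough sketch: the helper time_search is made up here, and the actual timings depend on the machine and the Python version, so no particular numbers are promised. The elapsed time should climb steeply as n grows.

```python
import re
import time

pattern = re.compile('(a*)*c')

def time_search(n):
    """Search a string of n a's and return (match result, seconds taken)."""
    start = time.perf_counter()
    result = pattern.search('a' * n)
    return result, time.perf_counter() - start

# Kept small on purpose; each extra 'a' makes the failing search much slower.
for n in range(10, 19, 2):
    result, elapsed = time_search(n)
    print(f"n={n:2d}  match={result}  {elapsed:.4f}s")
```

Every search returns None, since there is no c in the input; the engine just needs longer and longer to find that out.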

There are much better texts on this subject online than I could ever write. Go read these:

Runaway Regular Expressions: Catastrophic Backtracking by Jan Goyvaerts describes the issue and some ways to prevent it.
Regular Expression Matching Can Be Simple And Fast (but...) by Russ Cox also describes the issue, as well as implementing regexes as finite automata, without using backtracking and therefore immune to this problem. It also has pictures.
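For a feel of the automaton approach, here is a minimal sketch that tracks the set of states the automaton could be in after each character, instead of trying one path at a time. It is not a regex compiler: the two-state automaton is written by hand for this one pattern, and the names TRANSITIONS and nfa_search are made up for the illustration.

```python
# Hand-written automaton for a*c, which accepts the same strings as (a*)*c.
# State 0: start, loops on 'a'. State 1: accepting, reached on 'c'.
TRANSITIONS = {
    (0, 'a'): {0},
    (0, 'c'): {1},
}
START, ACCEPT = 0, 1

def nfa_search(text):
    """Return True if some substring of text is accepted by the automaton."""
    states = set()
    for ch in text:
        states.add(START)  # a match may begin at any position
        next_states = set()
        for s in states:
            next_states |= TRANSITIONS.get((s, ch), set())
        if ACCEPT in next_states:
            return True
        states = next_states
    return False

print(nfa_search('a' * 30))   # no 'c' anywhere
print(nfa_search('xaac'))     # 'aac' matches
```

Each input character is looked at once, and the work per character is bounded by the number of states, so the failing case takes time proportional to the length of the input rather than exponential in it.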

In practice, if you can, avoid patterns that allow for multiple ways to match a string. The example here, (a*)*c, is obviously silly, since it's equivalent to a*c which doesn't have the nested repetition.
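A quick check of that equivalence, as a sketch: the sample strings are arbitrary, and the helper span_or_none is made up for the comparison. The two patterns agree on where they match; the only difference is that the nested one also has a capture group.

```python
import re

nested = re.compile('(a*)*c')
flat = re.compile('a*c')

def span_or_none(match):
    """Return the matched span, or None if there was no match."""
    return match.span() if match else None

# Short inputs only, so the nested pattern stays fast here.
samples = ['c', 'ac', 'aaac', 'xaac', 'aaa', 'b', '']
for s in samples:
    same = span_or_none(nested.search(s)) == span_or_none(flat.search(s))
    print(repr(s), same)

# The flat pattern handles the original slow input without trouble.
print(flat.search('a' * 30))
```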


Edited at 2021-07-11
