Why is this regular expression so slow in Java?

AntonPiatek
:

I recently had a SonarQube rule (https://rules.sonarsource.com/java/RSPEC-4784) bring to my attention some performance issues which could be used as a denial of service against a Java regular expression implementation.

Indeed, the following Java test shows how slow the wrong regular expression can be:

    import org.junit.Test;

    public class RegexTest {

    @Test
    public void fastRegex1() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)b");
    }

    @Test
    public void fastRegex2() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaab".matches("(a+)+b");
    }

    @Test
    public void slowRegex() {
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b");
    }
}

As you can see, the first two tests are fast, the third one is incredibly slow (in Java 8)

Enter image description here

The same data and regex in Perl or Python, however, is not at all slow, which leads me to wonder why it is that this regular expression is so slow to evaluate in Java.

$ time perl -e '"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs" =~ /(a+)+b/ && print "$1\n"'
aaaaaaaaaaaaaaaaaaaaaaaaaaaa

real    0m0.004s
user    0m0.000s
sys     0m0.004s

$ time python3 -c 'import re; m=re.search("(a+)+b","aaaaaaaaaaaaaaaaaaaaaaaaaaaabs"); print(m.group(0))'
aaaaaaaaaaaaaaaaaaaaaaaaaaaab

real    0m0.018s
user    0m0.015s
sys     0m0.004s

What is it about the extra matching modifier + or trailing character s in the data which makes this regular expression so slow, and why is it only specific to Java?

Andy Turner
:

Caveat: I don't really know much about regex internals, and this is really conjecture. And I can't answer why Java suffers from this, but not the others (also, it is substantially faster than your 12 seconds in jshell 11 when I run it, so it perhaps only affects certain versions).

"aaaaaaaaaaaaaaaaaaaaaaaaaaaabs".matches("(a+)+b")

There are lots of ways that lots of as could match:

(a)(a)(a)(a)
(aa)(a)(a)
(a)(aa)(a)
(aa)(aa)
(a)(aaa)
etc.

For the input string "aaaaaaaaaaaaaaaaaaaaaaaaaaaab", it will greedily match all of those as in a single pass, match the b, job done.

For "aaaaaaaaaaaaaaaaaaaaaaaaaaaabs", when it gets to the end and finds that the string doesn't match (because of the s), it's not correctly recognizing that the s means it can never match. So, having gone through and likely matched as

(aaaaaaaaaaaaaaaaaaaaaaaaaaaa)bs

it thinks "Oh, maybe it failed because of the way I grouped the as - and goes back and tries all the other combinations of the as.

(aaaaaaaaaaaaaaaaaaaaaaaaaaa)(a)bs  // Nope, still no match
(aaaaaaaaaaaaaaaaaaaaaaaaaa)(aa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa)(aaa)bs  // ...
...
(a)(aaaaaaaaaaaaaaaaaaaaaaaaaaa)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaaa(a)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaaa(aa)(a)bs  // ...
(aaaaaaaaaaaaaaaaaaaaaaaa(aaa)(a)bs  // ...
...

There are lots of these (I think there are something like 2^27 - that's 134,217,728 - combinations for 28 as, because each a can either be part of the previous group, or start its own group), so it takes a long time.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related