Have you ever heard of regular expression denial of service (ReDoS)? It is a kind of attack which could be used to drown the server by regular expressions. It is based on the fact that regular expressions have a backtracking feature and may takes a long time to execute at certain circumstances. This kind of backtracking is also called catastrophic backtracking.
So what is backtracking in regular expressions? Normally, regular expressions match greedily. If there is no match, then the backtrack process starts. For example, say we an regular expression a*b
and we are given a string aaaaa
. The star character will try to match all of the character a
and then find there is no match. Then the star will try to backtrack, which means only match first 4 character a
, and then still find no match. Go on and on, until backtracking to empty string. So this simple example here actually takes 6 tries here.
aaaaa
aaaa
aaa
aa
a
Normally, this backtracking should not cause any trouble. As you can see here, the try times depend on the length of the string. So complexity is O(n), which means ok for most cases.
The problem become evident when we have a double backtrack. Let's see another example. Say we have an regular expression (a*)*b
. As you can see, one star in inside another star, it will cause a double backtrack.
If the input string is just a
, then it will try 2 times. That's ok.
a
(a), ()
If the input string is aa
, then it will try 4 times.
aa
(aa), (a)(a), (a)(), ()
As you can see, not only the first part is backtracking, the second part is begin to backtrack also.
If the input string is aaa
, then it will try 8 times.
aaa
(aaa)
(aa)(a)
(aa)
(a)(aa)
(a)(a)(a)
(a)(a)
(a)
()
Now we have three parts, all of them have to go through backtracking.
To make it longer as aaaa
, then it will try 16 times.
aaaa
(aaaa)
(aaa)(a)
(aaa)
(aa)(aa)
(aa)(a)(a)
(aa)(a)
(aa)
(a)(aaa)
(a)(aa)(a)
(a)(aa)
(a)(a)(aa)
(a)(a)(a)(a)
(a)(a)(a)
(a)(a)
(a)
()
This looks bad. As you can see, with double backtracking, the complexity becomes 2^n
.
So if now the attacker finds this weakness, then make a longer input string, then server may goes into 100% cpu and takes a long time to execute this regular expression.
That's it. Keep this in mind when writing regular expressions.