此正则表达式的对抗性输入

Question

有人问我一个面试题：写一个函数match(s, t)来判断一个字符串s是否是另一个字符串[=14]的广义子串 =].更具体地说，当且仅当删除 t 中的某些字符可以将其等同于 s 时，match 应该 return 为真。例如，match("abc", "abbbc")为True，因为我们可以去掉t.

中多出来的两个b

面试官肯定期待某种递归解决方案，但我感觉很冒险，所以写了

def match(s, t):
    return re.search('.*?'.join(s), t) is not None

这似乎满足了所有测试用例，但后来我开始怀疑是否存在任何可以占用此操作的对抗性输入（并可能使我们的服务容易受到 DoS 攻击）。有关详细讨论，请参阅 this great article，但作为一个简单的示例，请尝试

re.match("a?"*29+"a"*29, "a"*29)

并且需要几秒钟才能找到匹配项。原因是 re 实现了回溯正则表达式引擎：

Consider the regular expression a?ⁿaⁿ. It matches the string aⁿ when the a? are chosen not to match any letters, leaving the entire string to be matched by the aⁿ. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2ⁿ possibilities. Only the very last possibility—choosing zero for all the ?—will lead to a match. The backtracking approach thus requires O(2ⁿ) time, so it will not scale much beyond n=25.

回到面试问题，技术上 '.*?'.join(s) 给了我们至少 len(s) - 1 个选择，这意味着可以有一些类似于 "a?"*29+"a"*29 和 [=25] 的对抗性输入=] 以上，但经过反复试验后我未能找到它们。

问题：什么 s 和 t 会使 match 慢得离谱？

Answer 1

惰性量词通常对性能非常好，但 AFAIK 它们不会阻止病态的强调行为。

当正则表达式的开头与文本的开头匹配但匹配较早并且会在文本结尾处失败需要大量回溯以“修复”错误的早期惰性匹配时尤其如此正则表达式的开头。

在您的情况下，这是一个需要指数级步数的病态输入示例：

# This should take at least few minutes to compute
n = 12
match('ab'*(n+1), 'abbbbb'*n+'a')

这里是根据n的值Pythonmatch需要的步数和时间：

 n |  steps |    time
 1 |     44 |   2.4 µs
 2 |    374 |   6.4 µs
 3 |   2621 |  22.1 µs
 4 |  18353 | 131.3 µs
 5 | 126211 | 925.8 µs
 6 |    -   |   6.2 ms
 7 |    -   |  42.2 ms
 8 |    -   | 288.7 ms
 9 |    -   |  1.97 s

您可以在 this great website 上理解和分析生成的正则表达式匹配行为（在这种情况下更具体地说是回溯）。

Answer 2

您的解决方案因回溯而受到影响，这就是为什么接受的答案能够提供会大大减慢其速度的输入。

这是一个正则表达式解决方案，它与我之前的答案非常相似，而且它似乎也避免了困扰您当前解决方案的问题。

def match(s, t):
    return re.search("".join(f'[^{re.escape(c)}]*{re.escape(c)}' for c in s), t) is not None

上一个回答

这并没有直接回答你的问题，但是

Surely the interviewer is expecting some kind of recursive solution...

我不这么认为。这样的东西也可以工作

def match(s, t):
    i = -1
    return all ((i := t.find(c, i + 1)) != -1 for c in s)

我只是想提供一个不需要递归或正则表达式的替代解决方案，因此会导致您提到的漏洞。

此正则表达式的对抗性输入

Adversarial input for this regex

python

regex

security

performance

non-greedy

上一个回答