Node/V8 中正则表达式匹配是如何实现的?

How is regular expression matching implemented in Node / V8?

我发现 an article 这表明正则表达式匹配通常是使用性能可能不佳的算法而不是建议的 Thompson NFA 算法来实现的。

考虑到这一点,这在 Node 或 V8 中是如何实现的?是否可以使用 Thompson NFA 的 JS 实现来提高性能,也许如果只使用有限的功能子集(可能删除先行或其他 "advanced" 功能)?

如本 announcement from Chrome's development team, V8 engine uses Irregexp 正则表达式引擎所述:

以下是关于该引擎实现的一些引用:

A fundamental decision we made early in the design of Irregexp was that we would be willing to spend extra time compiling a regular expression if that would make running it faster. During compilation Irregexp first converts a regexp into an intermediate automaton representation. This is in many ways the "natural" and most accessible representation and makes it much easier to analyze and optimize the regexp. For instance, when compiling /Sun|Mon/ the automaton representation lets us recognize that both alternatives have an 'n' as their third character. We can quickly scan the input until we find an 'n' and then start to match the regexp two characters earlier. Irregexp looks up to four characters ahead and matches up to four characters at a time.

After optimization we generate native machine code which uses backtracking to try different alternatives. Backtracking can be time-consuming so we use optimizations to avoid as much of it as we can. There are techniques to avoid backtracking altogether but the nature of regexps in JavaScript makes it difficult to apply them in our case, though it is something we may implement in the future.

因此 V8 确实编译为本地自动机表示 - 尽管它不使用 Thompson NFA。

关于性能,this article 将 V8 正则表达式性能与其他 libraries/languages 进行了比较。