多扩展字符串匹配算法

Algorithm for multiple extended string matching

我需要为文本中的多个扩展字符串匹配实现一个算法。

Extended表示存在通配符(任意数量的字符而不是星号),例如:

abc*def //matches abcdef, abcpppppdef etc.

Multiple 表示同时搜索多个字符串模式(不是对每个模式单独搜索),例如:

abc*def
abc
whatever
some*string

问题:

可以进行多个扩展字符串匹配的快速算法是什么?

最好针对 SIMD 指令和多核实现进行优化。开源实现 (C/C++/Python) 也很棒。

谢谢

我认为从阅读以下维基百科文章的部分开始可能有意义:http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times。然后,您可以对算法进行文献综述,实现 正则表达式模式匹配 .

在实用实现方面,有多种正则表达式 (regex) 引擎以库的形式存在,重点在一种或多种编程语言上。最有可能的是,最好和最受欢迎的选择是 C/C++ PCRE library, with its newest version PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend you to read this paper, along with the two other, linked within the article, for details on algorithms, implementation and benchmarks. Just recently, Google has released RE2/J - a linear time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library TRE, which offers way too many cool features to list here. However, you can read about them all on this page.

P.S. 如果以上内容对您来说还不够,请随时访问 this Wikipedia page 了解更多正则表达式的详细信息 engines/libraries以及他们在几个标准上的比较。希望我的回答对你有所帮助。