Python

Question

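I know similar questions have been asked several times, but mine is a bit different: I am looking for a time-efficient solution in Python.

I have a set of words, some of which end with "*" and some of which do not: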

words = set(["apple", "cat*", "dog"])

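I have to count their total occurrences in a text, where anything may follow the asterisk ("cat*" means every word that starts with "cat"). The search must be case-insensitive. Consider this example: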

text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

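I would like a final score of 4 (= cat* x 2 + dog + apple). Note that "cat*" is counted twice, because the plural form is included as well, while "apple" is counted only once, because its plural is not considered (there is no trailing asterisk).

I have to repeat this operation over a large number of documents, so I need a fast solution. I don't know whether regex or flashtext can deliver one. Can you help me?

Edit

I forgot to mention that some of my words contain punctuation, for example: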

words = set(["apple", "cat*", "dog", ":)", "I've"])

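This seems to cause additional problems when compiling the regex. Can the code you have already provided be adapted to handle these two additional words?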

Answer 1

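You can do this with a regular expression: build a regex from the set of words, putting word boundaries around each word but leaving off the trailing word boundary for words that end with *. Compiling the regex once and reusing it across documents should help performance: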

import re

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
matches = regex.findall(text)
print(len(matches))

Output:

4
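For the words with punctuation mentioned in the question's edit, the pattern above needs two changes: an unescaped ":)" leaves an unbalanced parenthesis in the regex, and \b only matches between a word character and a non-word character, so it cannot anchor an entry that starts or ends with punctuation. A minimal sketch of one way around this, assuming it is acceptable to escape every entry with re.escape and to replace \b with the lookarounds (?<!\w) and (?!\w); the helper name build_pattern and the sample sentence are illustrative only:

```python
import re

def build_pattern(words):
    """Compile one case-insensitive alternation from the word set.

    Entries ending in '*' match as prefixes; all others must not be
    followed by a word character. (Hypothetical helper, not from the answer.)
    """
    parts = []
    # Longest entries first, so a longer entry is tried before a shorter one.
    for w in sorted(words, key=len, reverse=True):
        if w.endswith("*"):
            parts.append(r"(?<!\w)" + re.escape(w[:-1]))
        else:
            parts.append(r"(?<!\w)" + re.escape(w) + r"(?!\w)")
    return re.compile("|".join(parts), re.IGNORECASE)

words = {"apple", "cat*", "dog", ":)", "I've"}
pattern = build_pattern(words)

sample = "I've a cat :) and two CATS, apples, an apple and a dog."
print(len(pattern.findall(sample)))  # 6: I've, cat, :), CAT, apple, dog
```

One consequence of this choice is that ":)" glued directly to a word (as in "ok:)") is not counted, because the leading lookbehind rejects a preceding word character. Drop the lookbehind for such entries if they should count.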

Answer 2

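Build a trie from the words you want to search for.

Then iterate over the characters of the string you want to check.

Each time you reach a leaf in the tree, increment the counter and skip to the next word.
Each time there is no path to follow, skip to the next word.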
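The steps above can be sketched roughly as follows. This is only one possible reading of the idea, under some assumptions: the text is split into tokens with \w+ (so entries containing punctuation such as ":)" are not handled), matching is done on lowercased text, and entries ending in "*" count as soon as their prefix has been consumed. The names build_trie, token_matches and count_matches are illustrative.

```python
import re

KINDS = "$kinds"          # multi-character key, so it never collides with a single character
PREFIX, EXACT = "prefix", "exact"

def build_trie(words):
    """Build a nested-dict trie; each terminal node records how it may match."""
    root = {}
    for w in words:
        kind = PREFIX if w.endswith("*") else EXACT
        stem = w[:-1] if kind == PREFIX else w
        node = root
        for ch in stem.lower():
            node = node.setdefault(ch, {})
        node.setdefault(KINDS, set()).add(kind)
    return root

def token_matches(trie, token):
    """Walk the trie character by character for a single token."""
    node = trie
    for ch in token:
        node = node.get(ch)
        if node is None:              # no path: give up on this token
            return False
        if PREFIX in node.get(KINDS, ()):
            return True               # a starred entry matched as a prefix
    return EXACT in node.get(KINDS, ())

def count_matches(trie, text):
    return sum(token_matches(trie, t) for t in re.findall(r"\w+", text.lower()))

trie = build_trie({"apple", "cat*", "dog"})
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
print(count_matches(trie, text))  # 4
```

The trie is built once and can be reused for every document; whether a pure-Python walk like this beats a compiled regex in practice would need to be measured on real data.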

Answer 3

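Disclaimer: I am the author of trrex.

For this problem, if you really want a scalable solution, use a trie regex instead of a union regex. See this for an explanation. One approach is to use trrex, for example: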

import trrex as tx
import re

words = {"apple", "cat*", "dog"}
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

prefix_set = {w.replace('*', '') for w in words if w.endswith('*')}
full_set = {w for w in words if not w.endswith('*')}

prefix_pattern = re.compile(tx.make(prefix_set, right=''), re.IGNORECASE)  # '' as we only care about prefixes
full_pattern = re.compile(tx.make(full_set), re.IGNORECASE)

res = prefix_pattern.findall(text) + full_pattern.findall(text)
print(res)

Output:

['cat', 'CAT', 'apple', 'dog']

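For a specific use of trrex, see the write-up in which the experiments described show a 10x improvement over the naive union regex. A trie regex takes advantage of common prefixes to create an optimal regex. For the words: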

['baby', 'bat', 'bad']

it creates the following:

ba(?:by|[td])
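To make the idea concrete, here is a minimal sketch of how a trie can be turned into a regex by factoring out shared prefixes. It is not trrex's implementation: the function trie_regex is hypothetical, and unlike the pattern shown above it does not merge single-character alternatives into a character class such as [td].

```python
import re

def trie_regex(words):
    """Build a prefix-factored regex string from a collection of words."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word ends here"

    def to_pattern(node):
        ends_here = "" in node
        branches = [
            re.escape(ch) + to_pattern(child)
            for ch, child in sorted(node.items())
            if ch != ""
        ]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = "(?:" + "|".join(branches) + ")"
        # If a word also ends at this node, the continuation is optional.
        return group + "?" if ends_here else group

    return to_pattern(trie)

pattern_text = trie_regex(["baby", "bat", "bad"])
print(pattern_text)  # ba(?:by|d|t)

compiled = re.compile(r"\b" + pattern_text + r"\b", re.IGNORECASE)
print(compiled.findall("A baby saw a bat, which was bad for the bats."))
```

Because the shared "ba" is matched only once, the regex engine has less backtracking to do than with baby|bat|bad, and that saving grows with the size of the word list.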
