A fast and efficient, not-so-complex word content filter

Question

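Without getting into a Bayesian-level content classification project, I'm trying to make a very simple profanity filter for Twitter accounts.

In essence, I just join all of a user's tweets into one large text blob and run the content against my filter, which essentially works like this: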

badwords = ['bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc']

s = 'Get free xxx etc'

score = 0

for b in badwords:
    if b in s:
        score = score+1

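I have a list of 3k bad words (what a perverted world we live in!) and would ideally like to create a score based not only on whether a word occurs, but on how many times each word occurs. So if a word occurs twice, the score should be incremented twice.

The score generator above is extremely simple, but it rescans the string thousands of times (once per bad word), and it doesn't increment the way I want it to.

How can this be adjusted for performance and accuracy?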

Answer 1

So len(badwords) == 3000, which means that with tweet_words = s.split() you will usually have len(tweet_words) < len(badwords); hence

for b in badwords:
    if b in s:
        score = score+1

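is really inefficient.

First thing to do: make badwords a frozenset. That way, checking whether something occurs in it is a hash lookup (constant time on average) and therefore much faster.

Then, look up the tweet's words in badwords, not the other way around. Note that this matches whole words only, whereas b in s also matches substrings: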

for t_word in tweet_words:
    if t_word in badwords:
        score = score+1
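
A minimal runnable sketch of that loop, with assumptions stated up front: the four-entry badwords set is a stand-in for the real 3k list, score_text is a name made up for illustration, and lowercasing is added so that 'XXX' and 'xxx' count as the same word.

```python
# Hypothetical sample data; the real list would hold about 3k entries.
badwords = frozenset(['bad', 'worse', 'xxx', 'etc'])

def score_text(text, badwords=badwords):
    """Count every whitespace-separated token that is a bad word."""
    score = 0
    for t_word in text.lower().split():
        if t_word in badwords:  # hash lookup, constant time on average
            score += 1
    return score

# A word that appears twice is counted twice.
print(score_text('Get free xxx etc xxx'))
```

Because only whole tokens are looked up, a text such as 'badly' no longer scores the way it does with the substring test b in s.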

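Then, be a bit more functional!

score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())

This will be faster than the full loops, because Python needs to construct and destroy fewer temporary contexts (that is technically a bit misleading, but you save a lot of CPython PyObject creations).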
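
Whether the functional form is actually faster depends on the interpreter version and the input, so it is worth measuring. The following is only a sketch using the standard timeit module; the sample tweet, the small badwords set, and the function names score_loop and score_sum are assumptions made for illustration.

```python
import timeit

# Hypothetical stand-ins for the real word list and tweet blob.
badwords = frozenset(['bad', 'worse', 'xxx', 'etc'])
tweet = 'Get free xxx etc ' * 200

def score_loop(tweet):
    """Explicit loop with a running counter."""
    score = 0
    for word in tweet.split():
        if word.lower() in badwords:
            score += 1
    return score

def score_sum(tweet):
    """Same count as a generator expression fed to sum()."""
    return sum(word.lower() in badwords for word in tweet.split())

# Both variants must agree before their timings are worth comparing.
assert score_loop(tweet) == score_sum(tweet)

for fn in (score_loop, score_sum):
    seconds = timeit.timeit(lambda: fn(tweet), number=200)
    print(f'{fn.__name__}: {seconds:.4f}s')
```

The printed timings vary from machine to machine; the set lookup is the dominant saving in both variants.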

Answer 2

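If a bad word should not match as a substring and you want a count for each word, you can use a dict. You will also need to lowercase the words in the user's tweets and strip any punctuation from them: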

from string import punctuation
badwords = dict.fromkeys(('bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc'),0)

s = 'Get free xxx! etc!!'

for word in s.split():
    word = word.lower().strip(punctuation)
    if word in badwords:
        badwords[word] += 1


print(badwords)
print(sum(badwords.values()))
{'momwouldbeangry': 0, 'xxx': 1, 'etc': 1, 'bad': 0, 'thousandsofperversesayings': 0, 'worse': 0}
2

If you don't care which words appear and just want the count:

from string import punctuation
badwords = {'bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc'}

s = 'Get free xxx! etc!!'

print(sum(word.lower().strip(punctuation) in badwords for word in s.split()))
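
One limitation of split() plus strip(punctuation) is that words joined by punctuation without whitespace, such as 'xxx,etc', stay glued together and are missed. A possible variation (a different technique from the one above, not part of the original answer) is to tokenize with a regular expression. The pattern and the name count_badwords below are assumptions; the pattern treats a word as a run of ASCII letters, digits, or apostrophes.

```python
import re

badwords = {'bad', 'worse', 'xxx', 'etc'}

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def count_badwords(text):
    """Count bad-word tokens, splitting on anything outside the pattern."""
    return sum(token in badwords for token in WORD_RE.findall(text.lower()))

print(count_badwords('Get free xxx,etc... XXX!'))
```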

Answer 3

Try using collections.Counter:

In [1]: text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum"""

In [2]: badwords = ['in', 'ex']

In [3]: from collections import Counter

In [9]: words = text.lower().split()

In [10]: c = Counter(words)

In [11]: c
Out[11]: Counter({'ut': 3, 'in': 3, 'dolore': 2, 'dolor': 2, 'adipiscing': 1, 'est': 1, 'exercitation': 1, 'aute': 1, 'proident,': 1, 'elit,': 1, 'irure': 1, 'consequat.': 1, 'minim': 1, 'pariatur.': 1, 'nostrud': 1, 'laboris': 1, 'occaecat': 1, 'lorem': 1, 'esse': 1, 'quis': 1, 'anim': 1, 'amet,': 1, 'ipsum': 1, 'laborum': 1, 'sunt': 1, 'qui': 1, 'incididunt': 1, 'culpa': 1, 'consectetur': 1, 'aliquip': 1, 'duis': 1, 'cillum': 1, 'excepteur': 1, 'cupidatat': 1, 'labore': 1, 'magna': 1, 'do': 1, 'fugiat': 1, 'reprehenderit': 1, 'ullamco': 1, 'ad': 1, 'commodo': 1, 'tempor': 1, 'non': 1, 'et': 1, 'ex': 1, 'deserunt': 1, 'sit': 1, 'eu': 1, 'voluptate': 1, 'mollit': 1, 'eiusmod': 1, 'aliqua.': 1, 'nulla': 1, 'sed': 1, 'sint': 1, 'nisi': 1, 'enim': 1, 'veniam,': 1, 'velit': 1, 'id': 1, 'officia': 1, 'ea': 1})

In [12]: scores = [v for k, v in c.items() if k in badwords]

In [13]: scores
Out[13]: [1, 3]

In [14]: sum(scores)
Out[14]: 4
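
The session above leaves punctuation attached to some tokens ('proident,', 'aliqua.'), so a bad word followed by a comma would be missed. As a sketch under stated assumptions, the same Counter idea can be combined with punctuation stripping, and badwords can be a set that is iterated directly instead of scanning every counted word. The shortened sample text and the names per_word and total are made up for illustration.

```python
from collections import Counter
from string import punctuation

badwords = {'in', 'ex'}
# Shortened, hypothetical sample text.
text = "Duis aute irure dolor in reprehenderit in voluptate. Ex ea commodo, in culpa."

# Lowercase, split on whitespace, and trim punctuation from each token.
words = (w.strip(punctuation) for w in text.lower().split())
counts = Counter(words)

# Look up each bad word in the counter rather than testing every counted word.
per_word = {w: counts[w] for w in badwords if w in counts}
total = sum(per_word.values())

print(per_word, total)
```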


python

list

spam