Python NLTK: search for occurrence of a word

Question

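I'm using the Brown corpus via "brown.words()", which gives me a list of 1161192 words.

Now I want to find every occurrence of the word "have", so whenever the corpus contains "has", "had", "haven't" and so on, I want to do something with it (maybe push it into an array, maybe increment a counter, maybe something else).

Edit: please note that this question is about finding matching words. If I search for "have", I want a way to match it against "haven't" or "had", so .count() does not solve the problem, since it does nothing to help with matching.

Sample code I would use if stemming/lemmatization works: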

from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer

def findWordFamily(findWord):
    wordFamily = []

    lmtzr = WordNetLemmatizer()

    # lemmatize() treats its input as a noun unless a POS is given
    findWord = lmtzr.lemmatize(findWord)
    for word in brown.words():
        lemma = lmtzr.lemmatize(word)
        if lemma == findWord:
            wordFamily.append(word)

    return wordFamily
print(findWordFamily("have"))
# desired output: ["have", "have", "had", "having", "haven't", "having"]

But the problem is:

for word in brown.words():
    lemma = lmtzr.lemmatize(word)
    # if word is "having" lemma also is "having" instead of "have"

Answer 1

Before trying to match the words, you probably need to do some pre-processing, so that "has" or "haven't" end up transformed into "have".

I suggest you take a look at stemming or lemmatization:

NLTK's WordNet Lemmatizer (one of my favorites): http://www.nltk.org/_modules/nltk/stem/wordnet.html

NLTK's stemmers: http://www.nltk.org/howto/stem.html
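
To compare the two options, here is a minimal sketch that groups surface forms by their stem. The helper names group_by_stem and porter_groups are hypothetical, made up for this illustration; only PorterStemmer and its stem() method come from NLTK. The NLTK import is done lazily so the grouping helper can be tried with any stemming function.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the sorted surface forms that produced it.

    `stem` is any callable taking a word and returning its stem.
    """
    groups = defaultdict(set)
    for word in words:
        lowered = word.lower()
        groups[stem(lowered)].add(lowered)
    return {s: sorted(forms) for s, forms in groups.items()}


def porter_groups(words):
    """Group words using NLTK's PorterStemmer (hypothetical convenience wrapper)."""
    from nltk.stem import PorterStemmer  # imported lazily; needs nltk installed

    return group_by_stem(words, PorterStemmer().stem)
```

Calling porter_groups(["have", "having", "has", "had"]) shows which forms end up under the same stem. Because a stemmer only strips suffixes, irregular forms such as "had" should not be expected to land on the same stem as "have", which is why lemmatization is probably the better fit for this question.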

Note: for the lemmatizer to handle verbs well, you must tell it that they are actually verbs.

nltk.stem.WordNetLemmatizer().lemmatize('having', 'v')
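
Applied to the function from the question, that could look like the sketch below. It is only one possible approach under a few assumptions: every token is lemmatized as a verb (so non-verb uses of a word may match too), a trailing "n't" is stripped so that "haven't" is treated as "have" (too crude for "can't" or "won't"), and the result is a Counter rather than a list. The names find_word_family and brown_word_family are hypothetical; the second one needs the 'brown' and 'wordnet' NLTK data packages.

```python
from collections import Counter


def find_word_family(target, tokens, lemmatize):
    """Count every token whose verb lemma equals the verb lemma of `target`.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    def verb_lemma(word):
        word = word.lower()
        # Assumption: a trailing "n't" marks a negated form of the base word.
        if word.endswith("n't"):
            word = word[:-3]
        return lemmatize(word, "v")

    wanted = verb_lemma(target)
    # Keys are the tokens exactly as they appear in the corpus.
    return Counter(tok for tok in tokens if verb_lemma(tok) == wanted)


def brown_word_family(target):
    """Run the search over the Brown corpus (hypothetical wrapper)."""
    from nltk.corpus import brown  # imported lazily; needs NLTK data
    from nltk.stem import WordNetLemmatizer

    return find_word_family(target, brown.words(), WordNetLemmatizer().lemmatize)
```

brown_word_family("have") then returns a Counter keyed by the matching surface forms; list(counter.elements()) gives the flat list the question asks for. Lemmatizing more than a million tokens takes a while, so caching the lemma per distinct word may be worth considering.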

Hope this helps!

python

corpus

stemming

nltk

lemmatization