How to check if a word is more common in its plural form than in its singular form in an array of words (with Python/NLTK)?

Question

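I am working through the NLTK exercises, but I am stuck on this one: "Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)" I have spent a day thinking about it and trying different things, but I just can't get it to work. Thank you.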

Answer 1

Take a corpus and do a count:

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['dollar']
5
>>> word_counts['dollars']
15

But note that counting surface strings alone can be ambiguous, because the same string may be a noun or a verb, e.g.

>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['hits']
14
>>> word_counts['hit']
34
>>> word_counts['needs']
14
>>> word_counts['need']
30

POS-sensitive counts (see type vs. token):

>>> texts = brown.tagged_words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts[('need', 'NN')]
6
>>> word_counts[('needs', 'NNS')]
3
>>> word_counts[('hit', 'NN')]
0
>>> word_counts[('hits', 'NNS')]
0
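The POS-sensitive count above can be extended to the whole exercise. The following is only a minimal sketch, not the exercise's official solution: the function name plural_dominant_nouns and the hand-made sample are made up for illustration, and it assumes the Brown-style tags NN and NNS seen above. Tags with suffixes (Brown has variants such as NN-TL) are not normalized here.

```python
from collections import Counter

def plural_dominant_nouns(tagged_words):
    """Return {singular: (singular_count, plural_count)} for nouns whose
    regular -s plural (tagged NNS) is more frequent than the singular (NN)."""
    # Lowercase so that 'Dollars' and 'dollars' are counted together.
    counts = Counter((word.lower(), tag) for word, tag in tagged_words)
    result = {}
    for (word, tag), plural_count in counts.items():
        # Only regular plurals: tagged NNS and ending in -s.
        if tag != 'NNS' or not word.endswith('s'):
            continue
        singular = word[:-1]
        # A Counter returns 0 for a missing key without inserting it.
        singular_count = counts[(singular, 'NN')]
        if plural_count > singular_count:
            result[singular] = (singular_count, plural_count)
    return result

# Tiny hand-made sample standing in for brown.tagged_words()
sample = [
    ('dollars', 'NNS'), ('dollars', 'NNS'), ('dollar', 'NN'),
    ('need', 'NN'), ('need', 'NN'), ('needs', 'NNS'),
    ('needs', 'VBZ'), ('needs', 'VBZ'), ('needs', 'VBZ'),
]
print(plural_dominant_nouns(sample))  # {'dollar': (1, 2)}
```

With NLTK and the Brown corpus installed, passing brown.tagged_words() in place of sample should work the same way, since it also yields (word, tag) pairs. Note that the verb uses of "needs" (VBZ) no longer inflate the plural noun count.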

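Let's do a little reverse engineering. The brown corpus is convenient because it comes already tokenized and tagged in NLTK, but if you want to use your own corpus, you have to consider the following: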

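Which corpus to use? How to tokenize it? How to add POS tags?
What are you counting? Types or tokens?
How to handle POS ambiguity? How to tell nouns from non-nouns?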
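To make the type vs. token question concrete, here is a small sketch on a made-up word list (the helper name token_and_type_counts is hypothetical). Tokens are individual occurrences; types are distinct word forms.

```python
from collections import Counter

def token_and_type_counts(words):
    """Return (number of tokens, number of types) for a list of words."""
    counts = Counter(words)
    n_tokens = sum(counts.values())  # every occurrence counts
    n_types = len(counts)            # each distinct form counts once
    return n_tokens, n_types

words = ['dollars', 'dollar', 'dollars', 'need', 'dollars']
print(token_and_type_counts(words))  # (5, 3)
```

For the plural vs. singular comparison, it is the token counts per type (the values of the Counter) that get compared.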

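Finally, consider:

Is there really a way to find out whether a word is more common in its plural or its singular form in the language as a whole? Or will the answer always be relative to the corpus you chose to analyze?
Are there nouns that have no plural form or no singular form? (Most probably the answer is yes.)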

Answer 2

brw is an array of words.

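from collections import Counter

# brw is assumed to be a list of word strings
counter = Counter(brw)
plurals = []
for word in brw:
    # only look at candidate singulars, i.e. words not ending in 's'
    if not word.endswith('s'):
        plural = counter[word + 's']
        singul = counter[word]
        if plural > singul:
            plurals.append(word + 's')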

plurals is the output array. It contains only plural forms (with duplicates, hmm). If I used set(), they would not be repeated. Is this right?
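On the set() question: yes, a set removes the repeats. A possible variant, sketched here under the assumption that the input is a plain list of word strings (the function name more_common_as_plural and the sample list are made up for illustration), iterates over the distinct words in the Counter so each word is checked only once:

```python
from collections import Counter

def more_common_as_plural(words):
    """Return the set of -s plurals that occur more often than their singular."""
    counter = Counter(words)
    plurals = set()
    # Iterating over the Counter visits each distinct word exactly once.
    for word in counter:
        if word.endswith('s'):
            continue  # treat only words without a final -s as singulars
        if counter[word + 's'] > counter[word]:
            plurals.add(word + 's')
    return plurals

sample = ['dollars', 'dollars', 'dollar', 'hit', 'hit', 'hits', 'needs']
print(more_common_as_plural(sample))  # {'dollars'}
```

Like the loop above, this only finds plurals whose singular occurs at least once ("needs" is skipped in the sample because "need" never appears), and it works on surface strings, so it cannot tell noun uses from verb uses.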

python

nltk