How to check if a word is more common in its plural form than in its singular form in an array of words (with Python/NLTK)?

I'm working through the NLTK exercises, but I can't solve this one: "Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)" I spent a day thinking about it and tried a few things, but I just can't get it. Thanks.

Take a corpus and do a count:

>>> from collections import Counter
>>> from nltk.corpus import brown
>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['dollar']
5
>>> word_counts['dollars']
15

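But note that counting surface strings alone can be ambiguous, because forms such as "hits" and "needs" can be either plural nouns or verbs, e.g.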

>>> texts = brown.words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts['hits']
14
>>> word_counts['hit']
34
>>> word_counts['needs']
14
>>> word_counts['need']
30

POS-sensitive counts (see types vs. tokens):

>>> texts = brown.tagged_words()[:10000]
>>> word_counts = Counter(texts)
>>> word_counts[('need', 'NN')]
6
>>> word_counts[('needs', 'NNS')]
3
>>> word_counts[('hit', 'NN')]
0
>>> word_counts[('hits', 'NNS')]
0
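
To move from spot-checking single words to answering the exercise, the same tagged counts can be scanned for every plural noun. Here is a minimal sketch, assuming Brown-style tags where `NN` marks a singular common noun and `NNS` a plural one. The function name `plural_dominant_nouns` and the hand-made `sample` list are made up for illustration. Lowercasing the words and matching the tags exactly (which ignores suffixed tags such as `NN-TL`) are my own simplifications, not something the exercise prescribes.

```python
from collections import Counter

def plural_dominant_nouns(tagged_words):
    """Return regular -s plurals that outnumber their singular noun form."""
    # Count (lowercased word, tag) pairs so 'Dollars' and 'dollars' are merged.
    counts = Counter((word.lower(), tag) for word, tag in tagged_words)
    result = set()
    for (word, tag), plural_count in counts.items():
        # Only regular plurals: tagged NNS and ending in -s.
        if tag == 'NNS' and word.endswith('s') and len(word) > 1:
            singular = word[:-1]
            # A missing key in a Counter simply counts as 0.
            if plural_count > counts[(singular, 'NN')]:
                result.add(word)
    return sorted(result)

# Hypothetical toy data; 'needs' is mostly a verb (VBZ) here.
sample = [('dollars', 'NNS'), ('dollars', 'NNS'), ('dollar', 'NN'),
          ('need', 'NN'), ('need', 'NN'), ('needs', 'NNS'),
          ('needs', 'VBZ'), ('needs', 'VBZ'), ('needs', 'VBZ')]
print(plural_dominant_nouns(sample))
```

On the real data the call would be `plural_dominant_nouns(brown.tagged_words())`. Note that a plural whose stripped form never occurs as `NN` is also reported. This includes -es plurals such as "classes", whose stripped form "classe" is not a word. Add a check like `counts[(singular, 'NN')] > 0` if that is not what you want.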

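Let's do a bit of reverse engineering. The brown corpus is convenient because it comes already tokenized and tagged in NLTK, but if you want to use your own corpus, you have to consider the following:

  • Which corpus do you use? How do you tokenize it? How do you add POS tags?
  • What are you counting? Types or tokens?
  • How do you handle POS ambiguity? How do you tell nouns from non-nouns?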
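
The types-versus-tokens question in the second point is easy to see on a toy list. This is a small sketch with a made-up `words` list: tokens are individual occurrences, types are distinct word forms, and a `Counter` maps each type to its token count.

```python
from collections import Counter

# Hypothetical toy data standing in for a tokenized corpus.
words = ['the', 'dollar', 'and', 'the', 'dollars', 'and', 'the', 'need']

num_tokens = len(words)        # every occurrence counts
num_types = len(set(words))    # each distinct word form counts once
token_counts = Counter(words)  # number of tokens per type

print(num_tokens, num_types, token_counts['the'])
```

The exercise compares token counts (how often "dollars" occurs versus "dollar"), but the list of answers it asks for is a set of types.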

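Finally, think about this:

  • Is there really a way to find out whether a word is more common in the plural or the singular in the language as a whole? Or is the answer always relative to the corpus you choose to analyze?
  • Are there nouns for which no plural or no singular form exists? (The answer is most probably yes.)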

brw is an array of words.

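from collections import Counter

counter = Counter(brw)
plurals = []
for word in brw:
    # Treat every word that does not end in 's' as a candidate singular.
    if not word.endswith('s'):
        plural = counter[word + 's']
        singul = counter[word]
        # Keep the plural form if it occurs more often than the singular.
        if plural > singul:
            plurals.append(word + 's')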

plurals is the output array, containing only plurals (with duplicates, hmm). If I use set(), they won't be repeated. Is this right?
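
One possible way to avoid the duplicates without a separate set() pass is to loop over the distinct words in the counter rather than over every token. This is a sketch, assuming `brw` is a plain list of word strings (the short list here is made up so the snippet runs on its own). It starts from the plural forms, so a plural whose singular never occurs in the data is also reported, which the loop above would miss.

```python
from collections import Counter

# Hypothetical toy data standing in for brw.
brw = ['dollars', 'dollar', 'dollars', 'hit', 'hit', 'hits', 'years', 'is']

counter = Counter(brw)
plurals = sorted(
    word for word in counter                  # each distinct word once
    if len(word) > 1 and word.endswith('s')   # candidate -s plural
    and counter[word] > counter[word[:-1]]    # more common than word minus 's'
)
print(plurals)
```

On this toy list "is" slips through alongside "dollars" and "years", because plain strings carry no information about whether a word is a noun. That is the ambiguity the POS-tagged counts are meant to handle.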