Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error

I have code that performs POS tagging with NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]

I tried writing code that loops over each tagged token and lemmatizes it with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))

print(lemmatizedWords)

The error this produces:

Traceback (most recent call last):

  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]

AttributeError: 'tuple' object has no attribute 'endswith'

I think there are two problems here:

  1. The POS tags are not converted into tags that WordNet understands (I tried implementing something like this answer, wordnet lemmatization and pos tagging in python, with no success)
  2. The data structure is not in the right format for looping over each tuple (other than os-related code, I couldn't find much about this error)

How can I follow up POS tagging with lemmatization and avoid these errors?

The Python interpreter tells you exactly what the problem is:

AttributeError: 'tuple' object has no attribute 'endswith'

tokensPOS is a list of tuples, so you can't pass its elements directly to the lemmatize() method (see the code of the WordNetLemmatizer class). Only objects of type string have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))   

The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other NLTK corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). The full script, with the get_wordnet_pos() method from this answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # fall back to WordNet's default POS; returning '' would make lemmatize() raise a KeyError
        return wordnet.NOUN

string = 'dogs runs fast'

tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))

print(lemmatizedWords)