Convert Averaged Perceptron Tagger POS to WordNet POS and Avoid Tuple Error
I have code that performs POS tagging with NLTK's averaged perceptron tagger:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)
Result:
[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
I tried to write code that loops over each tagged token and lemmatizes it with the WordNet lemmatizer:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
print(lemmatizedWords)
Resulting error:
Traceback (most recent call last):
  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
I think there are two problems here:
- The POS tags are not being converted into tags that WordNet can understand (I tried to implement something like this answer, wordnet lemmatization and pos tagging in python, without success)
- The data structure is not in the right format for iterating over each tuple (I couldn't find much about this error outside of os-related code)
How can I follow up POS tagging with lemmatization and avoid these errors?
The Python interpreter is telling you exactly what is wrong:
AttributeError: 'tuple' object has no attribute 'endswith'
tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the source of the WordNetLemmatizer class). Only objects of type string have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
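To see concretely why the tuples fail, the root cause can be reproduced without NLTK at all: pos_tag yields (word, tag) pairs, and lemmatize() eventually applies string methods such as endswith() to whatever it receives. A tuple has no such method:

```python
# pos_tag returns (word, tag) tuples; lemmatize() expects a plain string.
# WordNet's suffix-stripping rules effectively end up doing this:
pair = ('dogs', 'NNS')
try:
    pair.endswith('s')
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'endswith'
```

This is why indexing with w[0] (the word itself) fixes the crash.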
The lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other nltk corpora, so you have to translate them manually (as in the link you provided) and pass the appropriate tag as the second argument to lemmatize(). The complete script, with the get_wordnet_pos() method taken from this answer:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
print(lemmatizedWords)
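As a variation (my own sketch, not part of the original answer), the if/elif chain can be written as a dict lookup. Defaulting to 'n' rather than '' also means lemmatize() always receives a valid POS, even for tags like DT that match no branch:

```python
# Dict-based variant of get_wordnet_pos (an assumption, not from the
# original answer). The string literals match NLTK's wordnet constants:
# wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n',
# wordnet.ADV == 'r'.
TAG_MAP = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos(treebank_tag):
    # Penn Treebank tags group by first letter (NN, NNS, NNP -> 'N');
    # fall back to noun, which is also lemmatize()'s own default.
    return TAG_MAP.get(treebank_tag[:1], 'n')

print(get_wordnet_pos('VBZ'))  # v
print(get_wordnet_pos('DT'))   # n (fallback)
```

The fallback to 'n' is a deliberate choice: returning an empty string, as the script above does, would make lemmatize() fail on any tag outside the four mapped groups.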