How to keep only the noun words in a wordlist? (Python, NLTK)
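I have a wordlist that contains many subjects. The subjects were extracted automatically from sentences. I want to keep only the nouns in each subject. As you can see, some of the subjects contain adjectives, which I want to remove.
from nltk.corpus import wordnet as wn

wordlist = ['country','all','middle','various drinks','few people','its reputation','German Embassy','many elections']
returnlist = []
for word in wordlist:
    # keep the entry if any of its WordNet synsets is a noun
    for syn in wn.synsets(word):
        if syn.pos() == 'n':
            returnlist.append(word)
            break
print(returnlist)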
The result of the above is:
['country','it', 'middle']
However, the result I want should look like this:
wordlist=['country','it', 'middle','drinks','people','reputation','German Embassy','elections']
How can I do this?
adjectives = ['many', 'any', 'few', 'some', 'various'] # ...
wordlist = ['country','all','middle','various drinks','few people','its reputation','German Embassy','many elections']
returnlist = []
for word in wordlist:
    # strip every listed adjective from the (lowercased) entry
    for adj in adjectives:
        word = word.lower().replace(adj, '').strip()
    returnlist.append(word)
print(returnlist)
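One caveat with `str.replace` is that it matches substrings, so a listed word is also removed when it occurs inside a longer word. A minimal sketch of a whole-word variant follows; the helper name `strip_words` and the extra input `'Germany'` are hypothetical, added only to illustrate the difference.

```python
def strip_words(phrases, unwanted):
    """Remove unwanted words from each phrase, comparing whole words only."""
    unwanted_set = {w.lower() for w in unwanted}
    cleaned = []
    for phrase in phrases:
        # split on whitespace so 'many' is never matched inside 'Germany'
        kept = [w for w in phrase.split() if w.lower() not in unwanted_set]
        cleaned.append(' '.join(kept))
    return cleaned


adjectives = ['many', 'any', 'few', 'some', 'various']
# 'Germany' is a hypothetical entry, not part of the original wordlist
print(strip_words(['various drinks', 'many elections', 'German Embassy', 'Germany'], adjectives))
# prints ['drinks', 'elections', 'German Embassy', 'Germany']
```

Unlike the loop above, this version leaves the original capitalization untouched, and an entry made up only of listed words becomes an empty string.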
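First, your list is the result of text that was not tokenized properly, so I tokenized it again.
Then I looked up the POS tag of every word
to find the nouns, i.e. the words whose tag contains NN: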
>>> import nltk
>>> text = ' '.join(wordlist).lower()
>>> tokens = nltk.word_tokenize(text)
>>> tags = nltk.pos_tag(tokens)
>>> nouns = [word for word, pos in tags if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
>>> nouns
['country', 'drinks', 'people', 'Embassy', 'elections']
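Joining the whole list into one string loses the boundaries between entries, so a phrase such as 'German Embassy' cannot come back as a single item. A possible variation, sketched below, tags each entry on its own and keeps only its noun tokens. The function name `keep_nouns` and the `toy_tagger` stand-in are assumptions made for illustration; the toy tags are assigned by hand and are not real tagger output.

```python
def keep_nouns(phrases, tagger):
    """Keep only noun tokens in each phrase; drop phrases with no nouns.

    `tagger` is any callable that maps a list of tokens to a list of
    (token, tag) pairs with Penn Treebank tags, for example nltk.pos_tag.
    """
    result = []
    for phrase in phrases:
        tagged = tagger(phrase.split())
        # the Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'
        nouns = [tok for tok, tag in tagged if tag.startswith('NN')]
        if nouns:
            result.append(' '.join(nouns))
    return result


# Hand-assigned tags for illustration only; NOT output from a real tagger.
TOY_TAGS = {
    'country': 'NN', 'all': 'DT', 'middle': 'NN',
    'various': 'JJ', 'drinks': 'NNS', 'few': 'JJ', 'people': 'NNS',
    'its': 'PRP$', 'reputation': 'NN',
    'German': 'NNP', 'Embassy': 'NNP',
    'many': 'JJ', 'elections': 'NNS',
}


def toy_tagger(tokens):
    """Stand-in tagger so the sketch runs without NLTK models installed."""
    return [(tok, TOY_TAGS[tok]) for tok in tokens]


wordlist = ['country', 'all', 'middle', 'various drinks', 'few people',
            'its reputation', 'German Embassy', 'many elections']
print(keep_nouns(wordlist, toy_tagger))
```

With NLTK and its tagger model installed, `keep_nouns(wordlist, nltk.pos_tag)` should work the same way, although the tags a real tagger assigns to short, context-free phrases may differ from the hand-written ones here.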