NLTK: find German nouns

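I want to extract all German nouns from a German text, in lemmatized form, using NLTK.

I also looked at spaCy, but I would prefer NLTK because, for English, it already gives me the performance I need and the data structure I want.

I have the following working code for English: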

import nltk
from nltk.stem import WordNetLemmatizer

#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'

text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'

tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']

print (tokens)

This prints the expected result: ['year', 'sender', 'key', 'recipient']

Now I tried to do the same for German:

import nltk
from nltk.stem import WordNetLemmatizer

germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'

tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']

print (tokens)

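This gives a wrong result: ['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']

Neither the lemmatization nor the noun extraction works.

What is the correct way to apply this code to a different language?

I also looked at other solutions, such as: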

from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()  # Snowball stemmer for German
tokenGer = [stemmer.stem(tok) for tok in tokens]  # stem() takes one word at a time

But that would mean starting over from scratch.

I found a way to do it with HanoverTagger:

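import nltk
from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(germanText, language='german')
print(tagger.tag_sent(words))
# tag_sent yields (word, lemma, tag) triples; keep the words tagged as common nouns
tokens = [word for (word, lemma, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']
print(tokens)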

I get the expected result: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
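
The list above holds the words as they appear in the text, not their lemmas. A minimal sketch of how the lemmas could be collected instead, assuming `tag_sent` returns `(word, lemma, tag)` triples at `taglevel=1` as the unpacking above suggests. The helper names `noun_lemmas` and `german_noun_lemmas` are my own and not part of HanTa or NLTK:

```python
def noun_lemmas(tagged, noun_tags=("NN",)):
    """Return the lemma of every token whose tag is in noun_tags.

    `tagged` is a list of (word, lemma, tag) triples, the shape assumed
    here for HanoverTagger.tag_sent(..., taglevel=1).
    """
    return [lemma for (_word, lemma, tag) in tagged if tag in noun_tags]


def german_noun_lemmas(text, model="morphmodel_ger.pgz"):
    """Tokenize German text, tag it with HanTa, and return the noun lemmas."""
    # Imported lazily so noun_lemmas() stays usable without HanTa installed.
    import nltk
    from HanTa import HanoverTagger as ht

    tagger = ht.HanoverTagger(model)
    words = nltk.word_tokenize(text, language="german")
    return noun_lemmas(tagger.tag_sent(words, taglevel=1))
```

Calling `german_noun_lemmas(germanText)` should then give the base form of each noun rather than the inflected one. Passing `noun_tags=("NN", "NE")` would also keep proper nouns, assuming the model uses the STTS tag `NE` for them.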