NLTK: find German nouns
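I want to use NLTK to extract all nouns from a German text, in lemmatized form.
I also looked at spaCy, but I would prefer NLTK because, for English, it already gives me the performance and the data structures I need.
I have the following working code for English: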
import nltk
from nltk.stem import WordNetLemmatizer
#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
This prints the expected result:
['year', 'sender', 'key', 'recipient']
Now I tried to do the same for German:
import nltk
from nltk.stem import WordNetLemmatizer
germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'
tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']
print (tokens)
I get a wrong result:
['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']
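Neither the lemmatization nor the noun extraction works.
What is the correct way to apply this code to a different language?
I also looked at other solutions, such as: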
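from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()  # the class is already German-specific, no language argument needed
tokenGer = [stemmer.stem(tok) for tok in tokens]  # stem() takes a single word, not a list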
But that would mean starting over from scratch.
I found a way to do it using HanoverTagger:
import nltk
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
words = nltk.word_tokenize(germanText, language='german')  # tokenize the German sentence, not the English one
print(tagger.tag_sent(words))
tokens = [word for (word, lemma, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']
This gives the expected result: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
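Since the original goal was nouns in lemmatized form, it may be worth keeping the second element of each triple instead of the surface word. The sketch below assumes that `tag_sent` yields `(word, lemma, pos)` triples, as the snippet above suggests. The helper name `noun_lemmas` and the sample triples are hypothetical and hand-written for illustration; they are not real tagger output, and the non-noun tags are placeholders.

```python
# Tags treated as nouns. 'NN' is the tag used in the snippet above;
# add further tags (e.g. for proper nouns) if your tagger emits them.
NOUN_TAGS = {"NN"}


def noun_lemmas(tagged, noun_tags=NOUN_TAGS):
    """Return the lemma of every (word, lemma, pos) triple whose tag is a noun tag."""
    return [lemma for (_word, lemma, pos) in tagged if pos in noun_tags]


# Hand-written triples for illustration only (NOT real HanTa output).
sample = [
    ("der", "der", "ART"),
    ("Sender", "Sender", "NN"),
    ("mit", "mit", "APPR"),
    ("den", "der", "ART"),
    ("Schlüsseln", "Schlüssel", "NN"),  # inflected form, lemma differs
]

print(noun_lemmas(sample))  # ['Sender', 'Schlüssel']
```

With a real tagger this would presumably be called as `noun_lemmas(tagger.tag_sent(words, taglevel=1))`. Note that the tokens are not lowercased beforehand, since capitalization carries information in German.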