Stemming in Python

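I want to stem the text I am reading from a CSV file, but the text does not change after the stemming step. I then read somewhere that I need to use POS tags in order to stem, but that did not help either.

Can you tell me what I am doing wrong? I read the CSV, remove punctuation, tokenize, get the POS tags, and try to stem, but nothing changes.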

import csv
import string

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

stemmer = nltk.PorterStemmer()
data = pd.read_csv('data.csv', sep=';')

# Translation table that deletes every punctuation character
translator = str.maketrans('', '', string.punctuation)

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)

    for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)

Sample of the data to stem:

The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.

Thanks in advance for your help!

Your code should write the final variable to get the desired output, not tokens_pos :)

Try the following:

import string
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer


def preprocess(sentence):
    stemmer = PorterStemmer()
    # Python 3: build the punctuation-deleting table with str.maketrans
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    cleaned = cleaned.lower()
    tokens = word_tokenize(cleaned)
    final = [stemmer.stem(token) for token in tokens]
    return " ".join(final)

sentence = "We've got some long-term challenges in this economy."
print("Original: " + sentence)

stemmed = preprocess(sentence)
print("Processed: " + stemmed)

Output:

Original: We've got some long-term challenges in this economy.
Processed: weve got some longterm challeng in thi economi
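
Note that "weve" and "longterm" in the processed line come from deleting the punctuation outright, which glues the pieces of "We've" and "long-term" together. If that is not what you want, one possible alternative (my own suggestion, not something the question asked for) is to map each punctuation character to a space instead. A small sketch using only the standard library:

```python
import string

sentence = "We've got some long-term challenges in this economy."

# Delete punctuation outright, as in the code above.
delete_table = str.maketrans('', '', string.punctuation)
deleted = sentence.translate(delete_table)

# Alternative (an assumption about what you might prefer):
# replace every punctuation character with a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = sentence.translate(space_table)

print(deleted.split())
print(spaced.split())
```

With deletion you get single tokens such as "Weve" and "longterm"; with the space mapping they are split into "We", "ve" and "long", "term". Which one is better depends on what you do with the stems afterwards.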

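Hope this helps!

You should try debugging your code. If (after the necessary imports) you had just tried print(stemmer.stem("challenges")), you would have seen that stemming does work (the above prints "challeng"). Your problem is a small oversight: you collect the stems in final, but write tokens_pos. So the "solution" is this: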

writer.writerow(final)
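
For reference, here is a minimal sketch of the whole loop with that one-line change applied. To keep it self-contained I made a few assumptions that are not in the question: it writes to an in-memory io.StringIO buffer instead of output.csv, takes a plain list of sentences instead of the pandas column, and uses str.split() in place of word_tokenize so it runs without the NLTK tokenizer data. The helper name stem_rows is hypothetical.

```python
import csv
import io
import string

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
translator = str.maketrans('', '', string.punctuation)


def stem_rows(sentences, out):
    """Write one row of stemmed tokens per sentence to the file-like object out."""
    writer = csv.writer(out, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)
    for line in sentences:
        # str.split() stands in for word_tokenize (assumption, see above).
        tokens = line.translate(translator).split()
        final = [stemmer.stem(token) for token in tokens]
        writer.writerow(final)  # write the stems, not tokens_pos


# The isolated check suggested above: stem a single word first.
print(stemmer.stem("challenges"))

buffer = io.StringIO()
stem_rows(["We've got some long-term challenges in this economy."], buffer)
print(buffer.getvalue())
```

The POS tagging step is left out here because the Porter stemmer only looks at the word string itself, so the tags never influenced the result in the original loop.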