Stemming in Python
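I want to stem the text I am reading from a CSV file, but after applying the stemmer the text is unchanged. I then read somewhere that I need POS tags in order to stem, but that didn't help either.
Can you tell me what I am doing wrong? I read the CSV, remove punctuation, tokenize, get the POS tags, and try to stem, but nothing changes.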
import csv
import string

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

stemmer = nltk.PorterStemmer()
data = pd.read_csv(open('data.csv'), sep=';')
# Translation table that deletes every punctuation character
translator = str.maketrans('', '', string.punctuation)
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)
    for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)
Sample data to stem:
The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.
Thanks in advance for your help!
To get the desired output, your code should write the final variable instead of tokens_pos :)
Try the following:
import string
import nltk
from nltk.tokenize import word_tokenize

def preprocess(sentence):
    stemmer = nltk.PorterStemmer()
    # Strip punctuation, then lowercase before tokenizing
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    cleaned = cleaned.lower()
    tokens = word_tokenize(cleaned)
    final = [stemmer.stem(token) for token in tokens]
    return " ".join(final)

sentence = "We've got some long-term challenges in this economy."
print("Original: " + sentence)
stemmed = preprocess(sentence)
print("Processed: " + stemmed)
Output:
Original: We've got some long-term challenges in this economy.
Processed: weve got some longterm challeng in thi economi
Hope this helps!
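You should try debugging your code. If (after the necessary imports) you had simply tried print(stemmer.stem("challenges")), you would have seen that stemming does work (the above prints "challeng"). Your problem is a small oversight: you collect the stems in final, but write tokens_pos. So the "solution" is this: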
writer.writerow(final)
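For anyone who wants to check the fix without the original data.csv, here is a minimal sketch of the corrected loop. It assumes a few things that are not in the question: the helper name stem_rows is made up for illustration, the rows go to an in-memory buffer instead of output.csv, and plain whitespace splitting stands in for word_tokenize so that no tokenizer data needs to be downloaded.

```python
import csv
import io

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_rows(sentences, out_file):
    """Write one CSV row of stems per input sentence (hypothetical helper)."""
    writer = csv.writer(out_file, delimiter=';', quotechar='^',
                        quoting=csv.QUOTE_MINIMAL)
    for sentence in sentences:
        # Assumption: whitespace splitting replaces word_tokenize here.
        tokens = sentence.split()
        final = [stemmer.stem(token) for token in tokens]
        # Write the stems, not the POS-tagged tokens.
        writer.writerow(final)


# In-memory stand-in for output.csv
buffer = io.StringIO()
stem_rows(["long term challenges in this economy"], buffer)
print(buffer.getvalue())
```

The POS-tagging step is left out on purpose: PorterStemmer.stem only takes the word itself, so the tags never influenced the stems in the original loop.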