Removing stop words changed negative reviews into positive ones. What is a good way to remove stop words in a text summarization process?
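I am trying to remove (English) stop words from two columns of a DataFrame. See the screenshot. However, I found that the meaning of the review changes after this process is applied. For example, "not recommended" becomes "recommended". What is the best way to remove stop words while keeping the idea of the original text intact? Here is my code and the result: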
from nltk import word_tokenize
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

# Drop every token that appears in NLTK's English stop-word list
df['Text_after_removed_stopwords'] = df['Text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop]))
print()
print('###Text after removed stopwords###' + '\n' + df['Text_after_removed_stopwords'][1])
print()
print('###Text before removed stopwords###' + '\n' + df['Text'][1])
print()
df['Summary_after_removed_stopwords'] = df['Summary'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop]))
print('###Summary after removed stopwords###' + '\n' + df['Summary_after_removed_stopwords'][1])
print()
print('###Summary before removed stopwords###' + '\n' + df['Summary'][1])
###Text after removed stopwords###
product arrived labeled jumbo salted peanutsthe peanuts actually small sized unsalted sure error vendor intended represent product jumbo
###Text before removed stopwords###
product arrived labeled as jumbo salted peanutsthe peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo
###Summary after removed stopwords###
advertised
###Summary before removed stopwords###
not as advertised
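Removing words from the text inherently changes the vector representation, which I assume your summarization application is using. The best approach is to create your own custom stop-word list. Also keep in mind that for some texts the change in meaning is undesirable, but those may be outliers!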
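One way to build such a custom list is to start from an existing stop-word set and take the negation words back out of it. The sketch below is only an illustration under a few assumptions: `BASE_STOP_WORDS` is a small hand-written stand-in for the full list (in the question's setup it would be `set(stopwords.words('english'))`, which includes words such as "not" and "no"), and `NEGATIONS` and `remove_stop_words` are hypothetical names chosen here, not part of NLTK.

```python
# Illustrative stand-in for a full stop-word list. With NLTK this could be
# set(stopwords.words('english')) instead (assumption: NLTK data is installed).
BASE_STOP_WORDS = {
    "a", "an", "the", "as", "if", "or", "to", "this",
    "was", "were", "not", "no", "nor",
}

# Hypothetical set of negation words that should survive filtering,
# because dropping them can flip the meaning of a review.
NEGATIONS = {"not", "no", "nor", "never"}

# Custom list: everything in the base list except the negations.
CUSTOM_STOP_WORDS = BASE_STOP_WORDS - NEGATIONS


def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return text with whitespace-separated tokens in stop_words removed."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


summary = "not as advertised"
print(remove_stop_words(summary, BASE_STOP_WORDS))  # advertised
print(remove_stop_words(summary))                   # not advertised
```

With the DataFrame from the question, the same function could be applied as `df['Summary'].apply(remove_stop_words)`. Which words belong in the negation set depends on the data, so it is worth checking a sample of reviews before and after filtering.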