Unable to remove english stopwords from a dataframe
I have been trying to do sentiment analysis on a movie-review dataset, but I am unable to remove the English stopwords from the data. What am I doing wrong?
from nltk.corpus import stopwords
stop = stopwords.words("English")
list_ = []
for file_ in dataset:
    dataset['Content'] = dataset['Content'].apply(lambda x: [item for item in x.split(',') if item not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
Well, judging from your comments, I don't think you need to loop over dataset at all (perhaps dataset contains only a single column named Content). You can simply do:
dataset["Content"] = dataset["Content"].str.split(",").apply(lambda x: [item for item in x if item not in stop])
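A minimal runnable sketch of that one-liner, with a hard-coded stop list standing in for the NLTK download and sample rows assumed:

```python
import pandas as pd

# Stand-in stop list; in the real code this would come from
# nltk.corpus.stopwords.words("english")
stop = {"i", "am", "the"}

dataset = pd.DataFrame({"Content": ["i,am,the,computer,machine", "i,play,game"]})

# Split each row on commas, then keep only the non-stopword tokens
dataset["Content"] = (
    dataset["Content"]
    .str.split(",")
    .apply(lambda words: [w for w in words if w not in stop])
)

print(dataset["Content"].tolist())
# → [['computer', 'machine'], ['play', 'game']]
```

Note that this assumes the tokens carry no surrounding spaces; if they do, add a .strip() inside the comprehension.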
You are iterating over the dataset, but each time you append the whole frame without ever using file_. Try:
from nltk.corpus import stopwords
stop = stopwords.words("english")
dataset['Cleaned'] = dataset['Content'].apply(lambda x: ','.join([item for item in x.split(',') if item not in stop]))
If you drop the ','.join so each row keeps its list of words, that returns a Series of word lists; to flatten it into a single list:
flat_list = [item for sublist in list(dataset['Cleaned'].values) for item in sublist]
Hat tip to Making a flat list out of list of lists in Python.
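The flattening idiom from that linked question can also be written with the standard library's itertools.chain; a small self-contained sketch (the cleaned per-row lists here are assumed sample output of the stopword filter):

```python
from itertools import chain

# Per-row token lists, as produced by the stopword-filtering step
cleaned = [["computer", "machine"], ["play", "game"]]

# Nested list comprehension, as in the linked question
flat_list = [word for row in cleaned for word in row]

# Equivalent flattening via itertools.chain
flat_chain = list(chain.from_iterable(cleaned))

print(flat_list)
# → ['computer', 'machine', 'play', 'game']
```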
Try earthy:
>>> from earthy.wordlist import punctuations, stopwords
>>> from earthy.preprocessing import remove_stopwords
>>> result = dataset['Content'].apply(remove_stopwords)
See https://github.com/alvations/earthy/blob/master/FAQ.md#what-else-can-earthy-do
I think the code below should handle the information given so far. The assumption I made is that the data carries extra spaces around the comma separators. Below is a test run (hope it helps!):
import pandas as pd
from nltk.corpus import stopwords

stop = stopwords.words('english')

# Build the sample frame in one go (DataFrame.append was removed in pandas 2.0)
dataset = pd.DataFrame([{'Content': 'i, am, the, computer, machine'},
                        {'Content': 'i, play, game'}])
print(dataset)

list_ = []
for file_ in dataset:
    # strip() removes the extra spaces before the stopword lookup
    dataset['Content'] = dataset['Content'].apply(
        lambda x: [item.strip() for item in x.split(',') if item.strip() not in stop])
    list_.append(dataset)
dataset = pd.concat(list_, ignore_index=True)
print(dataset)
Input:
Content
0 i, am, the, computer, machine
1 i, play, game
Output:
Content
0 [computer, machine]
1 [play, game]
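For comparison, the outer loop can be dropped entirely; a hedged sketch of the same cleanup in one vectorized pass, using a set for fast membership tests (the stop list is hard-coded here in place of the NLTK corpus):

```python
import pandas as pd

# Hard-coded stand-in for nltk.corpus.stopwords.words('english')
stop = {"i", "am", "the"}

dataset = pd.DataFrame({"Content": ["i, am, the, computer, machine",
                                    "i, play, game"]})

# One pass: split on commas, strip the extra spaces, drop stopwords
dataset["Content"] = dataset["Content"].apply(
    lambda x: [w.strip() for w in x.split(",") if w.strip() not in stop]
)

print(dataset["Content"].tolist())
# → [['computer', 'machine'], ['play', 'game']]
```

Using a set instead of the list returned by stopwords.words() turns each membership test from O(n) into O(1), which matters on large review corpora.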