Python: removing stop words from a pandas DataFrame gives wrong output

Question

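I am removing stop words from multiple files. First, I read each file and remove the stop words from its DataFrame. Then I concatenate that DataFrame with the next one. When I print the DataFrame, I get output like this: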

0      [I,  ,  ,  ,  , r, e,  , h,  , h,  , h, v, e, ...      
1      [D,  , u,  , e, v, e, n,  , e,  , h, e,  , u, ...     
2      [R, g, h,  , f, r,  , h, e,  , e, c, r,  , w, ...     
3      [A, f, e, r,  , c, l, l, n, g,  , n,  , p, l, ...     
4      [T, h, e, r, e,  , v, e, r, e, e, n,  ,  , n, ...

Here is my code:

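import glob

import pandas as pd
from nltk.corpus import stopwords

# ROOT_DIR and DATASET are assumed to be defined elsewhere
allFiles = glob.glob(ROOT_DIR + '/' + DATASET + "/*.csv")
frame = pd.DataFrame()
list_ = []
stop = stopwords.words('english')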
for file_ in allFiles:
    chunkDataframe = pd.read_csv(file_,index_col=None, header=0, chunksize=1000)
    dataframe = pd.concat(chunkDataframe, ignore_index=True)
    dataframe['Text'] = dataframe['Text'].apply(lambda x: [item for item in x if item not in stop])
    print dataframe
    list_.append(dataframe)
frame = pd.concat(list_)

Please help me optimize the way I read multiple files and remove the stop words.

Answer 1

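Each value in dataframe['Text'] is a single string, not a list of words. So when you iterate over it with lambda x: [item for item in x if item not in stop], you iterate character by character and get a list of characters as the result. To iterate word by word, change it to: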

lambda x: [item for item in x.split() if item not in stop]
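
To see the difference in isolation, here is a minimal sketch that does not need pandas or NLTK. The STOP set is a small hypothetical stand-in for stopwords.words('english'), and remove_stopwords is an illustrative helper name, not something from the question. Lowercasing each word before the lookup is an extra assumption on my part; drop it if you want an exact, case-sensitive match.

```python
# Hypothetical stand-in for nltk's stopwords.words('english').
# A set gives constant-time membership tests.
STOP = {"i", "the", "a", "to", "have", "of"}


def remove_stopwords(text, stop=STOP):
    """Return the words of `text` that are not stop words (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop]


sample = "I have to read the file"

# Iterating over the string itself yields single characters,
# which reproduces the behaviour described in the question.
by_char = [item for item in sample if item not in STOP]

# Splitting on whitespace first yields whole words.
by_word = remove_stopwords(sample)

print(by_char)
print(by_word)
```

Note that the character-by-character version also drops any letter that happens to be a one-letter stop word (here "a"), which is probably why the output in the question has gaps. With a DataFrame, the same helper could be applied as dataframe['Text'] = dataframe['Text'].apply(remove_stopwords).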

python

glob

python-2.7

pandas