Stopword removal with pandas

Question

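I want to remove stop words from a column of my data frame. The column contains text that needs to be split into words.

For example, my data frame looks like this: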

ID   Text
1    eat launch with me
2    go outside have fun

I want to apply stop word removal to the Text column, so the text has to be split into words first.

I tried this:

for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')

So my output should look like this:

ID   Text
1    eat launch 
2    go fun

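That is, the stop words have been removed. But my code does not work correctly. I also tried the other way around, converting my data frame to a Series and looping over it, but that did not work either.

Thanks for your help.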

Answer 1

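replace (by itself) is not a good fit here, because by default it matches whole cell values and you want to perform partial string replacement. What you want is regex-based replacement.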
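To illustrate the difference, here is a minimal sketch comparing the two calls on a made-up Series (the sample values are assumptions chosen for illustration, not data from the question):

```python
import pandas as pd

# Illustrative values only; not taken from the asker's real data.
s = pd.Series(["eat launch with me", "with"])

# Series.replace compares against the whole cell value by default,
# so only the cell that is exactly "with" changes.
whole = s.replace("with", "")

# Series.str.replace works on substrings inside each cell.
partial = s.str.replace("with", "", regex=False)

print(whole.tolist())
print(partial.tolist())
```

The first call leaves the sentence untouched, which is why the loop in the question appears to do nothing.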

When the number of stop words is manageable, one simple solution is to use str.replace.

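import re
# \b keeps matches on whole words; re.escape guards regex special characters
p = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)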

df
   ID               Text
0   1       eat launch  
1   2   outside have fun
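The regex approach leaves stray spaces where the stop words used to be. A self-contained sketch of one way to tidy that up is shown here; the stop word list is hypothetical and was picked only so that the result matches the output requested in the question.

```python
import re
import pandas as pd

# Hypothetical stop word list, chosen only to reproduce the question's example.
cached_stop_words = ["with", "me", "have", "outside"]

df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})

# \b anchors each alternative to whole words; re.escape protects special characters.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, cached_stop_words)))

df["Text"] = (
    df["Text"]
    .str.lower()
    .str.replace(pattern, "", regex=True)   # drop the stop words
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps they leave behind
    .str.strip()                            # trim leading/trailing spaces
)

print(df)
```

Passing regex=True explicitly avoids relying on the default, which has differed between pandas versions.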

If performance matters, use a list comprehension.

cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words]) 
    for x in df['Text'].tolist()]

df
   ID              Text
0   1        eat launch
1   2  outside have fun
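For reuse across several columns or files, the same idea can be wrapped in a small helper. This is only a sketch: the function name remove_stop_words and the stop word set are assumptions made for illustration, and tokens are split on whitespace only, so punctuation attached to a word is not handled.

```python
import pandas as pd

# Hypothetical stop word set; a set gives constant-time membership checks.
stop_words = {"with", "me", "have", "outside"}


def remove_stop_words(text, stop_words):
    """Lower-case text and drop every whitespace-separated token in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)


df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})

# Plain Python loop over the column values, as in the list comprehension above.
df["Text"] = [remove_stop_words(x, stop_words) for x in df["Text"].tolist()]

print(df)
```

Because split() and join() normalise whitespace, this variant leaves no trailing or doubled spaces.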
