Stopwords on a DataFrame Column

Question

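I am cleaning an Excel file so that it can be displayed in Power BI. I want to remove the stopwords from a specific column. This is the code I am using, but something seems to be wrong with it. The stopwords I need to remove are Spanish.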

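Also, I am replacing . and , with spaces so that I can split the column and analyze the information. If you know a simpler way, please let me know.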

import nltk
from nltk.corpus import stopwords
stop = stopwords.words('spanish')
df['Producto'] = df['Producto'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

df["Producto"] = df["Producto"].str.replace(",","")
df["Producto"] = df["Producto"].str.replace(".","")

df = df["Producto"].str.split(" ", expand = True)
print (df)

Answer 1

Here is a quick way to do it. I recreated a DataFrame with some sample data:

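import re
import nltk  # the corpus must be available: nltk.download('stopwords')
import pandas as pd
from nltk.corpus import stopwords

# One regex that matches any Spanish stopword as a whole word, plus trailing whitespace
pattern = re.compile(r'\b(' + r'|'.join(map(re.escape, stopwords.words('spanish'))) + r')\b\s*')
df_temp = pd.DataFrame({'Words': ["Uno", "Dos", "Tres", "Other", "los"]})
# Delete every match from each cell
df_temp['Words'] = df_temp['Words'].map(lambda x: pattern.sub('', str(x)))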

Output of df_temp:

Words
0   Uno
1   Dos
2   Tres
3   Other
4
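
Note that the pattern above is case-sensitive: "los" is removed, but "Los" would be left in place. For the other part of the question (replacing . and , and then splitting), here is a minimal sketch of one possible way to do everything in a single pass, ignoring case. The `clean_text` helper, the small `STOP_WORDS` set, and the sample rows are all made up for illustration; with NLTK installed, `STOP_WORDS` would presumably be replaced by `set(stopwords.words('spanish'))`.

```python
import re
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('spanish')), so the sketch
# runs without downloading the NLTK corpus.
STOP_WORDS = {"de", "la", "el", "los", "las", "con", "para", "y", "en"}

def clean_text(text, stop_words=STOP_WORDS):
    """Replace '.' and ',' with spaces, then drop stopwords, ignoring case."""
    # A space (not an empty string) keeps "arroz,blanco" as two words.
    text = re.sub(r"[.,]", " ", str(text))
    # split() with no argument collapses runs of whitespace.
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)

# Made-up sample data, for illustration only.
df = pd.DataFrame({"Producto": ["Leche de vaca, entera.",
                                "Pan con semillas",
                                "Los tomates"]})
df["Producto"] = df["Producto"].map(clean_text)

# One token per column; shorter rows are padded with missing values.
tokens = df["Producto"].str.split(expand=True)
print(tokens)
```

Keeping the split result in a separate variable (`tokens` here) rather than assigning it back to `df` leaves the other columns of the original DataFrame intact.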

python

stop-words

pandas