NLP - Removing Stop Words and Counting Word Frequency

I currently have a working script that does a simple word-frequency count on the `conversation_message__body` column of data from our database. Below is a working code sample and its output (image).

import pandas as pd
import numpy as np

x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)

y.reset_index(level=0,inplace=True)

print(y)
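For reference, this is how the baseline count behaves on a toy column (the data below is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the database column.
df = pd.DataFrame({'conversation_message__body': [
    "hi im Jon",
    "hi there",
]})

# Split each message into words, stack them into one long
# Series, and count how often each word occurs.
x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)
y.reset_index(level=0, inplace=True)
print(y)  # 'hi' counted twice, every other word once
```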

The problem is that I want to exclude a lot of words from this analysis. As I understand it, this is a common task in NLP, so I changed my script as follows:

# Import stopwords with nltk.
from nltk.corpus import stopwords
import pandas as pd
import numpy as np

stop = stopwords.words('english')
newStopWords = ['hello','hi','hey','im','get']
stop.extend(newStopWords)

df['conversation_message__body'] = df.conversation_message__body.str.replace(r"[^\w\s]", "", regex=True).str.lower()

df['conversation_message__body'] = df['conversation_message__body'].apply(lambda x: [item for item in x.split() if item not in stop])

x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)

y.reset_index(level=0,inplace=True)

print(y)

It runs for me but returns no results. Even when I try print(x) to see what the initial transformation looks like, all I get back is > Series([], dtype: int64)

I'm pretty sure I'm missing something basic here, but I've been at this for a while with no luck. Can anyone point me in the right direction?

Your column needs to contain strings, not lists of words.
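To see why your version comes back empty: the apply left a list of words in each row, and pandas' .str methods fall back to NaN for non-string elements, so stack() drops every row. A minimal reproduction (hypothetical data):

```python
import pandas as pd

# The column already turned into lists of words, as in the question.
df = pd.DataFrame({'conv': [['jon', 'meeting'], ['possible']]})

# .str.split only works on strings; on lists it yields NaN,
# stack() then drops all the NaNs, leaving an empty Series.
x = df['conv'].str.split(expand=True).stack().value_counts()
print(len(x))  # 0
```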

A small example:

df = pd.DataFrame({ 'conv': 
                   ["hi im Jon. I am reaching out to schedule a meeting on Monday.", "That wouldn't be possible as I am out."]})

First strip punctuation and lowercase:

df['conv'] = df['conv'].str.replace(r"[^\w\s]", "", regex=True).str.lower()

Your code leaves a list of words in conv, but the later .str.split needs a string, so join the filtered words back together:

df['conv'] = df['conv'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))

After these two steps the data looks like:

    conv
0   jon reaching schedule meeting monday
1   wouldnt possible

Then count:

df['conv'].str.split(expand=True).stack().value_counts()

Output:

wouldnt     1
jon         1
possible    1
meeting     1
reaching    1
monday      1
schedule    1
dtype: int64
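As a side note, once conv holds cleaned strings, the same count can be done without the split/stack dance, e.g. with collections.Counter (a sketch assuming the already-cleaned column from above):

```python
from collections import Counter

import pandas as pd

# The cleaned column from the example above.
df = pd.DataFrame({'conv': [
    "jon reaching schedule meeting monday",
    "wouldnt possible",
]})

# One pass over all rows, counting every word.
counts = Counter(word for row in df['conv'] for word in row.split())
print(counts.most_common(3))
```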