NLP - Removing Stop Words and Counting Word Frequency

I currently have a working script that does a simple word-frequency count on the `conversation_message__body` column of data from our database. Below is a working code sample and its output (image).

import pandas as pd
import numpy as np

x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)

y.reset_index(level=0,inplace=True)

print(y)
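For reference, this is how the baseline count behaves on a toy column (the data below is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the database column.
df = pd.DataFrame({'conversation_message__body': [
    "hi im Jon",
    "hi there",
]})

# Split each message into words, stack them into one long
# Series, and count how often each word occurs.
x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)
y.reset_index(level=0, inplace=True)
print(y)  # 'hi' counted twice, every other word once
```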

The problem is that I want to exclude a lot of words from this analysis. As I understand it, this is a common task in NLP, so I changed my script as follows:

# Import stopwords with nltk.
from nltk.corpus import stopwords
import pandas as pd
import numpy as np

stop = stopwords.words('english')
newStopWords = ['hello','hi','hey','im','get']
stop.extend(newStopWords)

df['conversation_message__body'] = df.conversation_message__body.str.replace(r"[^\w\s]", "", regex=True).str.lower()

df['conversation_message__body'] = df['conversation_message__body'].apply(lambda x: [item for item in x.split() if item not in stop])

x = df.conversation_message__body.str.split(expand=True).stack().value_counts()

y = pd.DataFrame(data=x)

y.reset_index(level=0,inplace=True)

print(y)

It runs for me but returns no results. Even when I try print(x) to see what the initial transformation looks like, all I get back is > Series([], dtype: int64)

I'm pretty sure I'm missing something basic here, but I've been at this for a while with no luck. Can anyone point me in the right direction?

Your column needs to contain strings, not lists of words.
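To see why your version comes back empty: the apply left a list of words in each row, and pandas' .str methods fall back to NaN for non-string elements, so stack() drops every row. A minimal reproduction (hypothetical data):

```python
import pandas as pd

# The column already turned into lists of words, as in the question.
df = pd.DataFrame({'conv': [['jon', 'meeting'], ['possible']]})

# .str.split only works on strings; on lists it yields NaN,
# stack() then drops all the NaNs, leaving an empty Series.
x = df['conv'].str.split(expand=True).stack().value_counts()
print(len(x))  # 0
```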

A small example:

df = pd.DataFrame({ 'conv': 
                   ["hi im Jon. I am reaching out to schedule a meeting on Monday.", "That wouldn't be possible as I am out."]})

First strip punctuation and lowercase:

df['conv'] = df['conv'].str.replace(r"[^\w\s]", "", regex=True).str.lower()

Your code leaves a list of words in conv, but the later .str.split needs a string, so join the filtered words back together:

df['conv'] = df['conv'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))

After these two steps the data looks like:

    conv
0   jon reaching schedule meeting monday
1   wouldnt possible

Then count:

df['conv'].str.split(expand=True).stack().value_counts()

Output:

wouldnt     1
jon         1
possible    1
meeting     1
reaching    1
monday      1
schedule    1
dtype: int64
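As a side note, once conv holds cleaned strings, the same count can be done without the split/stack dance, e.g. with collections.Counter (a sketch assuming the already-cleaned column from above):

```python
from collections import Counter

import pandas as pd

# The cleaned column from the example above.
df = pd.DataFrame({'conv': [
    "jon reaching schedule meeting monday",
    "wouldnt possible",
]})

# One pass over all rows, counting every word.
counts = Counter(word for row in df['conv'] for word in row.split())
print(counts.most_common(3))
```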