Pandas

Question

I am trying to count the most frequent words in my df, grouped by the value of another column:

I have a data frame like this:

import pandas as pd

df = pd.DataFrame({'Category':['Red','Red','Blue','Yellow','Blue'],'Text':['this is very good ','good','dont like','stop','dont like']})

This is how I count keywords in the Text column:

from collections import Counter

import nltk

top_N = 100

# Requires the NLTK stopword corpus (nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|'-->' ' and drop all stopwords
words = (df.Text
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate DF out of Counter
df_top_words = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(df_top_words)

This produces the following result:

However, this only generates a list of all the words in the data frame, whereas what I am after is something like this:

Answer 1

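Your words statement finds the words you care about (with stopwords removed) across the text of the whole column. We can change it slightly so that the replacement is applied to each row instead: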

df["Text"] = (
    df["Text"]
    .str.lower()
    .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
    .str.strip()
    # .str.cat(sep=' ')
    .str.split()  # Previously .split()
)

This results in:

  Category          Text
0      Red        [good]
1      Red        [good]
2     Blue  [dont, like]
3   Yellow        [stop]
4     Blue  [dont, like]

Now we can use .explode to expand each list element into its own row, and then .groupby and .size to count how many times each word appears within each Category:

df.explode("Text").groupby(["Category", "Text"]).size()

This results in:

Category  Text
Blue      dont    2
          like    2
Red       good    2
Yellow    stop    1
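
To get from these counts to the top words per category, the steps above can be combined into one sketch. It is only an illustration under stated assumptions: a tiny hand-written stopword list stands in for the NLTK list so the code runs without downloading any corpus, and top_n and the Frequency column name are chosen here for the example.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Category': ['Red', 'Red', 'Blue', 'Yellow', 'Blue'],
    'Text': ['this is very good ', 'good', 'dont like', 'stop', 'dont like'],
})

# Assumption: a small hand-written list replaces
# nltk.corpus.stopwords.words('english') so no NLTK data is needed.
stopwords = ['this', 'is', 'very']
RE_stopwords = r'\b(?:{})\b'.format('|'.join(map(re.escape, stopwords)))

top_n = 2  # hypothetical cutoff: keep at most this many words per category

# Tokenize each row, then give every token its own row.
tokens = (
    df.assign(
        Text=df['Text']
        .str.lower()
        .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
        .str.split()
    )
    .explode('Text')
    .dropna(subset=['Text'])  # rows that contained only stopwords become NaN
)

# Count each word per category, most frequent first within a category.
counts = (
    tokens.groupby(['Category', 'Text'])
    .size()
    .reset_index(name='Frequency')
    .sort_values(['Category', 'Frequency'], ascending=[True, False])
)

# Keep the first top_n rows of every category.
top_words = counts.groupby('Category').head(top_n)
print(top_words)
```

Using df.assign keeps the original Text column untouched, which may be preferable to overwriting it when the raw text is still needed later.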

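Note that this does not match your example output, because that example did not apply the .replace step from the original words statement (which is now used to compute the "Text" column). If you want that result, you only need to comment out the .replace line (although I assume that step is the point of the question).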
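
As a rough sketch of that variant, the same pipeline without the .replace step would look something like the following. The name counts_all is made up for this example, and the data frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Red', 'Red', 'Blue', 'Yellow', 'Blue'],
    'Text': ['this is very good ', 'good', 'dont like', 'stop', 'dont like'],
})

# No stopword removal: lowercase, split on whitespace, one row per word.
counts_all = (
    df.assign(Text=df['Text'].str.lower().str.split())
    .explode('Text')
    .groupby(['Category', 'Text'])
    .size()
)
print(counts_all)
```

With stopwords kept, words such as "this", "is" and "very" are counted under Red alongside "good".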

Pandas - Keyword count by Category

python

nltk

sentiment-analysis