How to count word occurrences in a text column against values in another column?

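I have a DataFrame with a text column, a column holding a score for each text ('score'), and a column ('score_label') that assigns the label 'b' or 'h' based on that score. The texts are sentences that may or may not contain the key terms I am looking for. I want to count the rows of the 'text' column that contain any term from my list of key terms, separately for each value in 'score_label', i.e. for 'b' and 'h'. I am trying to modify the following code so that it gives the value counts per score_label: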
df['text'].str.lower().str.contains('key').value_counts(normalize=True)

Here is a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [10, 46, 75, 12, 99, 84],
                   'text': ['John passed the course',
                            'The highest score was Annas',
                            '',
                            'The grades are all up.',
                            'Annas score was higher than johns',
                            'Paul did just fine.'],
                   'score': [0.2, 4.3, 6.3, 1.2, 0.9, 5.4],
                   'score_label': ['h', 'h', 'b', 'h', 'h', 'b']})

I tried the following code, but it doesn't work:

key = ['john', 'Anna']
df['text'].apply(lambda x: df['text'].str.lower().str.contains('key').value_counts() for x in df['score_label'])

I also tried the following loop:

def term_count(terms):
    print(df_btw_all['text'].str.lower().str.contains(terms).sum()   
key = ['john', 'anna']
for k in key:
    if df.loc[df['score_label']=='b']:
        term_count(k)

But it raises ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I would appreciate it if anyone could suggest a fix.

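I'm not quite sure what you are trying to achieve, and I haven't fully fixed the code. Please rewrite the question so that it is easier to understand.

I fixed some general problems in your code (a missing parenthesis, a wrong variable name), but it still doesn't do what you want.

I also added all() to your if statement. An if statement can only handle a single boolean value, whereas df.loc[df['score_label']=='b'] returns a DataFrame (the rows selected by the boolean mask df['score_label']=='b'), which has no single truth value.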

import pandas as pd
df = pd.DataFrame({'text': [['John passed the course'],         
                            ['The highest score was Annas'],
                            [],
                            ['The grades are all up.'],
                            ['Annas score was higher than johns'],
                            ['Paul did just fine.']],
                   'score': [0.2, 4.3, 6.3, 1.2, 0.9, 5.4],
                   'score_label': [['h'], ['h'], ['b'], ['h'], ['h'], ['b']]
                                   })
def term_count(terms):
    print(df['text'].str.lower().str.contains(terms).sum())
    
key = ['john', 'anna']
for k in key:
    if all(df.loc[df['score_label']=='b']):
        term_count(k)
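For comparison, here is a minimal sketch of how the loop idea could be made to work. It assumes the 'text' and 'score_label' columns hold plain strings, as in the question's sample data, rather than lists. The extra `frame` parameter on term_count and the `counts` dictionary are my own additions for illustration. Instead of testing a DataFrame in an if statement, it filters the rows per label first and checks `.empty` explicitly:

```python
import pandas as pd

# Sample data from the question, with plain strings in each cell.
df = pd.DataFrame({
    'text': ['John passed the course',
             'The highest score was Annas',
             '',
             'The grades are all up.',
             'Annas score was higher than johns',
             'Paul did just fine.'],
    'score_label': ['h', 'h', 'b', 'h', 'h', 'b'],
})

def term_count(frame, term):
    """Count the rows of `frame` whose text contains `term`, ignoring case."""
    # regex=False treats the term as a literal substring.
    return int(frame['text'].str.contains(term, case=False, regex=False).sum())

keys = ['john', 'anna']
counts = {}
for label in ['b', 'h']:
    # Boolean mask -> DataFrame holding only the rows for this label.
    subset = df.loc[df['score_label'] == label]
    # A DataFrame has no single truth value, so test emptiness explicitly.
    if not subset.empty:
        for k in keys:
            counts[(label, k)] = term_count(subset, k)

print(counts)
```

Note that this counts each term separately per label, so a row mentioning both 'john' and 'anna' contributes to both counts.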

I want to count the number of rows of the column 'text' that contain my list of key terms for each of the values in the 'score_label' column, i.e., for 'b' and 'h' separately.

Are you looking for something like this:

keys = ['john', 'Anna']
pattern = r"(?i)" + "|".join(keys)
result = df["text"].str.contains(pattern).groupby(df["score_label"]).sum()

Result:

score_label
b    0
h    3
Name: text, dtype: int64

Or (with the same pattern):

df["match"] = df["text"].str.contains(pattern)
result = df.groupby("score_label")[["match"]].sum()

Result:

             match
score_label       
b                0
h                3
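Since the original attempt used value_counts(normalize=True), you may also want the share of matching rows per label. The following is only a sketch under a few assumptions of mine: the keys are escaped with re.escape in case a term ever contains regex metacharacters, missing text is treated as "no match" via na=False, and the column names `n_match` and `share` are made up for illustration:

```python
import re
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'text': ['John passed the course',
             'The highest score was Annas',
             '',
             'The grades are all up.',
             'Annas score was higher than johns',
             'Paul did just fine.'],
    'score_label': ['h', 'h', 'b', 'h', 'h', 'b'],
})

keys = ['john', 'Anna']
# Escape each key so it is matched literally, then join as alternatives.
pattern = "|".join(re.escape(k) for k in keys)

# 1 if the row contains any key (case-insensitive), else 0.
match = df['text'].str.contains(pattern, case=False, na=False).astype(int)

# Per label: number of matching rows and fraction of matching rows.
summary = match.groupby(df['score_label']).agg(n_match='sum', share='mean')
print(summary)
```

The mean of a 0/1 column is the fraction of rows that matched, which plays the role of normalize=True within each group.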