How to count word occurrences in a text column against values in another column?
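I have a data frame with a column of text, a column holding each text's score (score), and a column (score_label) that labels each row 'b' or 'h' according to its score. The texts are sentences that may or may not contain the key terms I am looking for. I want to count the rows of the 'text' column that contain any of my key terms, separately for each value in the 'score_label' column, i.e., for 'b' and 'h'.
I am trying to modify the following code so that it gives the value counts per score_label: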
df['text'].str.lower().str.contains('key').value_counts(normalize=True)
Here is a sample data frame:
df = pd.DataFrame({'id': [10, 46, 75, 12, 99, 84],
                   'text': ['John passed the course',
                            'The highest score was Annas',
                            '',
                            'The grades are all up.',
                            'Annas score was higher than johns',
                            'Paul did just fine.'],
                   'score': [0.2, 4.3, 6.3, 1.2, 0.9, 5.4],
                   'score_label': ['h', 'h', 'b', 'h', 'h', 'b']
                   })
I tried the following code, but it does not work:
key = ['john', 'Anna']
df['text'].apply(lambda x: df['text'].str.lower().str.contains('key').value_counts() for x in df['score_label'])
I also tried the following loop:
def term_count(terms):
    print(df_btw_all['text'].str.lower().str.contains(terms).sum()
key = ['john', 'anna']
for k in key:
    if df.loc[df['score_label']=='b']:
        term_count(k)
But it throws a ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I would appreciate it if someone could suggest a fix.
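I'm not quite sure what you are trying to achieve, so I have not fully fixed the code. Please rewrite the question so that it is easier to understand.
I fixed some general problems in your code (a missing parenthesis, a wrong variable name), but it still does not run.
I also added all to your if statement. An if can only handle a single boolean value, whereas df.loc[df['score_label']=='b'] returns a DataFrame (the rows selected by a boolean Series), which has no single truth value.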
import pandas as pd
df = pd.DataFrame({'text': [['John passed the course'],
                            ['The highest score was Annas'],
                            [],
                            ['The grades are all up.'],
                            ['Annas score was higher than johns'],
                            ['Paul did just fine.']],
                   'score': [0.2, 4.3, 6.3, 1.2, 0.9, 5.4],
                   'score_label': [['h'], ['h'], ['b'], ['h'], ['h'], ['b']]
                   })
def term_count(terms):
    print(df['text'].str.lower().str.contains(terms).sum())
key = ['john', 'anna']
for k in key:
    if all(df.loc[df['score_label']=='b']):
        term_count(k)
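For reference, a runnable variant of this loop could look like the sketch below. It assumes the 'text' and 'score_label' columns hold plain strings, as in the question's sample frame, rather than lists, and it filters rows with a boolean mask instead of testing a DataFrame in an if statement. The helper name term_count_by_label is hypothetical, and the 'id' and 'score' columns are omitted because the count does not use them.

```python
import pandas as pd

# Sample data from the question (assumption: plain strings, not lists).
df = pd.DataFrame({
    'text': ['John passed the course',
             'The highest score was Annas',
             '',
             'The grades are all up.',
             'Annas score was higher than johns',
             'Paul did just fine.'],
    'score_label': ['h', 'h', 'b', 'h', 'h', 'b'],
})


def term_count_by_label(frame, term, label):
    """Count rows with the given label whose text contains `term`, ignoring case."""
    mask = frame['score_label'] == label         # boolean Series, one value per row
    texts = frame.loc[mask, 'text'].str.lower()  # keep only the rows for this label
    return int(texts.str.contains(term.lower(), regex=False).sum())


key = ['john', 'anna']
# One count per (label, term) pair.
counts = {(label, k): term_count_by_label(df, k, label)
          for label in ['b', 'h'] for k in key}
print(counts)
```

Note that this counts each term on its own, so a row containing both 'john' and 'anna' contributes to both counts.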
I want to count the number of rows of the column 'text' that contain my list of key terms for each of the values in the 'score_label' column, i.e., for 'b' and 'h' separately.
Are you looking for something like this:
keys = ['john', 'Anna']
pattern = r"(?i)" + "|".join(keys)
result = df["text"].str.contains(pattern).groupby(df["score_label"]).sum()
Result:
score_label
b    0
h    3
Name: text, dtype: int64
Or (with the same pattern):
df["match"] = df["text"].str.contains(pattern)
result = df.groupby("score_label")[["match"]].sum()
Result:
             match
score_label       
b                0
h                3
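Since the question started from value_counts(normalize=True), the same grouping idea can also give the share of matching rows per label rather than the raw count. The sketch below is one possible way to do that, not part of the original answer: it takes the group-wise mean of the boolean match column, and it escapes the key terms with re.escape on the assumption that they should be matched literally. The variable names matches and share are illustrative.

```python
import re

import pandas as pd

# Sample data from the question ('id' and 'score' omitted, they are not needed here).
df = pd.DataFrame({
    'text': ['John passed the course',
             'The highest score was Annas',
             '',
             'The grades are all up.',
             'Annas score was higher than johns',
             'Paul did just fine.'],
    'score_label': ['h', 'h', 'b', 'h', 'h', 'b'],
})

keys = ['john', 'Anna']
# Escape each term so regex metacharacters in a key are matched literally.
pattern = "|".join(re.escape(k) for k in keys)

# True where the text contains at least one key term, ignoring case.
matches = df['text'].str.contains(pattern, case=False, regex=True)

# The mean of a boolean column is the fraction of True values in each group.
share = matches.groupby(df['score_label']).mean()
print(share)
```

Replacing .mean() with .sum() gives back the per-label counts shown above.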