T-test for groups within a Pandas dataframe for a unique id
I have the following dataframe, and for each id I am running a t-test between all weekdays and all weekend days in a month.
+-----+------------+-----------+---------+-----------+
| id  | usage_day  | dow       | tow     | daily_avg |
+-----+------------+-----------+---------+-----------+
| c96 | 01/09/2020 | Tuesday   | week    | 393.07    |
| c96 | 02/09/2020 | Wednesday | week    | 10.38     |
| c96 | 03/09/2020 | Thursday  | week    | 429.35    |
| c96 | 04/09/2020 | Friday    | week    | 156.20    |
| c96 | 05/09/2020 | Saturday  | weekend | 346.22    |
| c96 | 06/09/2020 | Sunday    | weekend | 106.53    |
| c96 | 08/09/2020 | Tuesday   | week    | 194.74    |
| c96 | 10/09/2020 | Thursday  | week    | 66.30     |
| c96 | 17/09/2020 | Thursday  | week    | 163.84    |
| c96 | 18/09/2020 | Friday    | week    | 261.81    |
| c96 | 19/09/2020 | Saturday  | weekend | 410.30    |
| c96 | 20/09/2020 | Sunday    | weekend | 266.28    |
| c96 | 23/09/2020 | Wednesday | week    | 346.18    |
| c96 | 24/09/2020 | Thursday  | week    | 20.67     |
| c96 | 25/09/2020 | Friday    | week    | 222.23    |
| c96 | 26/09/2020 | Saturday  | weekend | 449.84    |
| c96 | 27/09/2020 | Sunday    | weekend | 438.47    |
| c96 | 28/09/2020 | Monday    | week    | 10.44     |
| c96 | 29/09/2020 | Tuesday   | week    | 293.59    |
| c96 | 30/09/2020 | Wednesday | week    | 194.49    |
+-----+------------+-----------+---------+-----------+
My script is below. Unfortunately it is too slow, and it is not the idiomatic pandas way of doing this.
How can I do this more efficiently?
from scipy.stats import ttest_ind, ttest_ind_from_stats

p_val = []
stat_flag = []
all_ids = df.id.unique()
alpha = 0.05
print(len(all_ids))
for id in all_ids:
    t = df[df.id == id]
    group1 = t[t.tow == 'week']
    group2 = t[t.tow == 'weekend']
    t, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
    if p_value_ttest < alpha:
        p_val.append(p_value_ttest)
        stat_flag.append(1)
    else:
        p_val.append(p_value_ttest)
        stat_flag.append(0)
p_val holds the p-value for each id.
I can't benchmark this without sample data, but you could try iterating over groupby instead of filtering the frame inside a for loop:
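# Reuses ttest_ind, alpha, p_val and stat_flag from the question
for group_id, sub in df.groupby('id'):
    group1 = sub[sub.tow == 'week']
    group2 = sub[sub.tow == 'weekend']
    # Welch's t-test (unequal variances) between weekdays and weekend days
    t_stat, p_value_ttest = ttest_ind(group1.daily_avg, group2.daily_avg, equal_var=False)
    p_val.append(p_value_ttest)
    # 1 if the difference is significant at level alpha, else 0
    stat_flag.append(1 if p_value_ttest < alpha else 0)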
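If the Python-level call per id is still the bottleneck, one possible alternative is to compute the mean, standard deviation and count per (id, tow) in a single groupby pass and hand the resulting arrays to `ttest_ind_from_stats`, which the question already imports. This is only a sketch and is not the approach proposed in the answers: the values are copied from the question's sample, and the second id `c97` is a made-up copy added purely for illustration.

```python
import pandas as pd
from scipy.stats import ttest_ind_from_stats

# daily_avg values from the question's sample, split by type of week
week = [393.07, 10.38, 429.35, 156.20, 194.74, 66.30, 163.84,
        261.81, 346.18, 20.67, 222.23, 10.44, 293.59, 194.49]
weekend = [346.22, 106.53, 410.30, 266.28, 449.84, 438.47]
one_id = pd.DataFrame({
    "tow": ["week"] * len(week) + ["weekend"] * len(weekend),
    "daily_avg": week + weekend,
})
# "c97" is a hypothetical second id that reuses the same values
df = pd.concat([one_id.assign(id="c96"), one_id.assign(id="c97")],
               ignore_index=True)

# One pass over the data: mean, sample std (ddof=1) and count per (id, tow),
# then one row per id with a (statistic, tow) column pair
summary = (df.groupby(["id", "tow"])["daily_avg"]
             .agg(["mean", "std", "count"])
             .unstack("tow"))

# Welch's t-test for all ids at once, computed from the summary statistics
res = ttest_ind_from_stats(
    mean1=summary[("mean", "week")].to_numpy(),
    std1=summary[("std", "week")].to_numpy(),
    nobs1=summary[("count", "week")].to_numpy(),
    mean2=summary[("mean", "weekend")].to_numpy(),
    std2=summary[("std", "weekend")].to_numpy(),
    nobs2=summary[("count", "weekend")].to_numpy(),
    equal_var=False,
)
result = pd.DataFrame({"statistic": res.statistic, "pvalue": res.pvalue},
                      index=summary.index)
print(result)
```

An id that lacks either weekday or weekend rows ends up with missing summary values, so its result should be checked separately.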
Dataset
Based on the dataset you provided:
import io
from scipy import stats
import pandas as pd
s = """id|usage_day|dow|tow|daily_avg
c96|01/09/2020|Tuesday|week|393.07
c96|02/09/2020|Wednesday|week|10.38
c96|03/09/2020|Thursday|week|429.35
c96|04/09/2020|Friday|week|156.20
c96|05/09/2020|Saturday|weekend|346.22
c96|06/09/2020|Sunday|weekend|106.53
c96|08/09/2020|Tuesday|week|194.74
c96|10/09/2020|Thursday|week|66.30
c96|17/09/2020|Thursday|week|163.84
c96|18/09/2020|Friday|week|261.81
c96|19/09/2020|Saturday|weekend|410.30
c96|20/09/2020|Sunday|weekend|266.28
c96|23/09/2020|Wednesday|week|346.18
c96|24/09/2020|Thursday|week|20.67
c96|25/09/2020|Friday|week|222.23
c96|26/09/2020|Saturday|weekend|449.84
c96|27/09/2020|Sunday|weekend|438.47
c96|28/09/2020|Monday|week|10.44
c96|29/09/2020|Tuesday|week|293.59
c96|30/09/2020|Wednesday|week|194.49"""
df = pd.read_csv(io.StringIO(s), sep='|')
To make the groupby step clearer, I added a second id with the same data:
df2 = df.copy()
df2['id'] = 'c97'
df = pd.concat([df, df2])
MCVE
You do not have to resort to any explicit loop. Instead, take advantage of the apply method, which operates on frames and also works with groupby.
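To do so, we define a function that performs the desired test on a DataFrame (groupby will call it once for each sub-frame corresponding to a value of the grouping key):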
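def ttest(x):
    # Collect the daily_avg values of each type of week into a list
    g = x.groupby('tow').agg({'daily_avg': list})
    r = stats.ttest_ind(g.loc['week', 'daily_avg'], g.loc['weekend', 'daily_avg'], equal_var=False)
    # Turn the result tuple into a Series so that apply builds one row per group
    s = {k: getattr(r, k) for k in r._fields}
    return pd.Series(s)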
Then it is enough to chain apply after the groupby call:
T = df.groupby('id').apply(ttest)
The result is approximately:
     statistic    pvalue
id
c96  -2.128753  0.059126
c97  -2.128753  0.059126
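The question also builds a `stat_flag` list next to the p-values. With the apply result this becomes a single column assignment. The sketch below is self-contained: it rebuilds the sample values for `c96` only, uses a helper named `welch` (a name chosen here, not taken from the answer), and reads `statistic` and `pvalue` directly from the SciPy result.

```python
import pandas as pd
from scipy import stats

# daily_avg values from the question's sample, split by type of week
week = [393.07, 10.38, 429.35, 156.20, 194.74, 66.30, 163.84,
        261.81, 346.18, 20.67, 222.23, 10.44, 293.59, 194.49]
weekend = [346.22, 106.53, 410.30, 266.28, 449.84, 438.47]
df = pd.DataFrame({
    "id": "c96",
    "tow": ["week"] * len(week) + ["weekend"] * len(weekend),
    "daily_avg": week + weekend,
})

def welch(x):
    # Welch's t-test between weekday and weekend values of one id
    r = stats.ttest_ind(x.loc[x.tow == "week", "daily_avg"],
                        x.loc[x.tow == "weekend", "daily_avg"],
                        equal_var=False)
    return pd.Series({"statistic": r.statistic, "pvalue": r.pvalue})

alpha = 0.05  # significance level used in the question

# Select the needed columns explicitly so the grouping column is not passed along
T = df.groupby("id")[["tow", "daily_avg"]].apply(welch)

# 1 where the difference is significant at level alpha, 0 otherwise
T["stat_flag"] = (T["pvalue"] < alpha).astype(int)
print(T)
```

For the sample data the p-value of about 0.059 is above 0.05, so the flag is 0.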
Refactoring
Once you see how powerful this approach is, you can refactor the code above into reusable functions, for example:
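def ttest(x, y):
    # Welch's t-test (unequal variances), as in the question
    return stats.ttest_ind(x, y, equal_var=False)

def apply_test(x, subgroup='tow', value='daily_avg', key1='week', key2='weekend', test=ttest):
    # Collect the values of each subgroup into a list, then run the chosen test
    g = x.groupby(subgroup).agg({value: list})
    r = test(g.loc[key1, value], g.loc[key2, value])
    return pd.Series({k: getattr(r, k) for k in r._fields})

# Example call for a differently shaped frame; these column and key names are placeholders
T = df.groupby('id').apply(apply_test, subgroup='anotherbucket', key1='experience', key2='reference', value='threshold')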
This lets you adapt both the statistical test and the DataFrame columns as needed.
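As an illustration of swapping the test, here is a runnable variant on the sample data that passes a Mann-Whitney U test through the `test` argument. It is a sketch under some assumptions: `mann_whitney` and `welch` are names chosen here, the second id `c97` is a made-up copy of `c96`, and this version of `apply_test` selects the two subgroups with boolean masks and reads `statistic` and `pvalue` explicitly instead of relying on `_fields`.

```python
import pandas as pd
from scipy import stats

# daily_avg values from the question's sample, split by type of week
week = [393.07, 10.38, 429.35, 156.20, 194.74, 66.30, 163.84,
        261.81, 346.18, 20.67, 222.23, 10.44, 293.59, 194.49]
weekend = [346.22, 106.53, 410.30, 266.28, 449.84, 438.47]
one_id = pd.DataFrame({
    "tow": ["week"] * len(week) + ["weekend"] * len(weekend),
    "daily_avg": week + weekend,
})
# "c97" is a hypothetical second id that reuses the same values
df = pd.concat([one_id.assign(id="c96"), one_id.assign(id="c97")],
               ignore_index=True)

def welch(x, y):
    # Welch's t-test (unequal variances)
    return stats.ttest_ind(x, y, equal_var=False)

def mann_whitney(x, y):
    # Rank-based alternative that does not assume normality
    return stats.mannwhitneyu(x, y, alternative="two-sided")

def apply_test(x, subgroup="tow", value="daily_avg",
               key1="week", key2="weekend", test=welch):
    # Pick the two samples to compare inside one id, then run the chosen test
    a = x.loc[x[subgroup] == key1, value]
    b = x.loc[x[subgroup] == key2, value]
    r = test(a, b)
    return pd.Series({"statistic": r.statistic, "pvalue": r.pvalue})

cols = ["tow", "daily_avg"]
T_welch = df.groupby("id")[cols].apply(apply_test)
T_mwu = df.groupby("id")[cols].apply(apply_test, test=mann_whitney)
print(T_welch)
print(T_mwu)
```

Extra keyword arguments given to apply are forwarded to the function, which is what makes `test=mann_whitney` work without changing `apply_test`.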