Pandas combine two group by's, filter and merge the groups (counts)
I have a dataframe and I need to combine two different groupbys, with a filter applied to one of them.
ID EVENT SUCCESS
1 PUT Y
2 POST Y
2 PUT N
1 DELETE Y
The table below is what I want the data to look like. First, group by counting 'EVENT', and second, count the number of successes ('Y') per ID:
ID PUT POST DELETE SUCCESS
1 1 0 1 2
2 1 1 0 1
I've tried a few techniques, and the closest I've found are two different approaches, which produce the following results.
group_df = df.groupby(['ID', 'EVENT'])
count_group_df = group_df.size().unstack()
Counting 'EVENT' produces the following result:
ID PUT POST DELETE
1 1 0 1
2 1 1 0
For the success count with the filter, I don't know if I can join this back to the first group on 'ID':
df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']]
count_group_df_2 = df_success.groupby(['ID', 'SUCCESS']).size()
ID SUCCESS
1 2
2 1
Do I need to combine these somehow?
Additionally, I'd also like to merge two of the counts in 'EVENT' (e.g. PUT and POST) into a single column.
Merge them together using concat:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
df = pd.concat([df1, df_success],axis=1)
print (df)
DELETE POST PUT SUCCESS
ID
1 1 0 1 2
2 0 1 1 1
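The SUCCESS column above comes from summing a boolean mask; since True sums as 1, grouping the mask by ID counts the 'Y' rows. A minimal illustration of that idiom:

```python
import pandas as pd

success = pd.Series(['Y', 'N', 'Y', 'Y'])
ids = pd.Series([1, 1, 2, 1])
# A boolean Series sums as 0/1, so grouping the mask by ID counts the 'Y' rows
counts = (success == 'Y').groupby(ids).sum().astype(int)
print(counts)
```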
Another solution with value_counts:
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
df = pd.concat([df1, df_success],axis=1)
print (df)
DELETE POST PUT SUCCESS
ID
1 1 0 1 2
2 0 1 1 1
Finally, you can convert the index to a column and remove the column name ID via reset_index + rename_axis:
df = df.reset_index().rename_axis(None, axis=1)
print (df)
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
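As for merging two of the event counts (e.g. PUT and POST) into one column, as the question asks: one way is to sum the unstacked columns after counting. A sketch, where 'WRITE' is a hypothetical name chosen here for the merged column:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

out = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
# 'WRITE' is a hypothetical name for the merged PUT+POST column
out['WRITE'] = out['PUT'] + out['POST']
out = out.drop(columns=['PUT', 'POST']).reset_index().rename_axis(None, axis=1)
print(out)
```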
A pandas-only solution with get_dummies:
pd.get_dummies(df.EVENT) \
.assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)) \
.groupby(df.ID).sum().reset_index()
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
With numpy and pandas:
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
ID DELETE POST PUT SUCCESS
0 1 1 0 1 2
1 2 0 1 1 1
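The `np.eye(n)[f]` step above is a one-hot encoding trick: indexing the rows of an identity matrix by the integer codes from factorize yields one row of one-hot values per observation. A small standalone demonstration:

```python
import numpy as np
import pandas as pd

events = pd.Series(['PUT', 'POST', 'PUT', 'DELETE'])
# factorize maps labels to integer codes (f) plus the unique labels (u)
f, u = pd.factorize(events.values)
# Indexing the rows of an identity matrix by the codes yields one-hot rows
one_hot = np.eye(u.size)[f]
print(one_hot)
```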
Timings
Small data
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
1000 loops, best of 3: 1.32 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.3 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 3.28 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 2.62 ms per loop
Large data
df = pd.DataFrame(dict(
ID=np.random.randint(100, size=100000),
EVENT=np.random.choice('PUT POST DELETE'.split(), size=100000),
SUCCESS=np.random.choice(list('YN'), size=100000)
))
%%timeit
f, u = pd.factorize(df.EVENT.values)
n = u.size
d = np.eye(n)[f]
s = (df.SUCCESS.values == 'Y').astype(int)
d1 = pd.DataFrame(
np.column_stack([d, s]),
df.index, np.append(u, 'SUCCESS')
)
d1.groupby(df.ID).sum().reset_index()
100 loops, best of 3: 10.8 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.7 ms per loop
%%timeit
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
pd.concat([df1, df_success],axis=1).reset_index()
100 loops, best of 3: 17.4 ms per loop
%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index()
100 loops, best of 3: 16.8 ms per loop
Using pivot_table and a dataframe filter:
df = pd.DataFrame([
    {"ID": 1, "EVENT": "PUT", "SUCCESS": "Y"},
    {"ID": 2, "EVENT": "POST", "SUCCESS": "Y"},
    {"ID": 2, "EVENT": "PUT", "SUCCESS": "N"},
    {"ID": 1, "EVENT": "DELETE", "SUCCESS": "Y"},
])

mask = df['SUCCESS'] == 'Y'

# Success counts per ID
success_counts = df[mask].groupby('ID')['EVENT'].size().reset_index()
print(success_counts)

# Event counts per ID, restricted to successful rows
event = df[mask].pivot_table(index='ID', columns='EVENT', values='SUCCESS',
                             aggfunc='count', fill_value=0)
print(event)
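As a further sketch, not used in any of the answers above: pd.crosstab can build the event-count table directly, and the success counts can then be joined on the ID index.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})

counts = pd.crosstab(df['ID'], df['EVENT'])          # event counts per ID
success = df['SUCCESS'].eq('Y').groupby(df['ID']).sum().rename('SUCCESS')
result = counts.join(success).reset_index().rename_axis(None, axis=1)
print(result)
```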