Pandas: how to apply a function to groupby().first()
I have a df, created with this code:
import pandas as pd
from io import StringIO

df = """
ValOption RB test contrat
0 SLA 4 3 23
1 AC 5 4 12
2 SLA 5 5 23
3 AC 2 4 39
4 SLA 5 5 26
5 AC 3 4 52
6 SLA 4 3 64
0 SLA 4 3 23
1 AC 5 4 12
2 SLA 5 5 23
3 AC 2 4 39
4 SLA 5 5 26
5 AC 5 4 52
6 SLA 4 3 64
"""
# raw string so that \s is not treated as a string escape
df = pd.read_csv(StringIO(df.strip()), sep=r'\s+')
Output:
ValOption RB test contrat
0 SLA 4 3 23
1 AC 5 4 12
2 SLA 5 5 23
3 AC 2 4 39
4 SLA 5 5 26
5 AC 3 4 52
6 SLA 4 3 64
0 SLA 4 3 23
1 AC 5 4 12
2 SLA 5 5 23
3 AC 2 4 39
4 SLA 5 5 26
5 AC 5 4 52
6 SLA 4 3 64
Now I group it and take the first row of each group:
df_u = df.groupby(['RB', 'test']).first()
Output:
Then I want to apply a function to each row. For reasons of my own I have to use DataFrame.apply():
def func(row):
    v1 = row['RB'] * 3
    v2 = row['test'] - 1
    return v1 + v2

df_u['new_col'] = df_u.apply(lambda row: func(row), axis=1)
Note: in the real business case the function is long and complex, so it has to go through apply().
Then I get an error:
KeyError: ('RB', 'occurred at index (2, 4)')
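After the groupby, 'RB' and 'test' are the index of df_u rather than columns, so row['RB'] raises the KeyError. You have to reset_index to access 'RB' and 'test' in each row. Use .values to assign the result to new_col by position, because the reset index no longer aligns with the index of df_u: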
df_u['new_col'] = df_u.reset_index().apply(func, axis=1).values
print(df_u)
# Output:
        ValOption  contrat  new_col
RB test
2  4           AC       39        9
3  4           AC       52       12
4  3          SLA       23       14
5  4           AC       12       18
   5          SLA       23       19
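As a possible alternative to reset_index, the group keys can be kept reachable in two other ways. The sketch below is illustrative only: it rebuilds the first seven sample rows, and the names `df_i` and `key_row` are assumptions introduced here, not part of the original answer. The second variant works only if the function needs nothing beyond the group keys.

```python
import pandas as pd

# First seven rows of the sample data from the question.
df = pd.DataFrame({
    "ValOption": ["SLA", "AC", "SLA", "AC", "SLA", "AC", "SLA"],
    "RB": [4, 5, 5, 2, 5, 3, 4],
    "test": [3, 4, 5, 4, 5, 4, 3],
    "contrat": [23, 12, 23, 39, 26, 52, 64],
})

def func(row):
    return row["RB"] * 3 + (row["test"] - 1)

# Variant 1: as_index=False keeps RB and test as ordinary columns,
# so row['RB'] and row['test'] resolve inside apply.
df_u = df.groupby(["RB", "test"], as_index=False).first()
df_u["new_col"] = df_u.apply(func, axis=1)

# Variant 2: keep the MultiIndex and read the keys from row.name,
# which is the (RB, test) tuple for each row.
df_i = df.groupby(["RB", "test"]).first()

def key_row(row):
    # Map index level names to this row's index values.
    return dict(zip(df_i.index.names, row.name))

df_i["new_col"] = df_i.apply(lambda row: func(key_row(row)), axis=1)

print(df_u)
print(df_i)
```

Variant 1 changes the shape of the result (a flat frame with a default integer index), so it suits cases where the MultiIndex is not needed afterwards.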
Update
How do I get new_col back into the original df?
df = df.merge(df.drop_duplicates(['RB', 'test'])
                .assign(new_col=func)[['RB', 'test', 'new_col']],
              on=['RB', 'test'], how='left')
# Output
ValOption RB test contrat new_col
0 SLA 4 3 23 14
1 AC 5 4 12 18
2 SLA 5 5 23 19
3 AC 2 4 39 9
4 SLA 5 5 26 19
5 AC 3 4 52 12
6 SLA 4 3 64 14
7 SLA 4 3 23 14
8 AC 5 4 12 18
9 SLA 5 5 23 19
10 AC 2 4 39 9
11 SLA 5 5 26 19
12 AC 5 4 52 18
13 SLA 4 3 64 14
Update 2
I use drop_duplicates to save time. The frame has 60k rows, so applying the function to every row takes a long time. If I drop the duplicates first, I don't need to apply the function to each row; I can assign the computed value directly to every row that has the same column values.
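The idea in the paragraph above can be sketched as follows: run the row-wise function once per distinct ('RB', 'test') pair, then merge the result back onto every row. This is only a minimal sketch under the assumption that the function depends on nothing but those two columns; the names `keys` and `out` are illustrative and not from the original post, and only the first seven sample rows are used.

```python
import pandas as pd

# First seven rows of the sample data from the question.
df = pd.DataFrame({
    "ValOption": ["SLA", "AC", "SLA", "AC", "SLA", "AC", "SLA"],
    "RB": [4, 5, 5, 2, 5, 3, 4],
    "test": [3, 4, 5, 4, 5, 4, 3],
    "contrat": [23, 12, 23, 39, 26, 52, 64],
})

def func(row):
    # Stand-in for the long business function from the question.
    return row["RB"] * 3 + (row["test"] - 1)

# One row per distinct key pair; apply runs only on these rows.
keys = (
    df[["RB", "test"]]
    .drop_duplicates()
    .assign(new_col=lambda d: d.apply(func, axis=1))
)

# A left merge broadcasts each computed value to all rows sharing the keys
# and preserves the row order of df.
out = df.merge(keys, on=["RB", "test"], how="left")
print(out)
```

The saving is proportional to how many duplicate key pairs the data contains; if nearly every row has a distinct pair, this buys little over a plain apply.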
apply is essentially a for loop over the rows; use vectorization instead:
df['new_col'] = (df['RB']*3) + (df['test']-1)
Performance
For 140,000 records (df1), the operation above takes 361 microseconds:
%timeit (df1['RB']*3) + (df1['test']-1)
361 µs ± 9.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For 14 records (not a typo), the previous operation takes 935 microseconds:
%timeit df.drop_duplicates(['RB', 'test']).apply(func, axis=1)
935 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
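Outside IPython, where %timeit is not available, a comparison along these lines could be reproduced with the standard timeit module. This is a rough sketch on synthetic data: the frame `big`, its size and the random values are assumptions made for illustration, and the timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for a larger frame; values are random, not the real data.
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({
    "RB": rng.integers(1, 6, n),
    "test": rng.integers(1, 6, n),
})

def func(row):
    return row["RB"] * 3 + (row["test"] - 1)

def with_apply():
    # Row-by-row: func is called once per row in Python.
    return big.apply(func, axis=1)

def vectorized():
    # Column arithmetic: the loop runs inside NumPy.
    return big["RB"] * 3 + (big["test"] - 1)

t_apply = timeit.timeit(with_apply, number=3) / 3
t_vec = timeit.timeit(vectorized, number=3) / 3
print(f"apply:      {t_apply * 1e3:.2f} ms per call")
print(f"vectorized: {t_vec * 1e3:.2f} ms per call")
```

Both functions return the same values, so the vectorized form is a drop-in replacement whenever the function can be expressed as column arithmetic.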