Aggregation function optimization
I have a dataset named customer_base with more than 800K rows, which looks like this:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 64 | 100 | "occ2" |
2 | 66 | 100 | NaN |
2 | NaN | 100 | "occ2" |
3 | NaN | 101 | "occ3" |
3 | NaN | NaN | NaN |
3 | 32 | NaN | NaN |
. | . | . | . |
After the groupby operation, the desired result should look like this:
ID | AGE | GENDER | OCCUPATION |
---|---|---|---|
1 | 64 | 101 | "occ1" |
2 | 66 | 100 | "occ2" |
3 | 32 | 101 | "occ3" |
. | . | . | . |
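I previously tried the code below to get as clean a table as possible, but it takes too long. Now I need a faster function that returns any available value from the OCCUPATION column.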
    import numpy as np

    customer_base.groupby("ID", as_index=False).agg({
        "GENDER": "max",
        "AGE": "max",
        # first non-null occupation in the group, or NaN if the group has none
        "OCCUPATION": lambda x: np.nan if len(x[x.notna()]) == 0 else x[x.notna()].values[0],
    })
Thanks in advance for any optimization ideas, and sorry if this is a duplicate question.
Use GroupBy.first to get the first non-NaN value:
    df = customer_base.groupby("ID", as_index=False).agg({
        "AGE": "max",
        "GENDER": "max",
        "OCCUPATION": "first",
    })
    print(df)

       ID   AGE  GENDER OCCUPATION
    0   1  64.0   101.0     "occ1"
    1   2  66.0   100.0     "occ2"
    2   3  32.0   101.0     "occ3"
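To check that the two approaches agree, here is a minimal self-contained sketch. It rebuilds only the seven sample rows shown in the question (the occupation strings are written without the literal quotes, which is an assumption about the real data), and `first_valid` is a hypothetical helper that mirrors the question's lambda. It is an illustration under those assumptions, not a benchmark on the full 800K-row dataset.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table (assumed representative).
customer_base = pd.DataFrame({
    "ID": [1, 2, 2, 2, 3, 3, 3],
    "AGE": [64, 64, 66, np.nan, np.nan, np.nan, 32],
    "GENDER": [101, 100, 100, 100, 101, np.nan, np.nan],
    "OCCUPATION": ["occ1", "occ2", np.nan, "occ2", "occ3", np.nan, np.nan],
})


def first_valid(s):
    """Hypothetical helper: same logic as the question's lambda."""
    valid = s.dropna()
    return valid.iloc[0] if len(valid) else np.nan


# Original approach: a Python-level function is called once per group.
slow = customer_base.groupby("ID", as_index=False).agg(
    {"AGE": "max", "GENDER": "max", "OCCUPATION": first_valid}
)

# Suggested approach: the built-in "first" aggregation, which takes the
# first non-null entry of each group.
fast = customer_base.groupby("ID", as_index=False).agg(
    {"AGE": "max", "GENDER": "max", "OCCUPATION": "first"}
)

print(fast)
```

The speedup comes from passing the string name "first" instead of a Python callable: pandas can then use its built-in group aggregation rather than invoking a function for every ID. How large the gain is on the real data depends on the number of distinct IDs, so it is worth timing both versions on a sample.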