Faster method of standardizing DF
I have a df with roughly 3000 variables and 14000 data points.
I need to standardize the df both within groups and across the whole df, creating 6000 variables in total.
My current implementation is as follows:
col_names = df.columns.to_list()
col_names.remove('id')
for col in col_names:
    df[col + '_id'] = df.groupby('id')[col].transform(lambda x: (x - x.mean()) / x.std())
    df[col] = (df[col] - df[col].mean()) / df[col].std()
The code above takes forever to run.
Timing the two operations separately shows that the groupby-transform is significantly slower.
Here is a simple example df together with the desired output.
dic = {'id': [1,1,1, 2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5], 'a': [3,4,2,5,6,7,5,4,3,5,7,5,2,4,8,6,2,3,4,6], 'b': [12,32,21,14,52,62,12,34,52,74,2,34,54,12,45,75,54,23,12,32]}
df = pd.DataFrame(dic)
col_names = df.columns.to_list()
col_names.remove('id')
for col in col_names:
    df[col + '_id'] = df.groupby('id')[col].transform(lambda x: (x - x.mean()) / x.std())
    df[col] = (df[col] - df[col].mean()) / df[col].std()
id a b a_id b_id
0 1 -0.879967 -1.060367 0.000000 -0.965060
1 1 -0.312247 -0.154070 1.000000 1.031615
2 1 -1.447688 -0.652533 -1.000000 -0.066556
3 2 0.255474 -0.969737 -1.000000 -1.131971
4 2 0.823195 0.752226 0.000000 0.368549
5 2 1.390916 1.205374 1.000000 0.763422
6 3 0.255474 -1.060367 0.134840 -0.778742
7 3 -0.312247 -0.063441 -0.539360 -0.027324
8 3 -0.879967 0.752226 -1.213560 0.587472
9 3 0.255474 1.749152 0.134840 1.338890
10 3 1.390916 -1.513515 1.483240 -1.120296
11 4 0.255474 -0.063441 0.000000 -0.427765
12 4 -1.447688 0.842856 -1.341641 0.427765
13 4 -0.312247 -1.060367 -0.447214 -1.368847
14 4 1.958637 0.435022 1.341641 0.042776
15 4 0.823195 1.794467 0.447214 1.326070
16 5 -1.447688 0.842856 -1.024695 1.332707
17 5 -0.879967 -0.561904 -0.439155 -0.406826
18 5 -0.312247 -1.060367 0.146385 -1.024080
19 5 0.823195 -0.154070 1.317465 0.098199
Try set_index together with math operations to normalize the frame, and groupby-transform + add_suffix to normalize the groups, then concat:
new_df = df.set_index('id')
new_df = pd.concat((
(new_df - new_df.mean()) / new_df.std(),
new_df.groupby(level=0).transform(lambda x: (x - x.mean()) / x.std())
.add_suffix('_id')
), axis=1).reset_index()
new_df
id a b a_id b_id
0 1 -0.879967 -1.060367 0.000000 -0.965060
1 1 -0.312247 -0.154070 1.000000 1.031615
2 1 -1.447688 -0.652533 -1.000000 -0.066556
3 2 0.255474 -0.969737 -1.000000 -1.131971
4 2 0.823195 0.752226 0.000000 0.368549
5 2 1.390916 1.205374 1.000000 0.763422
6 3 0.255474 -1.060367 0.134840 -0.778742
7 3 -0.312247 -0.063441 -0.539360 -0.027324
8 3 -0.879967 0.752226 -1.213560 0.587472
9 3 0.255474 1.749152 0.134840 1.338890
10 3 1.390916 -1.513515 1.483240 -1.120296
11 4 0.255474 -0.063441 0.000000 -0.427765
12 4 -1.447688 0.842856 -1.341641 0.427765
13 4 -0.312247 -1.060367 -0.447214 -1.368847
14 4 1.958637 0.435022 1.341641 0.042776
15 4 0.823195 1.794467 0.447214 1.326070
16 5 -1.447688 0.842856 -1.024695 1.332707
17 5 -0.879967 -0.561904 -0.439155 -0.406826
18 5 -0.312247 -1.060367 0.146385 -1.024080
19 5 0.823195 -0.154070 1.317465 0.098199
Try it without the for loop:
df[[x + '_id' for x in col_names]] = df.groupby('id')[col_names].transform(lambda x: (x - x.mean()) / x.std())
df[col_names] = (df[col_names] - df[col_names].mean()) / df[col_names].std()
Output of df:
id a b a_id b_id
0 1 -0.879967 -1.060367 0.000000 -0.965060
1 1 -0.312247 -0.154070 1.000000 1.031615
2 1 -1.447688 -0.652533 -1.000000 -0.066556
3 2 0.255474 -0.969737 -1.000000 -1.131971
4 2 0.823195 0.752226 0.000000 0.368549
5 2 1.390916 1.205374 1.000000 0.763422
6 3 0.255474 -1.060367 0.134840 -0.778742
7 3 -0.312247 -0.063441 -0.539360 -0.027324
8 3 -0.879967 0.752226 -1.213560 0.587472
9 3 0.255474 1.749152 0.134840 1.338890
10 3 1.390916 -1.513515 1.483240 -1.120296
11 4 0.255474 -0.063441 0.000000 -0.427765
12 4 -1.447688 0.842856 -1.341641 0.427765
13 4 -0.312247 -1.060367 -0.447214 -1.368847
14 4 1.958637 0.435022 1.341641 0.042776
15 4 0.823195 1.794467 0.447214 1.326070
16 5 -1.447688 0.842856 -1.024695 1.332707
17 5 -0.879967 -0.561904 -0.439155 -0.406826
18 5 -0.312247 -1.060367 0.146385 -1.024080
19 5 0.823195 -0.154070 1.317465 0.098199
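The no-loop version above still calls the Python lambda once per group, which is where the time goes. A further sketch (not from the original answer, so treat it as an assumption to benchmark yourself): pass the string aggregations `'mean'` and `'std'` to `transform`, which run on pandas' optimized internal paths, and broadcast them over the frame without any Python-level callback:

```python
import pandas as pd

dic = {'id': [1,1,1, 2,2,2, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5],
       'a': [3,4,2,5,6,7,5,4,3,5,7,5,2,4,8,6,2,3,4,6],
       'b': [12,32,21,14,52,62,12,34,52,74,2,34,54,12,45,75,54,23,12,32]}
df = pd.DataFrame(dic)
col_names = [c for c in df.columns if c != 'id']

g = df.groupby('id')[col_names]
# 'mean'/'std' string transforms use pandas' optimized (Cython) code paths,
# so no Python lambda is invoked per group
group_z = ((df[col_names] - g.transform('mean')) / g.transform('std')).add_suffix('_id')
df = df.join(group_z)
df[col_names] = (df[col_names] - df[col_names].mean()) / df[col_names].std()
```

Joining on the index instead of assigning through a renamed column list keeps the alignment explicit; the result should match the tables above.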
df.sub also accepts a level argument. With the same idea, we can also try the following:
g = df.groupby("id")[col_names]
u = df.set_index("id")[col_names].sub(g.mean(), level=0).div(g.std())
out = ((df[col_names] - df[col_names].mean()).div(df[col_names].std())
       .assign(**u.add_suffix("_id").reset_index()))
print(out)
a b id a_id b_id
0 -0.879967 -1.060367 1 0.000000 -0.965060
1 -0.312247 -0.154070 1 1.000000 1.031615
2 -1.447688 -0.652533 1 -1.000000 -0.066556
3 0.255474 -0.969737 2 -1.000000 -1.131971
4 0.823195 0.752226 2 0.000000 0.368549
5 1.390916 1.205374 2 1.000000 0.763422
6 0.255474 -1.060367 3 0.134840 -0.778742
7 -0.312247 -0.063441 3 -0.539360 -0.027324
8 -0.879967 0.752226 3 -1.213560 0.587472
9 0.255474 1.749152 3 0.134840 1.338890
10 1.390916 -1.513515 3 1.483240 -1.120296
11 0.255474 -0.063441 4 0.000000 -0.427765
12 -1.447688 0.842856 4 -1.341641 0.427765
13 -0.312247 -1.060367 4 -0.447214 -1.368847
14 1.958637 0.435022 4 1.341641 0.042776
15 0.823195 1.794467 4 0.447214 1.326070
16 -1.447688 0.842856 5 -1.024695 1.332707
17 -0.879967 -0.561904 5 -0.439155 -0.406826
18 -0.312247 -1.060367 5 0.146385 -1.024080
19 0.823195 -0.154070 5 1.317465 0.098199