Pandas: apply a function row-wise and create multiple new columns
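What is the best way to apply a row-wise function and create multiple new columns?
I have two dataframes and working code, but it is most likely not optimal.
df1 (the dataframe has thousands of rows and xx columns):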
sic | data1 | data2 | data3 | data4 | data5 |
---|---|---|---|---|---|
5 | 0.90783598 | 0.84722083 | 0.47149924 | 0.98724123 | 0.50654476 |
6 | 0.53442684 | 0.59730371 | 0.92486887 | 0.61531646 | 0.62784041 |
3 | 0.56806423 | 0.09619383 | 0.33846097 | 0.71878313 | 0.96316724 |
8 | 0.86933042 | 0.64965755 | 0.94549745 | 0.08866519 | 0.92156389 |
12 | 0.651328 | 0.37193774 | 0.9679044 | 0.36898991 | 0.15161838 |
6 | 0.24555531 | 0.50195983 | 0.79114578 | 0.9290596 | 0.10672607 |
df2 (the column headers map to the sic codes in df1. There are 12 sic codes in total, and the dataframe is several thousand rows long):
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
c_bar | 0.4955329 | 0.92970292 | 0.68049726 | 0.91325006 | 0.55578465 | 0.78056519 | 0.53954711 | 0.90335326 | 0.93986402 | 0.0204794 | 0.51575764 | 0.61144255 |
a1_bar | 0.75781444 | 0.81052669 | 0.99910449 | 0.62181902 | 0.11797144 | 0.40031316 | 0.08561665 | 0.35296894 | 0.14445697 | 0.93799762 | 0.80641802 | 0.31379671 |
a2_bar | 0.41432552 | 0.36313911 | 0.13091618 | 0.39251953 | 0.66249636 | 0.31221897 | 0.15988528 | 0.1620938 | 0.55143589 | 0.66571044 | 0.68198944 | 0.23806947 |
a3_bar | 0.38918855 | 0.83689178 | 0.15838139 | 0.39943204 | 0.48615188 | 0.06299899 | 0.86343819 | 0.47975619 | 0.05300611 | 0.15080875 | 0.73088725 | 0.3500239 |
a4_bar | 0.47201384 | 0.90874121 | 0.50417142 | 0.70047698 | 0.24820601 | 0.34302454 | 0.4650635 | 0.0992668 | 0.55142391 | 0.82947194 | 0.28251699 | 0.53170308 |
I get the result I need with the following code:
import numpy as np

ind_list = np.arange(1, 13)  # Create list of industries (sic codes 1 to 12)

# Each helper looks up one row of mlev_mean for the sic code of the given row.
def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const', i]

def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar', i]

def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar', i]

def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar', i]

def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar', i]

mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')
The output looks like this:
sic | data1 | data2 | data3 | data4 | c_bar | a1_bar | a2_bar | a3_bar | a4_bar |
---|---|---|---|---|---|---|---|---|---|
5 | 0.10316948 | 0.61408639 | 0.04042675 | 0.79255749 | 0.56357931 | 0.42920472 | 0.20701581 | 0.67639811 | 0.37778029 |
6 | 0.5730904 | 0.16753145 | 0.27835136 | 0.00178992 | 0.51793793 | 0.06772307 | 0.15084885 | 0.12451806 | 0.33114948 |
3 | 0.87710893 | 0.66834187 | 0.14286608 | 0.12609769 | 0.75873957 | 0.72586804 | 0.6081763 | 0.14598001 | 0.21557266 |
8 | 0.24565579 | 0.56195558 | 0.93316676 | 0.20988936 | 0.67404545 | 0.65221594 | 0.79758557 | 0.67093021 | 0.33400764 |
12 | 0.79703344 | 0.61066111 | 0.94602909 | 0.56218703 | 0.92384307 | 0.30836159 | 0.72521994 | 0.00795362 | 0.76348227 |
6 | 0.86604791 | 0.28454782 | 0.97229172 | 0.21853932 | 0.75650652 | 0.40788056 | 0.53233553 | 0.60326386 | 0.27399405 |
The cell values in the example are randomly generated, but the point is to map based on the sic code and add the rows of df2 to df1 as new columns.
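Try transposing df2 and applying a transformation to it. Transposing a dataframe turns its rows into columns.
df2_tr = df2.T.apply(mapFunc, axis=0)  # mapFunc stands for your own column-wise function
Then you can use df1 = pd.concat([df1, df2_tr], axis=1) to concatenate the transformed columns of df2 with the columns of df1.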
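To make the effect of the transpose concrete, here is a minimal sketch on a made-up two-code lookup table. The name `small` and its values are hypothetical and only illustrate the shape of df2:

```python
import pandas as pd

# Hypothetical miniature of df2: statistics as rows, sic codes as columns.
small = pd.DataFrame(
    {5: [0.55, 0.11], 6: [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

small_t = small.T  # rows become columns and columns become rows

print(small_t.index.tolist())    # the sic codes are now the row labels
print(small_t.columns.tolist())  # the statistic names are now the column labels
print(small_t.loc[5, "c_bar"])   # same cell as small.loc["c_bar", 5]
```

After the transpose, each sic code owns one row, which is the layout needed to look rows up by code.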
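To do this, you need to:
1. Transpose df2 so that its columns line up correctly for the concatenation
2. Index it with the df1["sic"] column to get the right rows
3. Reset the index of the rows taken from df2 with .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index, e.g. 5, 6, 3, 8, 12, 6, with a new one, e.g. 0, 1, 2, 3, 4, 5, while keeping the actual values the same, so pandas does not get confused when concatenating them.)
4. Concatenate the two dataframes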
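Note: I read in the dataframes in a way that leaves df2's column labels as strings, while the values in df1's sic column are integers. I therefore use .astype(str) to make step 2 work. If that is not the case for your actual data, you may need to remove .astype(str).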
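If you would rather keep the sic codes as integers, one option is to convert df2's column labels instead of casting df1["sic"]. This is a minimal sketch, assuming every column label of df2 is a numeric string such as "5"; the small frames and their values are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"sic": [5, 6, 6], "data1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame(
    {"5": [0.55, 0.11], "6": [0.78, 0.40]},  # string labels, as read from text
    index=["c_bar", "a1_bar"],
)

# Convert the labels "5", "6" to the integers 5, 6.
df2.columns = df2.columns.astype(int)

# With integer labels on both sides, .astype(str) is not needed for the lookup.
rows = df2.T.loc[df1["sic"]]
print(rows)
```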
Here is a one-liner that performs these operations:
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
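For readers who prefer to see the intermediate results, the same one-liner can be split into its four steps. This is only a sketch on made-up data: the frames are small stand-ins for df1 and df2, and the names `lookup`, `picked`, and `picked_labels` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"sic": [5, 6, 3, 6], "data1": [0.9, 0.5, 0.6, 0.2]})
df2 = pd.DataFrame(
    {"3": [0.68, 1.00], "5": [0.56, 0.12], "6": [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

lookup = df2.T                               # step 1: one row per sic code
picked = lookup.loc[df1["sic"].astype(str)]  # step 2: one lookup row per row of df1
picked_labels = picked.index.tolist()        # row labels are the sic codes, repeated as needed
print(picked_labels)

picked = picked.reset_index(drop=True)       # step 3: relabel the rows 0..n-1, like df1
merged = pd.concat([df1, picked], axis=1)    # step 4: place the frames side by side
print(merged)
```

Printing `picked_labels` before step 3 shows why the reset is needed: the rows carry sic codes as labels, which do not match the 0..n-1 labels of df1.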
Here is the full code I used:
from io import StringIO
import pandas as pd

# sep=r"\s+" splits on any run of whitespace, so the tables parse
# whether the columns are separated by tabs or by spaces.
df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep=r"\s+")

# The header row has one field fewer than the data rows, so pandas
# uses the first column (c_bar, a1_bar, ...) as the index.
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep=r"\s+")

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
print(merged)
This produces the output:
sic data1 data2 data3 ... a1_bar a2_bar a3_bar a4_bar
0 5 0.907836 0.847221 0.471499 ... 0.117971 0.662496 0.486152 0.248206
1 6 0.534427 0.597304 0.924869 ... 0.400313 0.312219 0.062999 0.343025
2 3 0.568064 0.096194 0.338461 ... 0.999104 0.130916 0.158381 0.504171
3 8 0.869330 0.649658 0.945497 ... 0.352969 0.162094 0.479756 0.099267
4 12 0.651328 0.371938 0.967904 ... 0.313797 0.238069 0.350024 0.531703
5 6 0.245555 0.501960 0.791146 ... 0.400313 0.312219 0.062999 0.343025
[6 rows x 11 columns]
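As an alternative to the pd.concat approach, the same lookup can also be written as a left join with DataFrame.merge, which avoids the separate reset_index step. This is a sketch of a different technique, not the method used in the answer above. It assumes df2's column labels are numeric strings, and the small frames below are made-up stand-ins for the real data:

```python
import pandas as pd

df1 = pd.DataFrame({"sic": [5, 6, 3, 6], "data1": [0.9, 0.5, 0.6, 0.2]})
df2 = pd.DataFrame(
    {"3": [0.68, 1.00], "5": [0.56, 0.12], "6": [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

lookup = df2.T                           # one row per sic code
lookup.index = lookup.index.astype(int)  # match the integer dtype of df1["sic"]

# Left join: every row of df1 is kept and the matching lookup row is appended.
merged = df1.merge(lookup, left_on="sic", right_index=True, how="left")
print(merged)
```

With how="left", a sic code that has no column in df2 would give NaN in the new columns rather than raising an error, which may or may not be what you want.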