Pandas: apply a function row-wise and create multiple new columns

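What is the best way to apply a row-wise function and create multiple new columns?

I have two data frames and working code, but it is most likely not optimal.

df1 (the data frame has several thousand rows and xx columns):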

sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607

df2 (the column headers map to the sic codes in df1; there are 12 sic codes in total, and the data frame is several thousand rows long):

1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308

I got the result I need with the code below:

import numpy as np

ind_list = np.arange(1, 13)  # industry (sic) codes 1 through 12


def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const',i]


def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar',i]


def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar',i]


def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar',i]


def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar',i]
            
mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')        
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')

The output looks like this:

sic data1 data2 data3 data4 c_bar a1_bar a2_bar a3_bar a4_bar
5 0.10316948 0.61408639 0.04042675 0.79255749 0.56357931 0.42920472 0.20701581 0.67639811 0.37778029
6 0.5730904 0.16753145 0.27835136 0.00178992 0.51793793 0.06772307 0.15084885 0.12451806 0.33114948
3 0.87710893 0.66834187 0.14286608 0.12609769 0.75873957 0.72586804 0.6081763 0.14598001 0.21557266
8 0.24565579 0.56195558 0.93316676 0.20988936 0.67404545 0.65221594 0.79758557 0.67093021 0.33400764
12 0.79703344 0.61066111 0.94602909 0.56218703 0.92384307 0.30836159 0.72521994 0.00795362 0.76348227
6 0.86604791 0.28454782 0.97229172 0.21853932 0.75650652 0.40788056 0.53233553 0.60326386 0.27399405

The cell values in the example are randomly generated; the point is to map on the sic code and add the rows of df2 to df1 as new columns.
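For reference, the five near-identical functions above can be collapsed into one row-wise apply that returns a Series, which pandas expands into one new column per label. This is only a minimal sketch: the small frames and the helper name lookup are made up for illustration, and it assumes df2's column labels are integers.

```python
import pandas as pd

# Toy stand-ins for the question's data (values are made up for illustration).
df1 = pd.DataFrame({"sic": [5, 6, 3, 6], "data1": [0.9, 0.5, 0.6, 0.2]})
df2 = pd.DataFrame(
    {3: [0.68, 0.99], 5: [0.56, 0.12], 6: [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

def lookup(row):
    # Hypothetical helper: df2[code] is the column for that sic code, i.e. a
    # Series labelled c_bar, a1_bar, ... that apply() turns into new columns.
    # int() is needed because the row is upcast to float when dtypes are mixed.
    return df2[int(row["sic"])]

new_cols = df1.apply(lookup, axis=1)  # one column per label in df2's index
result = df1.join(new_cols)
print(result)
```

This still calls a Python function once per row, so on thousands of rows it will likely be slower than an index-based lookup.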

Try transposing df2 and applying the transformation to it. Transposing a data frame turns its rows into columns.

df2_tr = df2.T.apply(lambda col: mapFunc(col), axis=0)  # mapFunc is a placeholder for your own per-column function

Then you can use df1 = pd.concat([df1, df2_tr], axis=1). Note that pd.concat aligns rows by index label, so the two indexes need to match.
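A minimal sketch of the transpose-then-concat idea above, assuming toy values and leaving out the placeholder transformation. It shows that concatenating along axis=1 matches rows by index label rather than by position:

```python
import pandas as pd

# Toy df2: columns are sic codes (as strings), rows are the *_bar quantities.
df2 = pd.DataFrame({"3": [0.68, 0.99], "5": [0.56, 0.12]}, index=["c_bar", "a1_bar"])
df2_tr = df2.T  # now one row per sic code, one column per *_bar quantity

# A frame whose index holds the same labels, deliberately in a different order.
left = pd.DataFrame({"data1": [0.9, 0.6]}, index=["5", "3"])

combined = pd.concat([left, df2_tr], axis=1)
print(combined.loc["5"])  # data1 for "5" sits next to the *_bar values for "5"
```

Because of this alignment, a df1 with a default 0, 1, 2, ... index would not line up with a df2_tr indexed by sic code unless the rows are selected and re-indexed first.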

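Concatenate the columns of the transposed df2 with the columns of df1.

To do this, you need to:

  1. Transpose df2 so that its columns line up correctly for the concatenation
  2. Index it with the df1["sic"] column to get the right rows
  3. Reset the index of the rows taken from df2 with .reset_index(drop=True) so the data frames concatenate correctly. (This replaces the current index, e.g. 5, 6, 3, 8, 12, 6, with a new one, e.g. 0, 1, 2, 3, 4, 5, while leaving the actual values unchanged, so pandas does not get confused when concatenating them.)
  4. Concatenate the two data frames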
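The four steps can be sketched one at a time as follows. The small frames and the intermediate names by_code and picked are made up for illustration, and df2's column labels are assumed to be strings.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (values are made up for illustration).
df1 = pd.DataFrame({"sic": [5, 6, 3, 6], "data1": [0.9, 0.5, 0.6, 0.2]})
df2 = pd.DataFrame(
    {"3": [0.68, 0.99], "5": [0.56, 0.12], "6": [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

# 1. Transpose: rows become sic codes, columns become c_bar, a1_bar.
by_code = df2.T

# 2. Select one row per entry of df1["sic"]; repeated codes repeat the row.
picked = by_code.loc[df1["sic"].astype(str)]

# 3. Replace the sic-code index with 0, 1, 2, ... so it matches df1's index.
picked = picked.reset_index(drop=True)

# 4. Put the two frames side by side.
merged = pd.concat([df1, picked], axis=1)
print(merged)
```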

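Note: the way I read in the data frames leaves df2's column labels as strings, while the values in df1's sic column are integers. That is why I use .astype(str) to make step 2 work. If that is not the case in your actual data, you may need to remove .astype(str).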
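If it is unclear which situation applies, the label types can be checked directly. The sketch below uses made-up toy frames and shows the opposite conversion as one possible alternative: casting df2's column labels to integers once, so that .astype(str) is not needed in the lookup.

```python
import pandas as pd

# Toy frames: df2's column labels are strings, df1["sic"] holds integers.
df1 = pd.DataFrame({"sic": [5, 3]})
df2 = pd.DataFrame({"3": [0.68], "5": [0.56]}, index=["c_bar"])

# Inspect what the labels actually are before choosing a conversion.
labels_are_strings = all(isinstance(label, str) for label in df2.columns)
print(labels_are_strings, df1["sic"].dtype)

# Alternative to .astype(str): convert the labels to integers up front.
if labels_are_strings:
    df2.columns = df2.columns.astype(int)

picked = df2.T.loc[df1["sic"]].reset_index(drop=True)
print(picked)
```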

Here is a one-liner that performs these steps:

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

Here is the full code I used:

from io import StringIO
import pandas as pd

# The sample data below is whitespace-separated, so split on runs of whitespace.
df1 = pd.read_csv(StringIO("""
sic data1   data2   data3   data4   data5
5   0.90783598  0.84722083  0.47149924  0.98724123  0.50654476
6   0.53442684  0.59730371  0.92486887  0.61531646  0.62784041
3   0.56806423  0.09619383  0.33846097  0.71878313  0.96316724
8   0.86933042  0.64965755  0.94549745  0.08866519  0.92156389
12  0.651328    0.37193774  0.9679044   0.36898991  0.15161838
6   0.24555531  0.50195983  0.79114578  0.9290596   0.10672607
"""), sep=r"\s+")
# The header row has one field fewer than the data rows, so pandas uses the
# first column (c_bar, a1_bar, ...) as the index.
df2 = pd.read_csv(StringIO("""
    1   2   3   4   5   6   7   8   9   10  11  12
c_bar   0.4955329   0.92970292  0.68049726  0.91325006  0.55578465  0.78056519  0.53954711  0.90335326  0.93986402  0.0204794   0.51575764  0.61144255
a1_bar  0.75781444  0.81052669  0.99910449  0.62181902  0.11797144  0.40031316  0.08561665  0.35296894  0.14445697  0.93799762  0.80641802  0.31379671
a2_bar  0.41432552  0.36313911  0.13091618  0.39251953  0.66249636  0.31221897  0.15988528  0.1620938   0.55143589  0.66571044  0.68198944  0.23806947
a3_bar  0.38918855  0.83689178  0.15838139  0.39943204  0.48615188  0.06299899  0.86343819  0.47975619  0.05300611  0.15080875  0.73088725  0.3500239
a4_bar  0.47201384  0.90874121  0.50417142  0.70047698  0.24820601  0.34302454  0.4650635   0.0992668   0.55142391  0.82947194  0.28251699  0.53170308
"""), sep=r"\s+")

# Look up each row's sic code in the transposed df2, then attach the result to df1.
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

print(merged)

This produces the output:

   sic     data1     data2     data3  ...    a1_bar    a2_bar    a3_bar    a4_bar
0    5  0.907836  0.847221  0.471499  ...  0.117971  0.662496  0.486152  0.248206
1    6  0.534427  0.597304  0.924869  ...  0.400313  0.312219  0.062999  0.343025
2    3  0.568064  0.096194  0.338461  ...  0.999104  0.130916  0.158381  0.504171
3    8  0.869330  0.649658  0.945497  ...  0.352969  0.162094  0.479756  0.099267
4   12  0.651328  0.371938  0.967904  ...  0.313797  0.238069  0.350024  0.531703
5    6  0.245555  0.501960  0.791146  ...  0.400313  0.312219  0.062999  0.343025

[6 rows x 11 columns]
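A different route to the same result, not the approach used above, is DataFrame.join with the on parameter, which matches a column of df1 against the index of the transposed df2. The sketch below uses made-up toy frames and assumes df2's column labels are integers so they compare equal to the sic values.

```python
import pandas as pd

# Toy frames; df1 gets a non-default index to show that it is preserved.
df1 = pd.DataFrame(
    {"sic": [5, 6, 3, 6], "data1": [0.9, 0.5, 0.6, 0.2]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame(
    {3: [0.68, 0.99], 5: [0.56, 0.12], 6: [0.78, 0.40]},
    index=["c_bar", "a1_bar"],
)

# Left join: each row of df1 picks up the df2.T row whose index equals its sic.
merged = df1.join(df2.T, on="sic")
print(merged)
```

With this variant df1 keeps its own index, so no reset_index step is needed, and a sic code that is missing from df2 yields NaN in the new columns instead of raising a KeyError.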