Is there a better vectorized solution to writing into a dataframe using the index and columns defined as variables in a second dataframe?

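I have a base dataframe that I want to fill in at certain indices and columns defined by a second dataframe, each row of which should drive one change in the base df. The index of the second dataframe, df_idx, gives the rows of df_base I am interested in; df_idx also contains the start and end columns to fill, and the value to write. df_base looks like this: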

import pandas as pd
import numpy as np

months = list(range(1, 13))
li = list(map(str, months))
cols = ['ID']
cols.extend(li)

df_base = pd.DataFrame(np.random.randint(0,1000,size=(5, 13)), columns=cols)
df_base.loc[[1,2],'1':'12'] = np.nan
df_base.loc[4,'7':'12'] = np.nan

    ID      1      2      3      4      5      6      7      8      9     10     11     12
0  328   45.0  226.0  388.0  286.0  557.0  930.0  234.0  418.0  863.0  500.0  232.0  116.0
1  340    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2  865    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
3  293  772.0  185.0    6.0  284.0  522.0  826.0  995.0  370.0   87.0  668.0  469.0   40.0
4  313  947.0  272.0  936.0  501.0  241.0  731.0    NaN    NaN    NaN    NaN    NaN    NaN

df_idx below shows that for index 1, the val 210 will be filled along that row of df_base from column '1' through column '12':

df_idx = pd.DataFrame({'start': [1, 2, 3],
                       'end': [12, 10, 11],
                       'val':np.random.randint(0,1000,size=(1, 3))[0]},
                      index=[1,2,4])

   start  end  val
1      1   12  210
2      2   10  663
4      3   11  922

I could iterate over the rows like this, but I would rather not, since df_base may have more than 250,000 rows:

for idx, row in df_idx.iterrows():
    mntStrt = str(row['start'])
    mnthEnd = str(row['end'])
    df_base.loc[idx, mntStrt:mnthEnd] = row['val']

Or, what I am leaning toward now, using the pandas apply function:

def inputVals(x):
    idx = x.name
    mntStrt = str(x['start'])
    mnthEnd = str(x['end'])
    df_base.loc[idx, mntStrt:mnthEnd] = x['val']

df_idx.apply(inputVals, axis=1)

The resulting dataframe looks like this:

    ID      1      2      3      4      5      6      7      8      9     10     11     12
0  947  537.0  827.0  477.0   39.0  586.0  370.0  576.0  556.0  119.0  158.0  990.0  958.0
1  157  129.0  129.0  129.0  129.0  129.0  129.0  129.0  129.0  129.0  129.0  129.0  129.0
2  545    NaN  849.0  849.0  849.0  849.0  849.0  849.0  849.0  849.0  849.0    NaN    NaN
3  549  835.0  205.0  158.0  499.0  451.0  887.0  145.0    6.0  518.0  385.0   34.0  613.0
4   57  673.0   55.0  925.0  925.0  925.0  925.0  925.0  925.0  925.0  925.0  925.0    NaN

I feel there is a more efficient way to do this; any insight or criticism is welcome. Thanks!

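One approach is to reshape df_idx into a dataframe that has the same index as df_idx, the columns 1 to 12, and the new values in the right positions. To do this, you can use numpy broadcasting to compare the start and end columns against the numbers 1 to 12, multiply the resulting boolean mask by the val column, and set the index and columns as needed. So: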

# set seed for reproducibility with df_idx
np.random.seed(1)

tmp = pd.DataFrame(
    data=((df_idx['start'].to_numpy()[:, None] <= np.arange(1, 13))
          & (df_idx['end'].to_numpy()[:, None] >= np.arange(1, 13)))
         * df_idx['val'].to_numpy()[:, None],
    index=df_idx.index,
    columns=li
).replace(0, np.nan)  # cells outside [start, end] are 0 after the multiply; turn them into NaN
# note: a genuine val of 0 would also become NaN here

print(tmp)
      1      2    3    4    5    6    7    8    9   10     11    12
1  37.0   37.0   37   37   37   37   37   37   37   37   37.0  37.0
2   NaN  235.0  235  235  235  235  235  235  235  235    NaN   NaN
4   NaN    NaN  908  908  908  908  908  908  908  908  908.0   NaN
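
To make the broadcasting step easier to follow, here is a minimal standalone sketch of the same mask. It swaps the multiply-then-replace step for np.where, which is not what the code above does but should keep a val that happens to be 0. The names df_idx_demo, month_numbers, in_range and tmp_demo, and the fixed values, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_idx with fixed values (note the val of 0 in row 2).
df_idx_demo = pd.DataFrame({'start': [1, 2, 3],
                            'end': [12, 10, 11],
                            'val': [37, 0, 908]},
                           index=[1, 2, 4])
month_numbers = np.arange(1, 13)

# [:, None] turns each column into shape (3, 1) so it broadcasts against (12,).
start = df_idx_demo['start'].to_numpy()[:, None]
end = df_idx_demo['end'].to_numpy()[:, None]
val = df_idx_demo['val'].to_numpy()[:, None]

# Boolean mask of shape (3, 12): True where start <= month <= end.
in_range = (start <= month_numbers) & (month_numbers <= end)

# np.where takes val inside the range and NaN outside, so a real 0 survives.
tmp_demo = pd.DataFrame(np.where(in_range, val, np.nan),
                        index=df_idx_demo.index,
                        columns=[str(m) for m in month_numbers])
print(tmp_demo)
```

Because NaN only ever marks "outside the range" here, tmp_demo can be passed to update in the same way as tmp.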

Now you can use update to set the new values in df_base:

df_base.update(tmp, overwrite=True)  # update modifies df_base in place, so no reassignment is needed
# set overwrite=False if only the NaN values in df_base should be filled
print(df_base)
    ID      1      2      3      4      5      6      7      8      9     10     11     12
0   72  767.0  905.0  715.0  645.0  847.0  960.0  144.0  129.0  972.0  583.0  749.0  508.0
1  390   37.0   37.0   37.0   37.0   37.0   37.0   37.0   37.0   37.0   37.0   37.0   37.0
2  398    NaN  235.0  235.0  235.0  235.0  235.0  235.0  235.0  235.0  235.0    NaN    NaN
3  319  829.0  534.0  313.0  513.0  896.0  316.0  209.0  264.0  728.0  653.0  627.0  431.0
4  633  456.0  542.0  908.0  908.0  908.0  908.0  908.0  908.0  908.0  908.0  908.0    NaN
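
If you would rather skip the intermediate dataframe, the same mask can probably be written straight into the target rows. The sketch below is one possible variant, not the method above: fill_ranges is a hypothetical helper, the data is fixed sample data rather than the random frames from the question, and it assumes df_idx has a unique index whose labels all exist in df_base. It also compares the result against the row-by-row loop from the question.

```python
import numpy as np
import pandas as pd

month_cols = [str(m) for m in range(1, 13)]

# Sample data shaped like the question's frames (values are made up).
rng = np.random.default_rng(0)
df_base = pd.DataFrame(rng.integers(0, 1000, size=(5, 12)).astype(float),
                       columns=month_cols)
df_base.insert(0, 'ID', rng.integers(0, 1000, size=5))
df_base.loc[[1, 2], '1':'12'] = np.nan
df_base.loc[4, '7':'12'] = np.nan

df_idx = pd.DataFrame({'start': [1, 2, 3],
                       'end': [12, 10, 11],
                       'val': [37, 235, 908]},
                      index=[1, 2, 4])


def fill_ranges(base, idx_df, cols):
    """Return a copy of base with idx_df's val written into each start..end range."""
    months = np.array([int(c) for c in cols])
    start = idx_df['start'].to_numpy()[:, None]
    end = idx_df['end'].to_numpy()[:, None]
    val = idx_df['val'].to_numpy()[:, None]
    in_range = (start <= months) & (months <= end)

    out = base.copy()
    # Current values of the targeted rows; kept wherever the mask is False.
    current = out.loc[idx_df.index, cols].to_numpy()
    out.loc[idx_df.index, cols] = np.where(in_range, val, current)
    return out


result = fill_ranges(df_base, df_idx, month_cols)

# Cross-check against the iterrows loop from the question.
expected = df_base.copy()
for idx, row in df_idx.iterrows():
    expected.loc[idx, str(row['start']):str(row['end'])] = row['val']

assert result.equals(expected)
print(result)
```

Cells outside a row's range keep whatever df_base already had, which matches what the loop does and what update does when it skips the NaN cells of tmp.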