pandas: iterate through two DataFrames based on multiple conditions

I loaded the following two DataFrames via pandas read_csv.

Edit: I added the DataFrame constructors for ease of use.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
df2 = pd.DataFrame(columns=['var_id', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

influence_factor = 10

# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)

# df2 construct
df2['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df2 = df2.fillna(1.0)

# Split each function list into separate columns so the list entries are easier to handle
function_split = df1.functions.apply(pd.Series)
function_split = function_split.dropna(how='all')
df1 = df1.join(function_split)
df1_col = ['var_id', 'functions', 'num_functions', 'influence',  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df1 = df1.reindex(columns=df1_col)

## Remove NaN
df1 = df1.fillna('')

df1

        var_id      functions                   num_functions   influence   0   1   2   3   4   5   6   7   8   9   10  11  12  13
0       AA_ABC006   [1, 2, 5, 11, 12, 16, 17]   7               1.429       1   2   5   7   9                               
1       AA_ABC006   [10, 11]                    2               5.000       4   8                                               
2       AA_ABC006   [11, 19]                    2               5.000       1   2                                               
3       AA_ABC006   [19]                        1               10.00       9                                                   
4       AA_ABC005   [11]                        1               10.00       0                                                   
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1964    XX_ABC004   [2, 11, 20]                 3               3.333       2   11  20                                          
1965    XX_ABC003   [19]                        1               10.000      19                                                  
1966    XX_ABC003   [2, 11, 20]                 3               3.333       2   11  20                                          
1967    XX_ABC004   [2, 11, 20]                 3               3.333       2   11  20                                          
1968    XX_ABC003   [2, 11, 20]                 3               3.333       2   11  20                                          

df2 (where the header numbers correspond to the function numbers in df1)

            0   1   2   3   4   5   6   7   8   9   ... 15  16  17  18  19  20  21  22  23  24
AA_ABC006   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
AA_ABC005   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
XX_ABC004   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
XX_ABC003   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

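What I want to achieve is a function that iterates over both DataFrames and matches up the functions. So for DF1 var_id AA_ABC006 with functions 1, 2, 5, etc., it should take 1.429% off the 1.0 in columns 1, 2, 5, etc. of DF2, and do the same for all rows/columns.

The expected result for this example is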

DF_result

              0     1     2     3     4     5     6     7     8     9    15    16    17    18    19    20    21    22    23    24
AA_ABC006  1,00  0,99  0,99  1,00  1,00  0,99  1,00  1,00  1,00  1,00  1,00  0,99  0,99  1,00  0,85  1,00  1,00  1,00  1,00  1,00
AA_ABC005  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00
XX_ABC004  1,00  1,00  0,93  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  0,95  0,93  1,00  1,00  1,00  1,00
XX_ABC003  1,00  1,00  0,93  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  1,00  0,93  1,00  1,00  1,00  1,00
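
As a sanity check on the expected values for AA_ABC006, here is a minimal sketch of the arithmetic. It assumes the influence is subtracted as percentage points (value minus influence / 100) and that a function appearing in several rows accumulates the influence of each row. The names `rows`, `total` and `expected` are illustrative only and are not part of the original code.

```python
influence_factor = 10

# Function lists of the df1 rows that belong to var_id AA_ABC006
# (taken from the constructor above).
rows = [[1, 2, 5, 11, 12, 16, 17], [10, 11], [11, 19], [19]]

# Accumulate the influence (in percent) that each function number receives.
total = {}
for functions in rows:
    influence = influence_factor / len(functions)
    for f in functions:
        total[f] = total.get(f, 0.0) + influence

# Assumption: the accumulated percentage is subtracted from the starting 1.0.
expected = {f: round(1.0 - pct / 100, 2) for f, pct in total.items()}
print(expected[1], expected[19])
```

Under these assumptions, function 1 receives 10/7 percent and function 19 receives 5 + 10 percent, which is in line with the 0,99 and 0,85 shown for AA_ABC006 above.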

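If I understand correctly, the var_id values in df2 should be unique, so I changed the constructor for this DataFrame (var_id ends up as the index).

You can use concat:

Build a DataFrame of the values to subtract, one piece per row of df1: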
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])

influence_factor = 10

# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)

# df2 construct
ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)

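# One single-row frame per df1 row: the columns are that row's function numbers,
# the values are the row's influence, and the index is its var_id.
df_tmp = pd.concat([
    pd.DataFrame([row['influence'] * np.ones(row['num_functions'])],
                 columns=row['functions'], index=[row['var_id']])
    for _, row in df1.iterrows()
])
# Sum the influence per var_id, convert percent to a fraction, and subtract.
# Columns that no function touches come out as NaN and are reset to 1.0.
print((df2 - df_tmp.groupby(level=0).sum() / 100).fillna(1.0).round(decimals=2))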

Output:

            0     1     2    3    4     5    6    7    8    9   ...   15    16    17   18    19   20   21   22   23   24
AA_ABC003  1.0  1.00  1.00  1.0  1.0  1.00  1.0  1.0  1.0  1.0  ...  1.0  1.00  1.00  1.0  0.95  1.0  1.0  1.0  1.0  1.0
AA_ABC004  1.0  1.00  1.00  1.0  1.0  1.00  1.0  1.0  1.0  1.0  ...  1.0  1.00  1.00  1.0  0.85  1.0  1.0  1.0  1.0  1.0
AA_ABC005  1.0  1.00  0.90  1.0  1.0  1.00  1.0  1.0  1.0  1.0  ...  1.0  1.00  1.00  1.0  0.85  1.0  1.0  1.0  1.0  1.0
AA_ABC006  1.0  0.99  0.99  1.0  1.0  0.99  1.0  1.0  1.0  1.0  ...  1.0  0.99  0.99  1.0  0.85  1.0  1.0  1.0  1.0  1.0
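
A possible alternative that avoids building one small frame per row is sketched below. It assumes pandas 0.25 or later (for `DataFrame.explode`) and the same additive reading of the influence as above. The names `long_df`, `penalty` and `result` are my own and not part of the original code. Reindexing the penalty to the shape of df2 also removes the need for `fillna(1.0)`, which only works while every untouched cell of df2 is 1.0.

```python
import numpy as np
import pandas as pd

influence_factor = 10

df1 = pd.DataFrame({
    'var_id': ['AA_ABC006', 'AA_ABC006', 'AA_ABC006', 'AA_ABC006',
               'AA_ABC005', 'AA_ABC005', 'AA_ABC005', 'AA_ABC005',
               'AA_ABC004', 'AA_ABC004', 'AA_ABC003'],
    'functions': [[1, 2, 5, 11, 12, 16, 17], [10, 11], [11, 19], [19],
                  [11], [2], [11, 19], [19], [19], [11, 19], [11, 19]],
})
df1['num_functions'] = df1['functions'].map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(2)

ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)

# One row per (var_id, function) pair instead of one list per row.
long_df = df1.explode('functions')
long_df['functions'] = long_df['functions'].astype(int)

# Total influence per var_id and function, laid out like df2.
penalty = (
    long_df.groupby(['var_id', 'functions'])['influence'].sum()
    .unstack(fill_value=0.0)
    .reindex(index=df2.index, columns=df2.columns, fill_value=0.0)
)

# Convert percent to a fraction and subtract from df2.
result = (df2 - penalty / 100).round(2)
print(result)
```

The rows of `result` follow the order of `df2.index` rather than being sorted, so they appear in a different order than in the output above, but the values per var_id should be the same.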