pandas: iterate through two DataFrames based on multiple conditions
I loaded the following two DataFrames via pandas read_csv.
Edit: I added the DataFrame constructors below to make the example easier to reproduce.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
df2 = pd.DataFrame(columns=['var_id', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
df2['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df2 = df2.fillna(1.0)
# Split of function list to have easier handling of list in columns of lists
function_split = df1.functions.apply(pd.Series)
function_split = function_split.dropna(how='all')
df1 = df1.join(function_split)
df1_col = ['var_id', 'functions', 'num_functions', 'influence', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df1 = df1.reindex(columns=df1_col)
## Remove NaN
df1 = df1.fillna('')
df1
var_id functions num_functions influence 0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 AA_ABC006 [1, 2, 5, 11, 12, 16, 17] 7 1.429 1 2 5 7 9
1 AA_ABC006 [10, 11] 2 5.000 4 8
2 AA_ABC006 [11, 19] 2 5.000 1 2
3 AA_ABC006 [19] 1 10.00 9
4 AA_ABC005 [11] 1 10.00 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1964 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1965 XX_ABC003 [19] 1 10.000 19
1966 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
1967 XX_ABC004 [2, 11, 20] 3 3.333 2 11 20
1968 XX_ABC003 [2, 11, 20] 3 3.333 2 11 20
df2 (the numbers in the header correspond to the function numbers in df1)
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC006 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
XX_ABC004 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
XX_ABC003 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
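What I want is a function that iterates over both DataFrames and matches up the functions. For example, for df1 var_id AA_ABC006 with functions 1, 2, 5, etc., it should take 1.429% off the 1.0 in columns 1, 2, 5, etc. of df2, and this should be applied to all rows and columns.
The expected result for this example is: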
DF_result
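var_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AA_ABC006 | 1,00 | 0,99 | 0,99 | 1,00 | 1,00 | 0,99 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 0,99 | 0,99 | 1,00 | 0,85 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
AA_ABC005 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
XX_ABC004 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 0,95 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |
XX_ABC003 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | … | 1,00 | 1,00 | 1,00 | 1,00 | 1,00 | 0,93 | 1,00 | 1,00 | 1,00 | 1,00 |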
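IIUC, the var_id values in df2 should be unique, so I changed the constructor for that DataFrame (var_id ends up as the index).
You can use concat to build, from the rows of df1, a DataFrame of the values to subtract: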
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['var_id', 'functions', 'num_functions', 'influence'])
influence_factor = 10
# df1 construct
df1['functions'] = [[1, 2, 5, 11, 12, 16, 17],[10, 11],[11, 19],[19],[11],[2],[11, 19],[19],[19],[11, 19],[11, 19]]
df1['var_id'] = ['AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC006','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC005','AA_ABC004','AA_ABC004','AA_ABC003']
df1['num_functions'] = df1.functions.map(len)
df1['influence'] = (influence_factor / df1['num_functions']).round(decimals=2)
# df2 construct
ids = df1['var_id'].unique()
df2 = pd.DataFrame(np.ones((len(ids), 25)), index=ids)
df_tmp = pd.concat([
    pd.DataFrame([row['influence']*np.ones(row['num_functions'])],
                 columns=row['functions'], index=[row['var_id']])
    for _, row in df1.iterrows()
])
print((df2 - df_tmp.groupby(level=0).sum()/100).fillna(1.0).round(decimals=2))
Output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
AA_ABC003 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.95 1.0 1.0 1.0 1.0 1.0
AA_ABC004 1.0 1.00 1.00 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC005 1.0 1.00 0.90 1.0 1.0 1.00 1.0 1.0 1.0 1.0 ... 1.0 1.00 1.00 1.0 0.85 1.0 1.0 1.0 1.0 1.0
AA_ABC006 1.0 0.99 0.99 1.0 1.0 0.99 1.0 1.0 1.0 1.0 ... 1.0 0.99 0.99 1.0 0.85 1.0 1.0 1.0 1.0 1.0
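For the full data (about 2000 rows), building one small DataFrame per row with iterrows may get slow. A possible alternative, sketched here under the assumption that a pandas version with DataFrame.explode (0.25 or later) is available, is to explode the functions lists into one row per (var_id, function) pair and sum the influence per pair. The names long_df, penalty and result are only illustrative, and the 25 function columns are taken from the df2 layout above.

```python
import pandas as pd

influence_factor = 10

# Same sample data as in the question.
df1 = pd.DataFrame({
    "var_id": ["AA_ABC006"] * 4 + ["AA_ABC005"] * 4 + ["AA_ABC004"] * 2 + ["AA_ABC003"],
    "functions": [[1, 2, 5, 11, 12, 16, 17], [10, 11], [11, 19], [19], [11], [2],
                  [11, 19], [19], [19], [11, 19], [11, 19]],
})
df1["num_functions"] = df1["functions"].map(len)
df1["influence"] = (influence_factor / df1["num_functions"]).round(2)

# One row per (var_id, function) pair; the row's influence is repeated for each function.
long_df = df1.explode("functions")
long_df["functions"] = long_df["functions"].astype(int)

# Sum the influence per var_id and function, then spread the functions over columns 0..24.
# Functions that never occur for a var_id get a penalty of 0.
penalty = (
    long_df.groupby(["var_id", "functions"])["influence"].sum()
    .unstack(fill_value=0.0)
    .reindex(columns=range(25), fill_value=0.0)
)

# Influence is expressed in percent, so divide by 100 before subtracting from 1.0.
result = (1.0 - penalty / 100).round(2)
print(result)
```

On the sample data this should give the same values as the concat version, for example 0.85 for AA_ABC006 in column 19 and 0.90 for AA_ABC005 in column 2. It also avoids the fillna step, because missing combinations are filled with 0 before the subtraction.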