Pandas: Use column value to select the value from a different column to populate a new column

I have this dataframe called quest:

    0_score     1_score     2_score     3_score     4_score     5_score     true_label
0   0.007512    0.264500    0.273147    0.218029    0.233726    0.003084    1
1   0.130695    0.289085    0.173402    0.144897    0.238129    0.023792    1
2   0.006896    0.130070    0.289822    0.210133    0.219567    0.143512    4
3   0.006819    0.178320    0.259109    0.041048    0.316587    0.198118    1
4   0.011121    0.058437    0.182823    0.317847    0.123521    0.306250    3

I want to create a new column based on the value in the true_label column. I can do this:

scores = ['0_score', '1_score', '2_score', '3_score', '4_score','5_score']
(quest.assign(true_label_score = lambda df_:df_[scores[1]]))

Which gives me this:


    0_score     1_score     2_score     3_score     4_score     5_score     true_label  true_label_score
0   0.007512    0.264500    0.273147    0.218029    0.233726    0.003084    1   0.264500
1   0.130695    0.289085    0.173402    0.144897    0.238129    0.023792    1   0.289085
2   0.006896    0.130070    0.289822    0.210133    0.219567    0.143512    4   0.130070
3   0.006819    0.178320    0.259109    0.041048    0.316587    0.198118    1   0.178320
4   0.011121    0.058437    0.182823    0.317847    0.123521    0.306250    3   0.058437

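How can I replace [scores[1]] with something like scores[quest.true_label], so that for each row the value in true_label picks the matching column from the scores list, and the value in true_label_score comes from that column? Row index 2 should use the value from the 4_score column, and row index 4 should use the value from the 3_score column as true_label_score.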

You can use DataFrame.apply:

def label_score(row):
    col_num = int(row['true_label'])
    return row[f'{col_num}_score']

quest['true_label_score'] = quest.apply(label_score, axis=1)

If you want a solution based on the scores list, you can do this:

scores = ['0_score', '1_score', '2_score', '3_score', '4_score','5_score']

def label_score(row, scores):
    col_num = int(row['true_label'])
    col_label = scores[col_num]
    return row[col_label]

quest['true_label_score'] = quest.apply(label_score, scores=scores, axis=1)

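However, assuming the columns are in the right order (i.e. 0_score is the first column, 1_score is the second, and so on), using numpy fancy indexing, as suggested by @mozway, will be faster.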

import numpy as np

quest['true_label_score'] = quest.to_numpy()[np.arange(len(quest)), quest['true_label']]

Output:

>>> quest 

    0_score   1_score   2_score   3_score   4_score   5_score  true_label  true_label_score
0  0.007512  0.264500  0.273147  0.218029  0.233726  0.003084           1          0.264500
1  0.130695  0.289085  0.173402  0.144897  0.238129  0.023792           1          0.289085
2  0.006896  0.130070  0.289822  0.210133  0.219567  0.143512           4          0.219567
3  0.006819  0.178320  0.259109  0.041048  0.316587  0.198118           1          0.178320
4  0.011121  0.058437  0.182823  0.317847  0.123521  0.306250           3          0.317847
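
For reference, the fancy-indexing approach can also be written as a self-contained sketch. It rebuilds the sample frame from the question and, as a variation that is not part of the original answer, indexes only the columns named in scores, so the result does not depend on where those columns sit in the full dataframe. The names score_values, row_idx and col_idx are just illustrative, and it assumes true_label holds integers that are valid positions in scores.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
quest = pd.DataFrame({
    '0_score': [0.007512, 0.130695, 0.006896, 0.006819, 0.011121],
    '1_score': [0.264500, 0.289085, 0.130070, 0.178320, 0.058437],
    '2_score': [0.273147, 0.173402, 0.289822, 0.259109, 0.182823],
    '3_score': [0.218029, 0.144897, 0.210133, 0.041048, 0.317847],
    '4_score': [0.233726, 0.238129, 0.219567, 0.316587, 0.123521],
    '5_score': [0.003084, 0.023792, 0.143512, 0.198118, 0.306250],
    'true_label': [1, 1, 4, 1, 3],
})

scores = ['0_score', '1_score', '2_score', '3_score', '4_score', '5_score']

# Keep only the score columns, in the order given by `scores`,
# so that integer label k lines up with column position k.
score_values = quest[scores].to_numpy()
row_idx = np.arange(len(quest))            # 0, 1, ..., n-1
col_idx = quest['true_label'].to_numpy()   # assumed: valid integer positions

# Pair each row index with its label to pick one value per row.
quest['true_label_score'] = score_values[row_idx, col_idx]
print(quest)
```

Selecting quest[scores] first also keeps the array purely numeric, which avoids an object-dtype array if the dataframe later gains non-numeric columns.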