如何在拥有 Ground Truth 的情况下只为 Data Frame 找到 True Positive？

Question

首先，对于冗长的描述，我深表歉意，但我希望每个人都能理解我的问题。

我正在研究一个预测 14 种不同病理的检测模型，并且我制作了一个推理文件来预测任何新的测试图像。数据集有大约 25k+ 的测试图像，我已经找到了他们的预测并制作了这样的文件 Dataframe.

在这个数据框中我有（了解我的场景的小信息）：

image_name______00000003_000.png
label_____[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box_____True/False
Atelectasis _____0.172639399766922
Cardiomegaly _____0.064461663365364
Consolidation _____0.436323910951614
Edema _____0.152604594826698
Effusion _____0.077432356774807
Emphysema _____0.569778263568878
Fibrosis _____0.333310723304749
Hernia _____0.219542726874351
Infiltration _____0.240452200174332
Mass _____0.291741400957108
Nodule _____0.076222963631153
Pleural_Thickening_____ 0.294208467006683
Pneumonia _____0.281939893960953
Pneumothorax _____0.386653006076813

我想要的： 我们可以通过两种方法找到它：例如为每个单独的 class 获取这些行。就像首先找到包含 Cardiomegaly 单标签或多标签的所有行。

然后应用下面的操作或根据需要和专业知识找到TP。

我想要的图像是否具有 ['Cardiomegaly', 'Edema', 'Infiltration'] 之类的基本事实并具有 14 种病理概率。如果这些实际标签具有最高概率值，我想找到 True Positive：

喜欢Cardiomegaly，如果它找到了最高概率-然后制作一个新的col并将其放入True。我不知道我应该为多标签做些什么，首先找到我应该为第二个 label 做些什么，如果它的概率最高，那么我该如何操作。我在@tlentali 的帮助下完成了最后一次尝试，谢谢大家帮助我。这是我所做的：

df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()

这让我喜欢：

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

这不是我想要的，它只适用于单个标签，不适用于多个标签。请帮助我，对于冗长的描述感到抱歉，但这只是每个人都明白我想要什么。谢谢

Answer 1

来自你的 DataFrame :

>>> import pandas as pd

>>> df
                file    set     label                                        bbx    Atelectasis Cardiomegaly    Consolidation   Edema   Effusion    Emphysema   Fibrosis    Hernia  Infiltration    Mass    Nodule  Pleural_Thickening  Pneumonia   Pneumothorax
0   00000003_000.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.145712    0.028958    0.205006    0.055228    0.115680    0.376638    0.349124    0.357694    0.122496    0.202218    0.075018    0.118994    0.195345    0.215577
1   00000003_001.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.132639    0.046136    0.169713    0.092743    0.285383    0.614464    0.311035    0.344040    0.117032    0.447748    0.152327    0.094364    0.174125    0.316022
2   00000003_002.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.233026    0.042541    0.227911    0.047988    0.116835    0.595102    0.330304    0.367272    0.117985    0.298624    0.109354    0.133473    0.185444    0.379627
3   00000003_003.png    Test    [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....   False   0.298693    0.022646    0.237977    0.035348    0.143645    0.487804    0.384509    0.379062    0.083205    0.625744    0.102377    0.207353    0.184517    0.354402
4   00000003_004.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.522152    0.052897    0.237475    0.082139    0.200029    0.473421    0.377468    0.336104    0.106339    0.488078    0.088047    0.146686    0.200919    0.313684

首先，我们 eval 列 label 以提取我们期望预测的 class :

>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df
0                                              [Hernia]
1                                              [Hernia]
2                                              [Hernia]
3                                [Hernia, Infiltration]
4                                              [Hernia]
5                                              [Hernia]
6                                              [Hernia]
7                                              [Hernia]
8                                          [No Finding]
9                             [Emphysema, Pneumothorax]
10                            [Emphysema, Pneumothorax]
11                                 [Pleural_Thickening]
12    [Effusion, Emphysema, Infiltration, Pneumothorax]
13    [Emphysema, Infiltration, Pleural_Thickening, ...
14                             [Effusion, Infiltration]
15                                       [Infiltration]
Name: class, dtype: object

然后，我们 explode 列 class 得到预期的 class 行，如下所示：

>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0                 Hernia
1                 Hernia
2                 Hernia
3                 Hernia
4           Infiltration
5                 Hernia
6                 Hernia
7                 Hernia
8                 Hernia
9             No Finding
10             Emphysema
11          Pneumothorax
12             Emphysema
13          Pneumothorax
14    Pleural_Thickening
15              Effusion
16             Emphysema
17          Infiltration
18          Pneumothorax
19             Emphysema
20          Infiltration
21    Pleural_Thickening
22          Pneumothorax
23              Effusion
24          Infiltration
25          Infiltration
Name: class, dtype: object

然后，我们将数据转换为虚拟格式：

>>> classes = ['Atelectasis', 
...            'Cardiomegaly',
...            'Consolidation', 
...            'Edema', 
...            'Effusion', 
...            'Emphysema', 
...            'Fibrosis', 
...            'Hernia',
...            'Infiltration', 
...            'Mass', 
...            'Nodule', 
...            'Pleural_Thickening', 
...            'Pneumonia',
...            'Pneumothorax',
...            'No Finding']
>>> s = df['class']
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
>>> df_classes.head()
    Effusion    Emphysema   Hernia  Infiltration    No Finding  Pleural_Thickening  Pneumothorax
0   0           0           1       0               0           0                   0
1   0           0           1       0               0           0                   0
2   0           0           1       0               0           0                   0
3   0           0           1       0               0           0                   0
4   0           0           0       1               0           0                   0

由于我们目前正在处理玩具数据集，因此我们必须进行一些调整以利用所有所需的 classes 作为虚拟格式：

>>> df_classes['Atelectasis'] = 0 
>>> df_classes['Cardiomegaly'] = 0 
>>> df_classes['Consolidation'] = 0 
>>> df_classes['Edema'] = 0 
>>> df_classes['Fibrosis'] = 0 
>>> df_classes['Mass'] = 0 
>>> df_classes['Nodule'] = 0 
>>> df_classes['Pneumonia'] = 0 
>>> df['No Finding'] = 0

现在，我们可以使用 sklearn 来获得我们的 TRP 并最终获得 AUC :

from sklearn.metrics import roc_curve, auc


n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

现在，我们可以看看 roc_auc 值，nan 是因为并非所有 classes 都在玩具数据集中被预测：

>>> roc_auc
 1: nan,
 2: nan,
 3: nan,
 4: 0.3125,
 5: 0.7613636363636364,
 6: nan,
 7: 0.9479166666666666,
 8: 0.6190476190476191,
 9: nan,
 10: nan,
 11: 0.30208333333333337,
 12: nan,
 13: 0.7840909090909091,
 14: 0.5,
 'micro': 0.66562764158918}

我们现在可以根据 TPR 和 FPR 为每个 class 绘制 ROC_AUC 曲线（这里注意 classe，一些 class 在我们处理玩具数据集时是空的）：

import matplotlib.pyplot as plt


plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()

如何在拥有 Ground Truth 的情况下只为 Data Frame 找到 True Positive？

How to find True Postive only for Data Frame while having Ground Truth?

python

model

prediction

dataframe

pandas