How to calculate the accuracy of multiple pandas DataFrames with multiple columns

I have several pandas DataFrames, as follows:

import pandas as pd

data1 = {'1':[4], '2':[2], '3':[6]}
baseline = pd.DataFrame(data1)

# baseline output
   1  2  3
0  4  2  6

data2 = {'1':[3], '2':[5], '5':[5]}
forecast1 = pd.DataFrame(data2)

# forecast1 output
   1  2  5
0  3  5  5

data3 = {'1':[2], '3':[4], '5':[5], '6':[2]}
forecast2 = pd.DataFrame(data3)

# forecast2 output
   1  3  5  6
0  2  4  5  2

How can I calculate the accuracy of forecast1 and forecast2 (separately) against the baseline DataFrame (i.e. baseline vs. forecast1 and baseline vs. forecast2)?

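Also note that forecast1 and forecast2 may have some extra columns compared with the baseline DataFrame. The accuracy calculation therefore needs to take the number of available columns into account and handle the extra ones. Is there a way to deal with this situation?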

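These DataFrames are the result of some data cleaning I am doing, which is why some of them have extra columns that are not present in the baseline DataFrame.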

Thanks for your help.

print(baseline.columns)
print(forecast1.columns)
print(forecast2.columns)
Index(['1', '2', '3'], dtype='object')
Index(['1', '2', '5'], dtype='object')
Index(['1', '3', '5', '6'], dtype='object')

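You can take the intersection of the columns to find out which ones the baseline and the forecast have in common, and then apply accuracy_score on just those columns.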

from sklearn.metrics import accuracy_score

common_columns = list(set(baseline.columns).intersection(forecast1.columns))

avg_acc = 0.0
for c in common_columns:
    c_acc = accuracy_score(baseline[c], forecast1[c])
    print(f'Column {c} acc: {c_acc}')
    avg_acc += c_acc/len(common_columns)

print(avg_acc)
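
As a sketch of the same idea using only pandas, `Index.intersection` and `Index.difference` list the shared, extra, and missing columns directly. The names `common`, `extra`, and `missing` are illustrative and not part of the code above.

```python
import pandas as pd

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})

# Columns present in both frames (sorted for a stable order).
common = sorted(baseline.columns.intersection(forecast1.columns))
# Columns that only the forecast has.
extra = sorted(forecast1.columns.difference(baseline.columns))
# Baseline columns that the forecast lacks.
missing = sorted(baseline.columns.difference(forecast1.columns))

print(common, extra, missing)  # ['1', '2'] ['5'] ['3']
```

Sorting also avoids the arbitrary ordering that a plain Python `set` can produce.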

Write a function that takes the baseline and one forecast and returns the accuracy.

from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
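
To score each forecast separately against the baseline, one option is to loop over the forecasts. The sketch below avoids the scikit-learn dependency by counting exact matches with pandas, which for a single column of labels should give the same fraction that `accuracy_score` reports. `exact_match_accuracy` and `scores` are hypothetical names, and the frames are assumed to share the same row index.

```python
import pandas as pd

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})
forecast2 = pd.DataFrame({'1': [2], '3': [4], '5': [5], '6': [2]})

def exact_match_accuracy(baseline, forecast):
    """Average, over shared columns, of the fraction of rows that match exactly."""
    common = sorted(set(baseline.columns) & set(forecast.columns))
    if not common:
        return float('nan')  # nothing to compare
    per_column = [(baseline[c] == forecast[c]).mean() for c in common]
    return float(sum(per_column) / len(per_column))

forecasts = {'forecast1': forecast1, 'forecast2': forecast2}
scores = {name: exact_match_accuracy(baseline, f) for name, f in forecasts.items()}
print(scores)  # {'forecast1': 0.0, 'forecast2': 0.0}
```

On the sample data every shared value differs from the baseline, so both scores are 0.0. Exact-match accuracy is strict for numeric forecasts, which is one reason an error-based metric may be a better fit.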

To penalize a forecast for having more or fewer columns than the baseline, enlarge the divisor:

from sklearn.metrics import accuracy_score

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = accuracy_score(baseline[c], forecast1[c])
        print(f'Column {c} acc: {c_acc}')
        if penalize:
            # Enlarging the divisor penalizes forecasts with more or fewer columns than the baseline; adjust to your needs.
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
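
It may help to see what the divisor works out to on the sample data. The sketch below compares the count-based divisor used above with a hypothetical alternative that counts every column found in only one of the two frames. `divisors`, `by_count`, and `by_mismatch` are illustrative names, and the second variant is an assumption about what "handling extra columns" could mean, not something the code above does.

```python
import pandas as pd

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})
forecast2 = pd.DataFrame({'1': [2], '3': [4], '5': [5], '6': [2]})

def divisors(baseline, forecast):
    base_cols, fc_cols = set(baseline.columns), set(forecast.columns)
    n_common = len(base_cols & fc_cols)
    # Divisor from the function above: shared columns plus the difference in column counts.
    by_count = n_common + abs(len(fc_cols) - len(base_cols))
    # Hypothetical alternative: shared columns plus every column found in only one frame.
    by_mismatch = n_common + len(base_cols ^ fc_cols)
    return by_count, by_mismatch

print(divisors(baseline, forecast1))  # (2, 4)
print(divisors(baseline, forecast2))  # (3, 5)
```

forecast1 lacks column '3' and adds column '5', but both frames have three columns, so the count-based divisor applies no penalty in that case.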

For regression, try mean absolute error; the lower the error, the better the forecast.

from sklearn.metrics import accuracy_score, mean_absolute_error

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = mean_absolute_error(baseline[c], forecast1[c])
        print(f'Column {c} mean absolute error: {c_acc}')
        if penalize:
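            # The divisor grows when the forecast has more or fewer columns than the baseline; adjust to your needs.
            # Caution: for an error metric a larger divisor lowers the reported error, so it softens the score
            # rather than penalizing it. Set penalize = False, or scale the error up instead, if that matters.
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))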
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
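
For reference, the per-column mean absolute error can also be computed with pandas alone. This is a minimal sketch under the assumption that both frames share the same row index; `mae_by_column` is a hypothetical helper name.

```python
import pandas as pd

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})
forecast2 = pd.DataFrame({'1': [2], '3': [4], '5': [5], '6': [2]})

def mae_by_column(baseline, forecast):
    """Mean absolute error for each column the two frames share."""
    common = sorted(set(baseline.columns) & set(forecast.columns))
    return {c: float((baseline[c] - forecast[c]).abs().mean()) for c in common}

mae1 = mae_by_column(baseline, forecast1)
mae2 = mae_by_column(baseline, forecast2)
print(mae1)  # {'1': 1.0, '2': 3.0}
print(mae2)  # {'1': 2.0, '3': 2.0}

# Unpenalized average over the shared columns.
print(sum(mae1.values()) / len(mae1))  # 2.0
print(sum(mae2.values()) / len(mae2))  # 2.0
```

Both forecasts average an error of 2.0 over their shared columns, so on this sample they are equally far from the baseline.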

In general, the mean percentage correct is roughly 100% minus the mean percentage error, so you can simply subtract the error from 100%.
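
Written out, one way to read this is the following, where $a_i$ are the baseline values and $b_i$ the forecast values of a column with $n$ rows. The notation is an assumption made for illustration, and the formula requires every $a_i$ to be non-zero.

```latex
\text{fraction correct}
  = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \frac{\lvert a_i - b_i\rvert}{\lvert a_i\rvert}\right)
  = 1 - \frac{1}{n}\sum_{i=1}^{n}\frac{\lvert a_i - b_i\rvert}{\lvert a_i\rvert}
```

For example, a baseline of 4 and a forecast of 3 give $1 - 1/4 = 0.75$, i.e. 75% correct. The value becomes negative when the error is larger than the baseline value itself.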

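def perc(a_list, b_list):
    # Mean of (1 - relative absolute error); assumes a_list contains no zeros.
    ans = 0.0

    for a, b in zip(a_list, b_list):
        ans += 1. - abs(a - b) / abs(a)

    # Divide by the number of rows so the result is a mean, not a sum.
    return ans / len(a_list)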

from sklearn.metrics import accuracy_score, mean_absolute_error

def calc_acc(baseline, forecast1):
    penalize = True
    common_columns = list(set(baseline.columns).intersection(forecast1.columns))

    avg_acc = 0.0
    for c in common_columns:
        c_acc = perc(baseline[c], forecast1[c])
        print(f'Column {c} mean percentage correct: {c_acc}')
        if penalize:
            # Enlarging the divisor penalizes forecasts with more or fewer columns than the baseline; adjust to your needs.
            div = len(common_columns) + abs(len(forecast1.columns) - len(baseline.columns))
            avg_acc += c_acc/div
        else:
            avg_acc += c_acc/len(common_columns)

    print(avg_acc)
    return avg_acc
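
A vectorized sketch of the same percentage calculation is shown below. It is not the function above: `pct_correct` is a hypothetical name, it assumes numeric columns with matching row indexes, and it treats rows with a zero baseline as missing instead of dividing by zero.

```python
import numpy as np
import pandas as pd

baseline = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
forecast1 = pd.DataFrame({'1': [3], '2': [5], '5': [5]})

def pct_correct(actual, predicted):
    """Mean of 1 - |actual - predicted| / |actual|, skipping rows where actual is 0."""
    actual = actual.astype(float)
    denom = actual.abs().replace(0, np.nan)  # zero baselines become NaN and are skipped by mean()
    return float((1.0 - (actual - predicted).abs() / denom).mean())

col1 = pct_correct(baseline['1'], forecast1['1'])
col2 = pct_correct(baseline['2'], forecast1['2'])
print(col1)  # 0.75
print(col2)  # -0.5
```

Column '2' comes out negative because the error (3) is larger than the baseline value (2), so it can be worth clipping the result at 0 if a true percentage is needed.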