Custom method for voting between multiple CSV files

Question

I have 3 (or more) DataFrames with this structure:

ID	Percentage
00001	3
00002	15
00003	73
00004	90
...	...

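Each CSV has its own predicted percentage values.

One of these CSVs has a very good MAE, so I want to give it more weight. If 2 or more files predict the same value, I want that value to be taken into account (ideally, when the values are only close to each other, I would take their average).

Here is my code: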

import pandas as pd

# index_col=0 turns the ID column into the index of each DataFrame
df1 = pd.read_csv("BlahBlahBlah01.csv", index_col=0)
df2 = pd.read_csv("BlahBlahBlah02.csv", index_col=0)
df3 = pd.read_csv("BlahBlahBlah03.csv", index_col=0)
dfGold = pd.read_csv("BlahBlahBlahGold.csv", index_col=0)

# all dataframes have the same shape
lenOfDF = len(df1)

newCSV = pd.DataFrame(columns=['ID', 'Percentage'])
newCSV['ID'] = df1.index

for i in range(lenOfDF):
    pred01 = df1['Percentage'].iloc[i]
    pred02 = df2['Percentage'].iloc[i]
    pred03 = df3['Percentage'].iloc[i]
    predGold = dfGold['Percentage'].iloc[i]

    # take the first prediction that exactly matches any other one,
    # otherwise fall back to the file with the best MAE
    if pred01 in (pred02, pred03, predGold):
        newCSV.loc[i, 'Percentage'] = pred01
    elif pred02 in (pred01, pred03, predGold):
        newCSV.loc[i, 'Percentage'] = pred02
    elif pred03 in (pred01, pred02, predGold):
        newCSV.loc[i, 'Percentage'] = pred03
    else:
        newCSV.loc[i, 'Percentage'] = predGold

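I know this is very basic and will not give good predictions, so I need help fixing it.

As I said above, I want to apply weights, and I also want to treat values within ±5 of each other as close.

I know there are ensemble techniques, but I have CSV files, not models.

Thanks...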

Answer 1

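    import pandas as pd

    csv_list = ['BlahBlahBlah01','BlahBlahBlah02','BlahBlahBlah03','BlahBlahBlahGold']

    preds = []

    # read each file and name its prediction column after the file's position
    for i, name in enumerate(csv_list):
        pred = pd.read_csv(f"./{name}.csv", index_col=0)
        pred = pred.rename(columns={"Percentage": i})
        preds.append(pred)
    # one row per ID, one column per file
    preds = pd.concat(preds, axis=1)

    # most frequent value in each row; if no value repeats, mode() lists
    # every value in ascending order, so column 0 is then the row minimum
    preds["Percentage"] = preds.mode(axis=1)[0]

    # the IDs are already the index because of index_col=0
    preds["Id"] = preds.index

    preds.to_csv("output.csv", columns=['Id', 'Percentage'], index=False)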
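
The mode-based vote above only counts exact matches and treats every file equally. As a rough sketch of how the ±5 tolerance and the extra weight for the best file could be added: for each row, group the predictions that lie within the tolerance of one another, pick the group with the largest total weight (only groups of 2 or more count), and return its weighted average; if no two predictions agree, fall back to the best file. The names `vote_row` and `vote_frame`, the column names, `gold_weight=2.0` and the toy data are assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd


def vote_row(values, weights, gold_pos, tol=5.0):
    """Weighted mean of the heaviest group of predictions within +-tol."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # close[i, j] is True when prediction j lies within +-tol of prediction i
    close = np.abs(values[:, None] - values[None, :]) <= tol
    support = close.sum(axis=1)  # size of the group around each prediction
    if support.max() < 2:
        # no two files agree: trust the file with the best MAE
        return float(values[gold_pos])
    # total weight of each group; groups of size 1 are excluded
    score = np.where(support >= 2, close.astype(float) @ weights, -np.inf)
    members = close[int(np.argmax(score))]
    return float(np.average(values[members], weights=weights[members]))


def vote_frame(preds, gold_col, gold_weight=2.0, tol=5.0):
    """Apply vote_row to every row of a frame with one column per file."""
    weights = np.where(preds.columns == gold_col, gold_weight, 1.0)
    gold_pos = preds.columns.get_loc(gold_col)
    return preds.apply(
        lambda row: vote_row(row.to_numpy(), weights, gold_pos, tol), axis=1
    )


# toy data standing in for the four CSV files (hypothetical values)
preds = pd.DataFrame(
    {"m1": [10, 3, 10], "m2": [12, 40, 50], "m3": [50, 70, 52], "gold": [80, 90, 49]},
    index=pd.Index(["00001", "00002", "00003"], name="ID"),
)
result = vote_frame(preds, gold_col="gold")
print(result)
```

With real files, `preds` would be the concatenated frame built from the CSVs, with the best file's column passed as `gold_col`. The weight of 2.0 is arbitrary; something derived from each file's MAE (for example its inverse) may be a more principled choice.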

python

csv

pandas

ensemble-learning