Custom method for voting between multiple CSV files
I have 3 (or more) DataFrames with this structure:
ID | Percentage |
---|---|
00001 | 3 |
00002 | 15 |
00003 | 73 |
00004 | 90 |
... | ... |
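Each CSV has its own predicted Percentage value for every ID.
One of these CSVs has a very good MAE, so I want to give it more weight. If 2 or more files predict the same value, I want that value to be taken into account (and if the values are only close to each other, I want to take their average).
Here is my code: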

    import pandas as pd

    df1 = pd.read_csv("BlahBlahBlah01.csv", index_col=0)
    df2 = pd.read_csv("BlahBlahBlah02.csv", index_col=0)
    df3 = pd.read_csv("BlahBlahBlah03.csv", index_col=0)
    dfGold = pd.read_csv("BlahBlahBlahGold.csv", index_col=0)

    # all dataframes have the same shape
    lenOfDF = len(df1)

    percentages = []
    for i in range(lenOfDF):
        # .iloc reads by position, because index_col=0 turned ID into the index
        pred01 = df1['Percentage'].iloc[i]
        pred02 = df2['Percentage'].iloc[i]
        pred03 = df3['Percentage'].iloc[i]
        predGold = dfGold['Percentage'].iloc[i]

        # exact-match voting: take the first prediction that agrees with any
        # other file, otherwise fall back to the Gold file
        if pred01 in (pred02, pred03, predGold):
            percentages.append(pred01)
        elif pred02 in (pred01, pred03, predGold):
            percentages.append(pred02)
        elif pred03 in (pred01, pred02, predGold):
            percentages.append(pred03)
        else:
            percentages.append(predGold)

    # ID lives in the index after index_col=0, so df1['ID'] would raise a KeyError
    newCSV = pd.DataFrame({'ID': df1.index, 'Percentage': percentages})

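I know this is very basic and won't give good predictions, so I need help improving it.
As I said above, I want to apply weights, and I also want to treat values within ±5 of each other as close.
I know there are ensemble techniques, but I have CSV files rather than models.
Thanks...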
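
    import pandas as pd

    csv_list = ['BlahBlahBlah01', 'BlahBlahBlah02', 'BlahBlahBlah03', 'BlahBlahBlahGold']

    # read each file with ID as the index and give its Percentage column a unique name
    preds = []
    for i, name in enumerate(csv_list):
        pred = pd.read_csv(f"./{name}.csv", index_col=0)
        pred.rename(columns={"Percentage": i}, inplace=True)
        preds.append(pred)

    # one column per file, rows aligned on ID
    preds = pd.concat(preds, axis=1)

    # row-wise mode: the most frequent value per ID. [0] keeps the first (smallest) mode,
    # so a row where every file disagrees ends up with its smallest prediction
    preds["Percentage"] = preds.mode(axis=1)[0]

    preds["Id"] = preds.index
    preds.to_csv("output.csv", columns=['Id', 'Percentage'], index=False)
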
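
The `mode(axis=1)` approach only counts exact matches and treats every file equally, so it does not cover the weighting or the ±5 tolerance asked for in the question. Below is a minimal sketch of one way to add both, not a definitive implementation. The function name `weighted_tolerance_vote`, the column names, the toy values, and the weight of 1.5 for the Gold file are all assumptions made for illustration. For each row, every prediction is treated as a candidate and backed by the total weight of the files within `tol` of it; the result is the weighted mean of the best-supported group.

```python
import numpy as np
import pandas as pd


def weighted_tolerance_vote(preds, weights, tol=5.0):
    """Row-wise vote (hypothetical helper): back each candidate by the weight
    of every file within +-tol of it, then return the weighted mean of the
    best-supported group."""
    values = preds.to_numpy(dtype=float)   # shape (n_rows, n_files)
    w = np.asarray(weights, dtype=float)   # shape (n_files,)

    # close[r, i, j] is True when file j is within tol of file i on row r
    close = np.abs(values[:, :, None] - values[:, None, :]) <= tol

    # support[r, i] = total weight of the files that agree with file i
    support = (close * w).sum(axis=2)

    # best-supported candidate per row; argmax keeps the first column on ties
    best = support.argmax(axis=1)
    members = close[np.arange(len(values)), best]   # shape (n_rows, n_files)

    # weighted mean over the members of the winning group
    combined = (values * w * members).sum(axis=1) / (w * members).sum(axis=1)
    return pd.Series(combined, index=preds.index, name="Percentage")


# Toy data standing in for the concatenated predictions (values are made up).
preds = pd.DataFrame(
    {
        "m1": [3, 15, 70, 90],
        "m2": [4, 40, 73, 10],
        "m3": [30, 17, 20, 50],
        "gold": [60, 80, 75, 30],
    },
    index=["00001", "00002", "00003", "00004"],
)

# Assumed weights: Gold counts 1.5, so it beats any single file on its own
# but loses to two ordinary files that agree with each other.
result = weighted_tolerance_vote(preds, weights=[1, 1, 1, 1.5], tol=5)
print(result)
```

With these assumed weights, two ordinary files that agree (total weight 2) outvote the Gold file alone (1.5), while Gold alone still beats any single ordinary file. That mirrors the fallback to `predGold` in the question: on the last toy row, where no two files are within 5 of each other, the Gold value is returned unchanged. Closeness is measured relative to each candidate, so a group is "everything within `tol` of that candidate" rather than a transitive cluster.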