How to choose or optimize the labels so that we get better multiclass classification results?

Question

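Recently I was working on the Kaggle project "Prudential Life Insurance Assessment", where participants discuss changing the labels to get a better metric.

In that competition the target has 8 classes (1-8), but one participant uses the labels [-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0] instead of [1, 2, 3, 4, 5, 6, 7, 8], and another solution uses the same values.

I would like to know how to come up with these magic numbers.

I am open to any ideas/tricks/suggestions for doing such transformations. Your input is highly appreciated!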

Example code

# imports
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

# data
df = sns.load_dataset('iris')
df['species'] = pd.factorize(df['species'])[0]
df = df.sample(frac=1,random_state=100)

# train test split
X = df.drop('species',axis=1)
y = df['species']
Xtrain,  Xtest, ytrain, ytest = train_test_split(X,y,stratify=y,random_state=100)

# modelling
model = xgb.XGBClassifier(objective='multi:softprob', random_state=100)
model.fit(Xtrain, ytrain)
preds = model.predict(Xtest)
kappa = metrics.cohen_kappa_score(ytest, preds, weights='quadratic')

print(kappa)

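My thoughts

Labels can take infinitely many values; how do we transform [1-8] into [x-y]?
Should we randomly pick 8 numbers and check the kappa for each choice? That seems like the least reasonable idea and would probably never work.
Is there some kind of gradient descent method for this? Maybe not, just an idea.

Reference links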

Answer 1

The first link in your question actually contains the answer:

#The hardcoded values were obtained by optimizing a CV score using simulated annealing

Later the author comments:

At first I was optimising the parameters one by one but then I switched to optimising them simultaneously by a combination of grid search and simulated annealing. I am not sure I found a global maximum of the CV score though, even after playing around with various settings of the simulated annealing. Maybe genetic algorithms would help.

The solution in the second link has the same values because (probably) its author copied them from the first solution (see their comment):

Inspired by: https://www.kaggle.com/mariopasquato/prudential-life-insurance-assessment/linear-model/code

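In short: you can treat these values as metaparameters of your learning algorithm (well, they are). That way you can define a function F(metaparameters) such that computing a single value of it requires a full training on the training set and outputs the loss on the validation set (or, better, uses n-fold cross-validation and outputs the CV loss). Your task then becomes optimizing F to find the best set of metaparameters with whatever optimization method you like; for example, the author of the first solution says they used grid search and simulated annealing.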
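To make F(metaparameters) concrete, here is a minimal sketch of what such a function could look like. It is not the code from either Kaggle solution: the iris data (3 classes standing in for the 8), the `LinearRegression` model, the name `cv_loss`, and the rule of mapping each prediction back to the class with the nearest label value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

# Toy data: classes 0, 1, 2 stand in for the competition's 1..8.
X, y = load_iris(return_X_y=True)

def cv_loss(label_values, n_splits=5, seed=100):
    """Hypothetical F(metaparameters): negative mean quadratic kappa over CV folds."""
    label_values = np.asarray(label_values, dtype=float)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X, y):
        # Train a regressor on the remapped, real-valued targets.
        model = LinearRegression().fit(X[train_idx], label_values[y[train_idx]])
        raw = model.predict(X[valid_idx])
        # Assumption: map each prediction to the class whose value is nearest.
        pred = np.abs(raw[:, None] - label_values[None, :]).argmin(axis=1)
        scores.append(
            cohen_kappa_score(y[valid_idx], pred, labels=[0, 1, 2], weights="quadratic")
        )
    # Optimizers minimize, so return the negated score.
    return -float(np.mean(scores))

baseline = cv_loss([0.0, 1.0, 2.0])  # the unmodified labels
print(baseline)
```

A function like `cv_loss` can then be handed to an optimizer in place of a toy objective. Each call retrains the model once per fold, so the cost of the search is dominated by the number of function evaluations.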

A small example, without any meta-tuning of the optimization itself:

import numpy as np
import scipy.optimize

cnt = 0  # counts how many times the objective is evaluated

def use_a_function_which_calls_training_and_computes_cv_instead_of_this(x):
    # Toy objective whose minimum is at the "magic" label values;
    # replace it with a function that trains a model and returns the CV loss.
    global cnt
    cnt += 1
    return ((x - np.array([-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0]))**2).sum()

my_best_guess_for_the_initial_parameters = np.array([1.,2.,3.,4.,5.,6.,7.,8.])
optimization_results = scipy.optimize.basinhopping(
    use_a_function_which_calls_training_and_computes_cv_instead_of_this,
    my_best_guess_for_the_initial_parameters,
    niter=100)
print("Times function was called: {0}".format(cnt))
print(optimization_results.x)

Example output:

Times function was called: 3080
[-1.6         0.7         0.3         3.15        4.52999999  6.5
  6.77        8.99999999]

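You will most likely want to experiment with the parameters of the optimization itself, and perhaps even write a custom optimizer and/or callbacks for taking steps. But it is also possible that even the default parameters will work for you, at least to some extent. If a single evaluation of the function takes too long, you can, for example, run the initial optimization on a smaller subset of the full data.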
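As a rough illustration of the customization mentioned above, the sketch below passes a custom step routine and a callback to `scipy.optimize.basinhopping` through its `take_step` and `callback` arguments. The `GaussianStep` class, the `record` function, and the step size of 0.5 are hypothetical choices for this example, and the objective is again a toy stand-in for a real CV loss.

```python
import numpy as np
import scipy.optimize

target = np.array([-1.6, 0.7, 0.3, 3.15, 4.53, 6.5, 6.77, 9.0])

def toy_objective(x):
    # Stand-in for an expensive function that trains a model and returns the CV loss.
    return ((x - target) ** 2).sum()

class GaussianStep:
    """Hypothetical custom step: perturb every value with Gaussian noise."""

    def __init__(self, stepsize=0.5, seed=0):
        # basinhopping adjusts this attribute adaptively when it is present.
        self.stepsize = stepsize
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        return x + self.rng.normal(scale=self.stepsize, size=x.shape)

history = []

def record(x, f, accept):
    # Called for each local minimum found; here it only logs the objective value.
    history.append(f)

result = scipy.optimize.basinhopping(
    toy_objective,
    np.arange(1.0, 9.0),       # initial guess [1, 2, ..., 8]
    niter=20,
    take_step=GaussianStep(),
    callback=record,
)
print(result.x)
```

With a real CV loss the objective is noisy and not smooth, so the step size and the local minimizer (set through `minimizer_kwargs`) would probably need tuning; the values here are only placeholders.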
