Expected and predicted arrays end up identical in a scikit-learn random forest model

Question

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df_train (a pandas DataFrame) and train_vars (a list of column names) are defined earlier
data = df_train[train_vars].to_numpy()  # All columns aside from 'output'
target = df_train['output'].to_numpy()

# Get training and testing splits
splits = train_test_split(data, target, test_size=0.2)
data_train, data_test, target_train, target_test = splits

# Fit the training data to the model
model = RandomForestRegressor(n_estimators=100)
model.fit(data_train, target_train)

# Make predictions
expected = target_test
predicted = model.predict(data_test)

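When I run this code to predict the variable 'output' as a function of all the other variables in this file: https://www.dropbox.com/s/cgyh09q2liew85z/uuu.csv?dl=0

the expected and predicted arrays are exactly the same. It looks like I am overfitting or doing something wrong. How can I fix this?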

Answer 1

Kudos for questioning results that look too good!

Each feature (column) in the data contains only a small number of distinct values. If I counted correctly, there are only 14 distinct rows.
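A quick way to check a claim like this is to count distinct rows and distinct values per column. The sketch below uses a small made-up DataFrame as a stand-in for the CSV from the question; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: 3 distinct rows, repeated.
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 3, 1, 2],
    "b": [0, 0, 1, 1, 0, 0, 0, 1],
    "output": [10.0, 10.0, 20.0, 20.0, 30.0, 30.0, 10.0, 20.0],
})

n_rows = len(df)
n_distinct_rows = len(df.drop_duplicates())  # rows that differ in at least one column
distinct_per_column = df.nunique()           # number of distinct values in each column

print(f"{n_rows} rows, {n_distinct_rows} distinct")
print(distinct_per_column)
```

If the number of distinct rows is far below the total row count, as it is here, most of the dataset is repetition.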

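This has two implications:

You are very likely overfitting, because you have only 14 effective samples but 36 features.
Identical rows are very likely to appear in both the test set and the training set. This means you are testing on the same data the model was trained on. Since the model overfits this data completely, you get perfect results.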
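The second point can be checked directly by measuring how many test rows have an exact copy in the training set. This is a minimal sketch on synthetic data (4 distinct rows, each repeated 25 times); the arrays and the target formula are assumptions chosen for illustration, not the data from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 4 distinct rows, each repeated 25 times (100 rows total).
base = np.array([[1, 0], [2, 1], [3, 0], [4, 1]], dtype=float)
data = np.repeat(base, 25, axis=0)
target = data[:, 0] * 10.0

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=0
)

# Rows seen during training, stored as hashable tuples.
train_rows = {tuple(row) for row in data_train}

# Fraction of test rows that have an exact copy in the training set.
leaked = np.mean([tuple(row) in train_rows for row in data_test])
print(f"{leaked:.0%} of test rows also appear in the training set")
```

With this many copies of each row, every test row is already present in the training set, so the test score says nothing about how the model handles unseen data.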

Edit

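I just realized that I have not answered the actual question: how can this be fixed?

It depends.

If you are lucky, someone made a mistake when preparing the data.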

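If the data is correct, things get harder. First, remove the duplicate rows, for example with np.vstack({tuple(row) for row in data}). Then see whether you can do anything meaningful with what is left. But honestly, I think 14 samples is a bit low for machine learning. Try to get more data :)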
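As a sketch of the deduplicate-then-split step, the example below uses np.unique(..., axis=0) (available from NumPy 1.13) instead of the np.vstack expression, since newer NumPy releases may reject a set passed to np.vstack. The data is synthetic and the names unique_data and unique_target are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 5 distinct rows, each repeated 20 times.
base = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [5, 0]], dtype=float)
data = np.repeat(base, 20, axis=0)
target = data[:, 0] * 10.0

# Stack features and target so a "duplicate" means the whole sample repeats,
# then keep one copy of each distinct row.
samples = np.unique(np.column_stack([data, target]), axis=0)
unique_data, unique_target = samples[:, :-1], samples[:, -1]

data_train, data_test, target_train, target_test = train_test_split(
    unique_data, unique_target, test_size=0.2, random_state=0
)

# After deduplication, no test row should have a copy in the training set.
train_rows = {tuple(row) for row in data_train}
overlap = sum(tuple(row) in train_rows for row in data_test)
print(len(unique_data), "distinct samples;", overlap, "test rows also in training set")
```

Deduplicating before the split removes the leakage, but it also makes the real sample count visible: with only a handful of distinct rows, any test score will be very noisy.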

python

numpy

pandas

random-forest

scikit-learn