ML enough features?

I'm trying to train a random forest on an accelerometer dataset. I compute features such as the mean, standard deviation, correlation between axes, and area under the curve. I'm new to ML.

I'd like to understand two things:

1. If I split one person's dataset into training and test sets and run the RF, prediction accuracy is high (> 90%). But if I train the RF on one person's data and then predict on data from a different person, accuracy is low (< 50%). Why? How can I debug this? I don't know what I'm doing wrong. (A sketch of both evaluation setups is shown below.)

2. In the example above, how many features are "enough" to reach 90% accuracy? How much data is "enough"?

I can provide more details. The dataset comes from 10 people, with large files of labeled data. I limited myself to the features above to avoid heavy computation.
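For reference, this is roughly how the two evaluation setups from question 1 could look in scikit-learn. It is only a minimal sketch, not my actual pipeline: `X`, `y`, and `groups` are placeholders for the extracted features, activity labels, and subject ids.

```python
# Sketch: compare a within-person split with a leave-one-subject-out split.
# X, y, and groups below are random placeholders, not real accelerometer data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))          # placeholder feature matrix (windows x features)
y = rng.integers(0, 4, size=1000)        # placeholder activity labels
groups = rng.integers(0, 10, size=1000)  # placeholder subject ids (10 people)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Within-person evaluation: windows from the same subject can land in both
# train and test, which usually inflates accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
within = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Cross-person evaluation: each test fold contains only subjects the model
# has never seen, which is the honest estimate for new users.
loso = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())

print(f"within-person accuracy: {within:.2f}")
print(f"leave-one-subject-out accuracy: {loso.mean():.2f} +/- {loso.std():.2f}")
```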

  1. Most likely your classifier is overfitting. When you train it on only one person it does not generalize well; it may simply be "memorizing" the labeled dataset rather than capturing general rules about how each feature relates to the others and how they affect the result. You may need more data or more features.
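One quick way to check for that memorization symptom is to compare training accuracy against accuracy on subjects held out entirely, and to see whether constraining the trees narrows the gap. This is only a sketch, using the same placeholder `X`, `y`, and `groups` as above; `max_depth` and `min_samples_leaf` values are illustrative, not tuned.

```python
# Sketch: overfitting check on random placeholder data.
# A large gap between training accuracy and unseen-subject accuracy
# is the "memorization" symptom described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.integers(0, 4, size=1000)
groups = rng.integers(0, 10, size=1000)

# Hold out entire subjects, not random windows.
train_idx, test_idx = next(
    GroupShuffleSplit(test_size=0.3, random_state=0).split(X, y, groups))

for depth in (None, 10, 5):  # shallower trees: less variance, more bias
    clf = RandomForestClassifier(n_estimators=200, max_depth=depth,
                                 min_samples_leaf=5, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    print(f"max_depth={depth}: "
          f"train={clf.score(X[train_idx], y[train_idx]):.2f}  "
          f"unseen subjects={clf.score(X[test_idx], y[test_idx]):.2f}")
```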

  2. 这不是那么简单的问题,是泛化问题,对此有很多理论研究,例如:Vapnik–Chervonenkis theory Akaike_information_criterion. And even with knowledge of such theories you cannot answer to this question accurately. The main principle of most of such theories - the more data you have, less variative model you trying to fit and less difference between accuracy on training and test you requiring - this theories will rank your model higher. E.g if you wan't to minimize difference between accuracy on test and training set (to make sure that accuracy on test data will not collapse) - you need to increase amount of data, provide more meaningful features (with respect to your model), or use less variative model for fitting. If you interesting in a more detailed explanation about theoretical aspect, you can watch lectures from caltech, starting from this CaltechX - CS1156x Learning from data.