What is pointless data?

I am reading a tutorial about SVM.

The author writes there:

The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm

What does he mean by "pointless data"?

That sentence refers to the one before it:

Note that if we comment out the drop id column part, accuracy goes back down into the 60s.

The KNearestNeighbors tutorial examines how a model's performance changes when 'useless' data (a.k.a. noise), such as the index of each data point, is fed to the model as input.

[...] let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column

The conclusion here is that SVM handles meaningless features, noise, or 'pointless data' in the input better than KNN does.

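In this context, the term describes any data that a classification decision should not be based on. In this particular case, the author is referring to the ID column, which contains row identifiers. They consider this data irrelevant to the decision task and therefore call it "meaningless" or even "misleading".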
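The following is a minimal sketch, not the tutorial's code, of why a row identifier can mislead a distance-based model such as KNN. It assumes a hypothetical toy dataset with one informative measurement and an arbitrary id on a much larger scale, and it uses a hand-written 1-nearest-neighbour classifier with leave-one-out evaluation so that it needs only Python's standard library. The labels are deliberately made to alternate, so rows with adjacent ids always differ in class; real data would rarely be this extreme.

```python
import math

def nearest_neighbor_label(train, query):
    """Return the label of the training row closest to `query` (1-NN, Euclidean)."""
    best = min(train, key=lambda row: math.dist(row[0], query))
    return best[1]

def leave_one_out_accuracy(rows):
    """Classify each row from all the other rows and report the hit rate."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        hits += nearest_neighbor_label(others, features) == label
    return hits / len(rows)

# Hypothetical toy data: one informative measurement plus a row id.
# Labels alternate, so neighbouring ids always belong to different classes.
samples = []
for i in range(10):
    label = i % 2
    measurement = label + 0.01 * i   # class 0 near 0.0, class 1 near 1.0
    row_id = 1000 + 100 * i          # arbitrary identifier on a much larger scale
    samples.append((row_id, measurement, label))

with_id = [((row_id, m), label) for row_id, m, label in samples]
without_id = [((m,), label) for _, m, label in samples]

acc_with_id = leave_one_out_accuracy(with_id)
acc_without_id = leave_one_out_accuracy(without_id)
print("with id:", acc_with_id)
print("without id:", acc_without_id)
```

In this contrived setup the leave-one-out accuracy is 1.0 without the id and 0.0 with it, because the id differences (100 or more) swamp the measurement differences (about 1) in the Euclidean distance. The sketch says nothing about how an SVM would fare on the same data.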

More context from the article (emphasis mine) makes this easier to understand:

Note that if we comment out the drop id column part, accuracy goes back down into the 60s. The Support Vector Machine, in general, handles pointless data better than the K Nearest Neighbors algorithm, and definitely will handle outliers better, but, in this example, the meaningless data is still very misleading for us.

This is further confirmed in an earlier part of the series (emphasis mine):

The result should be about 95%, and that's out of the box without any tweaking. Very cool! Just for show, let's show what happens when we do indeed include truly meaningless and misleading data by commenting out the dropping of the id column:

Discussion

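Whether that assessment is correct depends on the actual dataset. If the remaining data is sufficient to produce satisfactory results, then dropping such a column is probably a good idea. On the other hand, one can imagine a hypothetical example in which the ID column is generated together with the data and contains an auto-incrementing integer. It then carries information about the order of the entries. If the dataset happens to contain no other sequence information (e.g. a timestamp), the ID may not be meaningless after all.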
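As a rough illustration of that hypothetical case, the sketch below assumes twenty records stored in collection order with an auto-incrementing id, no timestamp, and a class that happens to change halfway through collection. All of the data and the helper name `pearson` are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical collection order: the first ten records are class 0,
# the last ten are class 1, and no timestamp was stored.
ids = list(range(1, 21))          # auto-incrementing id
labels = [0] * 10 + [1] * 10      # class recorded for each entry

id_label_correlation = pearson(ids, labels)
print(round(id_label_correlation, 2))
```

Here the id correlates strongly with the class simply because it encodes when each record was added, so dropping it would discard the only ordering information this made-up dataset has.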