Randomly rearranging data points when creating cross-validation indices?

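I have a dataset in which columns correspond to features (predictors) and rows correspond to data points. The data points were extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from Matlab for stratified cross-validation.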

If I use the functions above, do I still need to randomly rearrange the data points (rows) first?

As you can see in the documentation, these functions shuffle your data internally:

Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
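To illustrate the point, here is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available for cvpartition. The label vector `y` is hypothetical and deliberately sorted by class. Passing the labels instead of a plain count asks for stratified folds, and the printed row indices should show that each test fold draws rows from across the sorted data without any manual shuffling beforehand.

```matlab
% Hypothetical sorted labels: 60 observations of class 1 followed by 40 of class 2.
y = [ones(60,1); 2*ones(40,1)];
K = 5;

rng(0);                          % fix the seed so the partition is reproducible
c = cvpartition(y, 'KFold', K);  % passing the labels stratifies the folds by class

for k = 1:K
    testIdx = find(test(c, k));  % row indices held out in fold k
    fprintf('fold %d: %d test rows, %d from class 1, first rows: %s\n', ...
        k, numel(testIdx), sum(y(testIdx) == 1), mat2str(testIdx(1:3)'));
end

% crossvalind (Bioinformatics Toolbox) accepts a grouping vector in the same way:
% indices = crossvalind('Kfold', y, K);
```

Calling `rng` first is only needed when the same partition has to be reproduced later; without it, repeated calls return different partitions, as the documentation quoted above says.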

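However, if your data is structured in the sense that the i-th object carries some information about the (i+1)-th object, you should consider a different kind of split. For example, if your data is actually a (locally) time series, typical random CV is not a valid estimation technique. Why? Suppose your data contains clusters in which knowing the value of at least one element gives you a high probability of estimating the remaining ones. A random split puts members of the same cluster into both the training and the test folds, so what CV actually estimates is the ability to predict inside these clusters.

If, in real use, you expect the model to see completely new clusters, the model you selected may perform no better than chance there. In other words, if your data has some kind of internal cluster structure (or is a time series), your splits should respect it by splitting over clusters: instead of K splits of random points you use K splits of random clusters, and so on.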
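One possible way to split over clusters rather than over points is sketched below. It assumes you already have a cluster label for every row; the vector `clusterId`, the cluster sizes, and the value of `K` are hypothetical. The idea is to let cvpartition partition the list of clusters and then map each held-out cluster back to its rows.

```matlab
% Hypothetical cluster labels: 12 clusters of 10 consecutive rows each.
clusterId = repelem((1:12)', 10);
K = 4;

[clusters, ~, rowToCluster] = unique(clusterId);  % map every row to a cluster index
rng(0);                                           % reproducible partition
c = cvpartition(numel(clusters), 'KFold', K);     % partition the clusters, not the rows

for k = 1:K
    testRows  = ismember(rowToCluster, find(test(c, k)));  % rows of held-out clusters
    trainRows = ~testRows;
    % No cluster may appear on both sides of the split.
    assert(isempty(intersect(clusterId(trainRows), clusterId(testRows))));
    fprintf('fold %d: %d training rows, %d test rows\n', ...
        k, nnz(trainRows), nnz(testRows));
end
```

For a time series without explicit clusters, one option (again an assumption, not something the functions do for you) is to treat contiguous blocks of rows as clusters, for example `clusterId = ceil((1:N)' / blockLength)`. Note that this scheme is not stratified by class; if class balance matters, it has to be checked per fold.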

matlab

machine-learning

cross-validation