Split Train & Test Sets but Indexed Input Differs from Subscript by 1--why?

Question

I have split my data into training and test sets, but I keep getting this error:

! Must subset rows with a valid subscript vector.
ℹ Logical subscripts must match the size of the indexed input.
x Input has size 4067 but subscript split_data_table == 0 has size 4066.

My data is named "JFK_weather_clean2". To perform the split, I ran:

set.seed(1234)
split_data_table <- sample(c(rep(0, 0.8 * nrow(JFK_weather_clean2)), rep(1, 0.2 * nrow(JFK_weather_clean2))))

table(split_data_table) gives:

0	1
3253	813

From there I tried to create the training set:

training_set <- JFK_weather_clean2[split_data_table == 0, ]

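As you may have noticed, my input data contains 4,067 rows (including the header row), while the subscript has size 4,066. I assume the problem involves the header row, but I don't know what to correct in my sample() code. Thanks for your help!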

Answer 1

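The cause of your problem is that the rep function you used to build the split vector coerces its times argument to an integer or double vector. This behavior is explained in the documentation of rep: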

A double vector is accepted, other inputs being coerced to an integer or double vector.

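As a result, a fractional input is truncated to the largest integer not greater than it. For example, mtcars has 32 rows, 80% of which is 25.6, but rep truncates this to 25 instead of rounding it to 26.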

0.8 * nrow(mtcars)
# [1] 25.6
length(c(rep(0, 0.8 * nrow(mtcars))))
# [1] 25

If you apply your code to split mtcars, you get a vector of length 31 rather than the expected 32.

length(c(rep(0, 0.8 * nrow(mtcars)), rep(1, 0.2 * nrow(mtcars))))
# [1] 31
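The same arithmetic reproduces the numbers in the question. This is a minimal sketch that assumes only the row count 4067 reported in the error message; the names n, n_zero and n_one are introduced here for illustration and are not part of the original code.

```r
# Row count taken from the error message in the question (assumed, no data needed)
n <- 4067

0.8 * n   # 3253.6
0.2 * n   # 813.4

# rep() drops the fractional part of `times`
n_zero <- length(rep(0, 0.8 * n))
n_one <- length(rep(1, 0.2 * n))

c(n_zero, n_one, n_zero + n_one)
# [1] 3253  813 4066
```

These match the table(split_data_table) counts shown in the question, so the missing element comes from the two dropped fractions rather than from the header row.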

This truncation in rep is not a problem when the split sizes are whole numbers. For example, iris has 150 rows, so 80% of it is exactly 120.

length(c(rep(0, 0.8 * nrow(iris)), rep(1, 0.2 * nrow(iris))))
# [1] 150

One way to get the correct total number of rows is to wrap the input to the times argument of rep in round.

length(c(rep(0, round(0.8 * nrow(mtcars))), rep(1, round(0.2 * nrow(mtcars)))))
# [1] 32
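Rounding both parts independently works for the 80/20 split above. A slightly more defensive variant, sketched here as one possible approach rather than the only fix, rounds the training size once and takes the test size as the remainder, so the two counts sum to the row count by construction. The names n, n_train, n_test, split_flags and test_set are illustrative, and mtcars stands in for JFK_weather_clean2.

```r
set.seed(1234)

n <- nrow(mtcars)            # 32
n_train <- round(0.8 * n)    # 26
n_test <- n - n_train        # 6, the remainder, so the sizes always sum to n

# 0 marks a training row, 1 marks a test row, shuffled as in the question
split_flags <- sample(c(rep(0, n_train), rep(1, n_test)))

training_set <- mtcars[split_flags == 0, ]
test_set <- mtcars[split_flags == 1, ]

c(nrow(training_set), nrow(test_set))
# [1] 26  6
```

Because split_flags now has exactly one element per row, the logical subscript matches the size of the indexed input and the subsetting error should not occur.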

r

anova