Why does knnImpute preProcess change other data? Is this in error?

Question

Here is some data:

> head(p.full)[,1:3]
     id timestamp full_sq
1 30474     16617   39.00
2 30475     16617   79.20
3 30476     16617   40.50
4 30477     16617   62.80
5 30478     16617   40.00
6 30479     16617   48.43

There are some missing values not shown above, so I used preProcess from caret to fill them with the median:

p.full.medians <- predict(preProcess(p.full, method=c("medianImpute")), p.full)

> head(p.full.medians)[,1:3]
     id timestamp full_sq
1 30474     16617   39.00
2 30475     16617   79.20
3 30476     16617   40.50
4 30477     16617   62.80
5 30478     16617   40.00
6 30479     16617   48.43

Exactly the same as above, because the rows shown here have no missing values.

But then I tried imputing with knn:

p.full.knn <- predict(preProcess(p.full, method=c("knnImpute")), p.full)
> head(p.full.knn)[,1:3]
        id timestamp    full_sq
1 1.036042 0.9665495 -0.4296467
2 1.036133 0.9665495  0.7133352
3 1.036224 0.9665495 -0.3869981
4 1.036315 0.9665495  0.2470441
5 1.036405 0.9665495 -0.4012143
6 1.036496 0.9665495 -0.1615293

Now the values across the whole data frame have changed, whereas I expected only the NA values to change.

Is this expected? Am I misunderstanding how knnImpute works?

Answer 1

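This is expected, and it is mentioned in the documentation. When you use knnImpute, the data are automatically centered and scaled, which is why you see values around zero.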

From the documentation:

preProcess can be used to impute data sets based only on information in the training set. One method of doing this is with K-nearest neighbors. For an arbitrary sample, the K closest neighbors are found in the training set and the value for the predictor is imputed using these values (e.g. using the mean). Using this approach will automatically trigger preProcess to center and scale the data, regardless of what is in the method argument.
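If the values are needed back on the original scale, the centering and scaling can be reversed by hand. The following is a minimal base R sketch of that idea, not caret's own code: the column reuses a few full_sq values from the question with one NA added for illustration, the names col_mean, col_sd, z_filled and restored are made up here, and the fill step is only a placeholder standing in for the real nearest-neighbour imputation.

```r
# Toy column: full_sq values from the question, with one NA added for illustration
full_sq <- c(39.00, 79.20, 40.50, 62.80, 40.00, NA)

# Centering and scaling: the transformation triggered by knnImpute
col_mean <- mean(full_sq, na.rm = TRUE)
col_sd   <- sd(full_sq, na.rm = TRUE)
z <- (full_sq - col_mean) / col_sd

# Placeholder for the imputation step (NOT k-nearest neighbours):
# fill the NA with 0, which is the column mean on the standardized scale
z_filled <- z
z_filled[is.na(z_filled)] <- 0

# Reverse the transformation: multiply by the sd, then add the mean back
restored <- z_filled * col_sd + col_mean

print(round(z, 4))
print(restored)
```

With a fitted preProcess object, the per-column means and standard deviations should be available as its mean and std components, so the same multiply-then-add step can be applied column by column to the output of predict(). Treat that as an assumption to check against your installed caret version.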

r

knn

missing-data

r-caret