Why does knnImpute in preProcess change other data? Is this an error?
Here is some data:
> head(p.full)[,1:3]
id timestamp full_sq
1 30474 16617 39.00
2 30475 16617 79.20
3 30476 16617 40.50
4 30477 16617 62.80
5 30478 16617 40.00
6 30479 16617 48.43
There are some missing values not shown above, so I used preProcess from caret to fill them with medians:
p.full.medians <- predict(preProcess(p.full, method=c("medianImpute")), p.full)
> head(p.full.medians)[,1:3]
id timestamp full_sq
1 30474 16617 39.00
2 30475 16617 79.20
3 30476 16617 40.50
4 30477 16617 62.80
5 30478 16617 40.00
6 30479 16617 48.43
This is exactly the same as above, because the rows I'm showing have no missing values.
But then I tried imputing with knn:
p.full.knn <- predict(preProcess(p.full, method=c("knnImpute")), p.full)
> head(p.full.knn)[,1:3]
id timestamp full_sq
1 1.036042 0.9665495 -0.4296467
2 1.036133 0.9665495 0.7133352
3 1.036224 0.9665495 -0.3869981
4 1.036315 0.9665495 0.2470441
5 1.036405 0.9665495 -0.4012143
6 1.036496 0.9665495 -0.1615293
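Now the values across the whole data frame have changed, whereas I expected only the NA values to change.
Is this expected? Am I misunderstanding how knnImpute works?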
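This is expected, and it is mentioned in the documentation. When you use knnImpute, the data are centered and scaled automatically (which is why you see values around zero). The documentation says: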
preProcess can be used to impute data sets based only on information in the training set. One method of doing this is with K-nearest neighbors. For an arbitrary sample, the K closest neighbors are found in the training set and the value for the predictor is imputed using these values (e.g. using the mean). Using this approach will automatically trigger preProcess to center and scale the data, regardless of what is in the method argument.
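If you want the imputed data back on the original scale, one option is to reverse the centering and scaling yourself. The following is only a sketch: `toy` is a hypothetical data frame standing in for `p.full`, and it assumes the fitted preProcess object exposes the training means and standard deviations as `pp$mean` and `pp$std`. Depending on your caret version, knnImpute may also require the RANN package to be installed.

```r
library(caret)

# Hypothetical toy data standing in for p.full
set.seed(1)
toy <- data.frame(
  a = rnorm(50, mean = 100, sd = 10),
  b = rnorm(50, mean = 5, sd = 2),
  c = rnorm(50, mean = 40, sd = 15)
)
toy$c[c(3, 17)] <- NA  # introduce a couple of missing values

# knnImpute centers and scales every numeric column before imputing
pp <- preProcess(toy, method = "knnImpute")
imputed_scaled <- predict(pp, toy)

# Undo the centering and scaling column by column
# (assumes pp$mean and pp$std are named by column)
imputed <- imputed_scaled
for (col in names(pp$mean)) {
  imputed[[col]] <- imputed_scaled[[col]] * pp$std[[col]] + pp$mean[[col]]
}

# Observed values should match the originals up to floating point error;
# only the former NA cells hold imputed values
observed <- !is.na(toy$c)
all.equal(imputed$c[observed], toy$c[observed])
all.equal(imputed$a, toy$a)
```

After this step, `imputed` should look like the medianImpute result in the question, except that the missing cells are filled from the nearest neighbors rather than the column median.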