Scale in dataset

Question

Which variables should be scaled in a dataset?

Answer 1

I think this answers your question from a theoretical standpoint.

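Keep in mind that if you want to build a statistical model, you will probably split your data into a training set and a test set (and possibly a validation set). In that case, scale the training set on its own first, and then scale the test set using the mean and standard deviation of the training set! This avoids "leaking" information from the test set into the training set.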

From a coding perspective [a topic better suited to Stack Overflow]:


# split 80-20 of training set and test set
p <- 0.8

# set seed for reproducibility
set.seed(1)
trn_rows <- sample(nrow(mtcars), floor(nrow(mtcars) * p))

# training and test sets
trn <- mtcars[trn_rows, ]
tst <- mtcars[-trn_rows, ]

# calc mean and sd for each column of the training set
mean_trn <- apply(trn, 2, mean)
sd_trn   <- apply(trn, 2, sd)

# scale training and test sets using the training set's mean and sd
trn_scaled <- scale(trn, center = mean_trn, scale = sd_trn)
tst_scaled <- scale(tst, center = mean_trn, scale = sd_trn)
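
As a quick sanity check on the approach above, here is a minimal sketch that rebuilds the same split and confirms that only the training set ends up with mean 0 and sd 1. It assumes you let scale() compute the training parameters itself and then read them back from the "scaled:center" and "scaled:scale" attributes it stores on its result; that is an alternative to calling apply() by hand, not something the code above does.

```r
# Rebuild the 80-20 split so this check runs on its own
set.seed(1)
trn_rows <- sample(nrow(mtcars), floor(nrow(mtcars) * 0.8))
trn <- mtcars[trn_rows, ]
tst <- mtcars[-trn_rows, ]

# With default arguments, scale() centers each column by its own mean
# and divides by its own sd
trn_scaled <- scale(trn)

# scale() keeps the parameters it used as attributes on the result
mean_trn <- attr(trn_scaled, "scaled:center")
sd_trn   <- attr(trn_scaled, "scaled:scale")

# Reuse the training parameters for the test set
tst_scaled <- scale(tst, center = mean_trn, scale = sd_trn)

# Training columns: mean 0 and sd 1 (up to floating point error)
round(colMeans(trn_scaled), 10)
apply(trn_scaled, 2, sd)

# Test columns: generally NOT exactly 0 and 1, and that is expected
round(colMeans(tst_scaled), 2)
```

The test set not being perfectly centered is fine: it has been put on the same scale as the training data without using any of its own statistics.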


r

data-mining

data-science