Stratified Sampling a Dataset and Averaging a Variable within the Train Dataset

I am currently trying to do a stratified split in R to create train and test datasets. The problem I was given is as follows:

split the data into a train and test sample such that 70% of the data is in the train sample. To ensure a similar distribution of price across the train and test samples, use createDataPartition from the caret package. Set groups to 100 and use a seed of 1031. What is the average house price in the train sample?

The dataset is a set of houses with their prices (along with other data points).

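For some reason, when I run the following code, the output I get is marked as incorrect in the practice problem simulator. Can anyone spot a problem with my code? Any help is much appreciated, as I am trying to avoid learning this language incorrectly.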

dput(head(houses))

library(ISLR); library(caret); library(caTools)
options(scipen=999)

set.seed(1031)
# Stratified random sampling: 100 groups, stratified on price, 70% in train
split = createDataPartition(y = houses$price, p = 0.7, list = FALSE, groups = 100)

train = houses[split,]
test = houses[-split,]

nrow(train)
nrow(test)
nrow(houses)

mean(train$price)
mean(test$price)

Output

> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875, 
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000, 
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1, 
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680, 
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1, 
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0, 
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7, 
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680, 
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955, 
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0, 
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
> 
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
> 
> set.seed(1031)
> # Stratified random sampling: 100 groups, stratified on price, 70% in train
> split = createDataPartition(y = houses$price, p = 0.7, list = FALSE, groups = 100)
> 
> train = houses[split,]
> test = houses[-split,]
> 
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
> 
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6

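I tried to replicate it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical. It looks like the pseudo-random number generator or some rounding may be the issue. In my opinion, your code looks correct. Is it possible that in an earlier step you were supposed to remove some outliers or pre-process the dataset in some other way?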

library(caret)
options(scipen=999)

library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)

diamonds$index = 1:nrow(diamonds)

set.seed(1031)

# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups

split = diamonds %>%
  group_by(cut2(price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)

train = diamonds[split,]
test = diamonds[-split,]

> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917

set.seed(1031)
# Stratified random sampling: 100 groups, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price, p = 0.7, list = TRUE, groups = 100)


train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]

> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
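
As a dependency-free way to see the same behavior, here is a rough base R sketch of the quantile-group idea. It runs on simulated prices and compares the train and test means with the overall mean. The data and the names `price`, `pick_70`, and `train_idx` are made up for illustration, and this is not a description of how caret implements createDataPartition internally.

```r
# Simulated, right-skewed "prices" (hypothetical data, not the houses dataset)
set.seed(1031)
price <- round(rlnorm(5000, meanlog = 13, sdlog = 0.5))

# Bin the prices into (up to) 100 quantile groups
breaks <- unique(quantile(price, probs = seq(0, 1, length.out = 101)))
grp <- cut(price, breaks = breaks, include.lowest = TRUE)

# Draw about 70% of the row indices within each group
pick_70 <- function(idx) {
  idx[sample.int(length(idx), ceiling(0.7 * length(idx)))]
}
train_idx <- unlist(
  lapply(split(seq_along(price), grp, drop = TRUE), pick_70),
  use.names = FALSE
)

# Compare the three means
print(c(
  overall = mean(price),
  train   = mean(price[train_idx]),
  test    = mean(price[-train_idx])
))
```

Because every price band contributes roughly the same share of rows to each side, the three means should land close together, which is what the caret and dplyr versions above show as well.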

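With this sampling procedure, the train and test means should both be close to the overall mean of the full dataset.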
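
On the pseudo-random generator point: one known source of this kind of mismatch is that R 3.6.0 changed the default algorithm behind sample() from "Rounding" to "Rejection", so the same seed can pick different rows on different R versions. Whether that is what happened here is only an assumption, since the thread does not say which R version the simulator expects. A minimal sketch to compare the two samplers:

```r
# Requires R >= 3.6.0, where set.seed() accepts sample.kind.
# "Rounding" is the pre-3.6.0 sampler; R warns when it is selected,
# so the warning is suppressed here.
suppressWarnings(set.seed(1031, sample.kind = "Rounding"))
draw_rounding <- sample(1:100, 5)

# "Rejection" is the default from R 3.6.0 onward; this call also
# switches the session back to the default sampler.
set.seed(1031, sample.kind = "Rejection")
draw_rejection <- sample(1:100, 5)

print(draw_rounding)
print(draw_rejection)
```

If the two draws differ on your machine, it may be worth rerunning the original split after set.seed(1031, sample.kind = "Rounding") and checking whether the simulator accepts that train mean.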