对数据集进行分层采样并对训练数据集中的变量进行平均
Stratified Sampling a Dataset and Averaging a Variable within the Train Dataset
我目前正在尝试在 R 中进行分层拆分以创建训练和测试数据集。
给我的问题如下
split the data into a train and test sample such that 70% of the data
is in the train sample. To ensure a similar distribution of price
across the train and test samples, use createDataPartition from the
caret package. Set groups to 100 and use a seed of 1031. What is the
average house price in the train sample?
数据集是一组带有价格的房屋(以及其他数据点)
出于某种原因,当我 运行 以下代码时,我得到的输出在练习题模拟器中被标记为不正确。谁能发现我的代码有问题?非常感谢任何帮助,因为我正在努力避免错误地学习这门语言。
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
输出
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> #STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
> split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
我尝试使用 sample_frac
形式 dplyr
包和 Hmisc
包中的 cut2
函数手动复制它。结果几乎相同 - 仍然不相同。
看起来伪数生成器或某些舍入可能存在问题。
在我看来,您的代码看起来是正确的。
是否有可能在前面的步骤中,您应该以任何方式删除一些异常值或 pre-process 数据集。
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
group_by(cut2(diamonds$price, g= 100)) %>%
sample_frac(0.7) %>%
pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
split = createDataPartition(y = diamonds$price,p = 0.7,list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
此抽样程序的结果应该接近总体。
我目前正在尝试在 R 中进行分层拆分以创建训练和测试数据集。 给我的问题如下
split the data into a train and test sample such that 70% of the data is in the train sample. To ensure a similar distribution of price across the train and test samples, use createDataPartition from the caret package. Set groups to 100 and use a seed of 1031. What is the average house price in the train sample?
数据集是一组带有价格的房屋(以及其他数据点)
出于某种原因,当我 运行 以下代码时,我得到的输出在练习题模拟器中被标记为不正确。谁能发现我的代码有问题?非常感谢任何帮助,因为我正在努力避免错误地学习这门语言。
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
输出
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> #STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
> split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
我尝试使用 sample_frac
形式 dplyr
包和 Hmisc
包中的 cut2
函数手动复制它。结果几乎相同 - 仍然不相同。
看起来伪数生成器或某些舍入可能存在问题。
在我看来,您的代码看起来是正确的。
是否有可能在前面的步骤中,您应该以任何方式删除一些异常值或 pre-process 数据集。
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
group_by(cut2(diamonds$price, g= 100)) %>%
sample_frac(0.7) %>%
pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratefied on price, 70% in train
split = createDataPartition(y = diamonds$price,p = 0.7,list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
此抽样程序的结果应该接近总体。