How to sample/partition panel data by individuals (preferably with the caret library)?
I would like to partition panel data while preserving the panel structure of the data:
library(caret)
library(mlbench)
# example panel data where id is the person's identifier across years
data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
## Here, for instance, the dependent variable is WORKING
inTrain <- createDataPartition(y = data$WORKING, p = .75,list = FALSE)
# subset into training
training <- data[ inTrain,]
# subset into testing
testing <- data[-inTrain,]
# Here we see some intersections of identifiers
str(training$id[10:20])
str(testing$id)
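However, when partitioning or sampling the data, I would like to avoid splitting the same person (id) across two data sets. Is there a way to randomly sample/partition the data so that individuals, rather than observations, are assigned to the respective partitions?
I tried sampling: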
mysample <- data[sample(unique(data$id), 1000,replace=FALSE),]
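However, this destroys the panel nature of the data...
I think there is a small bug in the approach using sample(): it uses the values of the id variable as row numbers. Instead, you need to select all rows that belong to the sampled IDs: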
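nID <- length(unique(data$id))  # number of distinct individuals
p <- 0.75                       # share of individuals assigned to training
set.seed(123)                   # for reproducibility
# draw ids, then keep every row that belongs to a drawn id
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ]
testing <- data[!data$id %in% inTrainID, ]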
head(training[, 1:5], 10)
#    id FEMALE YEAR AGE   HANDDUM
# 1   1      0 1984  54 0.0000000
# 2   1      0 1985  55 0.0000000
# 3   1      0 1986  56 0.0000000
# 8   3      1 1984  58 0.1687193
# 9   3      1 1986  60 1.0000000
# 10  3      1 1987  61 0.0000000
# 11  3      1 1988  62 1.0000000
# 12  4      1 1985  29 0.0000000
# 13  5      0 1987  27 1.0000000
# 14  5      0 1988  28 0.0000000
dim(data)
# [1] 27326 41
dim(training)
# [1] 20566 41
dim(testing)
# [1] 6760 41
20566/27326
### 75.26% of the rows were selected for training
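As a quick sanity check, the id-level split can be wrapped in a small helper and tested on a toy panel. This is only a sketch: `toy` and `split_by_id` are hypothetical names made up for illustration, not part of caret, and the toy data stands in for the healthcare file.

```r
# Hypothetical toy panel: 6 individuals observed for 3 years each
toy <- data.frame(
  id   = rep(c(101, 102, 103, 104, 105, 106), each = 3),
  YEAR = rep(1984:1986, times = 6)
)

# Hypothetical helper: assign whole individuals, not single rows, to a partition
split_by_id <- function(df, id_col = "id", p = 0.75) {
  ids      <- unique(df[[id_col]])
  n_train  <- round(length(ids) * p)
  # index into ids instead of calling sample(ids, ...) directly, which
  # behaves differently when ids is a single number
  train_id <- ids[sample.int(length(ids), n_train)]
  in_train <- df[[id_col]] %in% train_id
  list(training = df[in_train, , drop = FALSE],
       testing  = df[!in_train, , drop = FALSE])
}

set.seed(123)
parts <- split_by_id(toy, p = 0.5)

# No individual may appear in both partitions
overlap <- intersect(parts$training$id, parts$testing$id)
length(overlap)  # 0 by construction

# Every selected individual keeps all of its yearly rows
table(parts$training$id)

# For contrast: using ids as row numbers selects unrelated rows
# (here rows 101..106 do not exist, so the result is all NA)
wrong <- toy[sample(unique(toy$id), 3), ]
```

On the real data the same overlap check would be length(intersect(training$id, testing$id)), which should return 0.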
Let's check the class balance, since createDataPartition would keep the class balance of WORKING equal across all sets.
table(data$WORKING) / nrow(data)
#         0         1
# 0.3229525 0.6770475
#
table(training$WORKING) / nrow(training)
#         0         1
# 0.3226685 0.6773315
#
table(testing$WORKING) / nrow(testing)
#         0         1
# 0.3238166 0.6761834
### virtually equal
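The proportions match closely here, but sampling ids does not stratify on WORKING, so this is not guaranteed, especially in small panels. If the balance matters, one possible approach (my own assumption, not a documented caret recipe for panel data) is to reduce the outcome to one value per individual and let createDataPartition stratify on that. The toy data and the majority-class rule below are made up for illustration.

```r
library(caret)

# Hypothetical toy panel: 40 individuals, 5 yearly observations each
toy <- data.frame(
  id      = rep(1:40, each = 5),
  WORKING = c(rep(c(0, 0, 0, 1, 0), times = 12),  # ids 1-12: mostly not working
              rep(c(1, 1, 0, 1, 1), times = 28))  # ids 13-40: mostly working
)

# One row per individual: share of years worked, reduced to a majority class
share  <- tapply(toy$WORKING, toy$id, mean)
per_id <- data.frame(
  id    = as.integer(names(share)),
  major = factor(as.integer(share > 0.5))
)

set.seed(123)
# Stratified sampling of individuals (not rows) on their majority class
idx      <- createDataPartition(per_id$major, p = 0.75, list = FALSE)
train_id <- per_id$id[idx[, 1]]

training <- toy[toy$id %in% train_id, ]
testing  <- toy[!toy$id %in% train_id, ]

# Compare the row-level class shares in both partitions
table(training$WORKING) / nrow(training)
table(testing$WORKING) / nrow(testing)
```

The stratification is on the per-individual summary, so the row-level shares are only approximately balanced when individuals have different numbers of observations.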
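I thought I would point out caret's groupKFold function for anyone looking at this; it is handy for cross-validation with this kind of data. From the documentation:
"To split the data based on groups, groupKFold can be used: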
set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
folds <- groupKFold(subjects, k = 15)
The results in folds can be used as inputs into the index argument of the trainControl function."
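For panel data the grouping variable would be the person id. A minimal sketch of how that might look, assuming a hypothetical toy panel (`toy`, with made-up columns x and y) and a plain linear model chosen only to keep the example small:

```r
library(caret)

set.seed(3527)
# Hypothetical toy panel: 20 individuals, 4 observations each
toy <- data.frame(id = rep(1:20, each = 4), x = rnorm(80))
toy$y <- 2 * toy$x + rnorm(80)

# Each list element holds the training row indices for one fold
folds <- groupKFold(toy$id, k = 5)

# Check: ids in the held-out rows never appear among the training rows
no_leak <- vapply(
  folds,
  function(train_rows) {
    length(intersect(toy$id[train_rows], toy$id[-train_rows])) == 0
  },
  logical(1)
)
all(no_leak)

# Pass the grouped folds to train() through trainControl's index argument
ctrl <- trainControl(method = "cv", index = folds)
fit  <- train(y ~ x, data = toy, method = "lm", trControl = ctrl)
fit$resample
```

Because all rows of an individual fall on the same side of every fold, the resampled performance is measured on people the model has not seen.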