How to extract a random sample with multiple conditions that vary by group?
I have a cross-national dataset in which every respondent has at least one diary. The number of diaries per respondent, and the days on which they were completed, vary by country.
For example, in one country each respondent completed only 1 diary (half of the respondents completed it on a weekend, the other half on a weekday). In another country, every respondent completed 2 diaries (one weekend, one weekday), and in yet another everyone completed 7 diaries (one for each day of the week). There are also surveys where some respondents returned 2 diaries and others returned 3, and ones where everyone returned 4. The data look like this:
country_id <- rep(1:4, c(8, 8, 14, 10))
diarist_id <- c(11:18, rep(21:24, each = 2),
                rep(31:32, each = 7),
                rep(41:44, c(3, 3, 2, 2)))
diary_id <- c(111:118, 211, 212, 221, 222, 231, 232, 241, 242,
              311:317, 321:327, 411, 412, 413,
              421, 422, 423, 431, 432, 441, 442)
weekend <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
             0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
             0, 1, 0, 1, 0, 1, 0, 1, 0)
dat <- data.frame(country_id, diarist_id, diary_id, weekend)
I want to draw a random "one person, one diary" sample from each country, but at the country level I need roughly 29% of the diaries to be weekend diaries. How can I draw such a conditional random sample by group?
I think this will do what you want. I've chosen to split the sample for clarity; there may be a way to get what you want without doing so, but it didn't occur to me.
I'll be using data.table:
set.seed(100)
library(data.table)
setDT(dat) # turn dat into a data.table (by reference)
country_n <- 5 # how many observations you'd like per country
# split the data by weekend status
weekend.dat <- dat[weekend == TRUE]
# we have to take care that there are actually enough
#  weekend observations in each country, so we take the
#  minimum of 29% of country_n (rounded) and the total
#  number of weekend observations in that country
weekend.sample <-
  weekend.dat[weekend.dat[, .I[sample(.N, min(round(.29 * country_n), .N))],
                          by = country_id]$V1]
# repeat for the weekday sample, except take 71% this time
weekday.dat <- dat[weekend == FALSE]
weekday.sample <-
  weekday.dat[weekday.dat[, .I[sample(.N, min(round(.71 * country_n), .N))],
                          by = country_id]$V1]
# combine; setkey orders the data (as well as other
#  things that may be useful later on)
full.sample <- setkey(rbindlist(list(weekend.sample, weekday.sample)),
                      country_id, diarist_id, diary_id)
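A note on the subsetting idiom used above, in case it's unfamiliar: inside `j`, `.I` holds the original row numbers of the table, so `dt[dt[, .I[sample(.N, k)], by = g]$V1]` first collects k randomly chosen row indices per group (returned in the auto-named column `V1`) and then subsets the full table by them. A minimal sketch with toy data (the names here are illustrative, not from the question):

```r
library(data.table)
set.seed(1)
toy <- data.table(g = rep(1:2, each = 4), x = 1:8)
# .I[sample(.N, 2)]: within each group, pick 2 of that
# group's original row numbers at random; the grouped
# result stores them in the column V1
idx <- toy[, .I[sample(.N, 2)], by = g]$V1
toy[idx] # 2 randomly chosen rows per group
```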
Here's the sample this generates for my given random seed:
> full.sample
country_id diarist_id diary_id weekend
1: 1 12 112 0
2: 1 13 113 1
3: 1 14 114 0
4: 1 16 116 0
5: 1 18 118 0
6: 2 21 212 0
7: 2 22 221 1
8: 2 22 222 0
9: 2 23 232 0
10: 2 24 242 0
11: 3 31 315 0
12: 3 31 316 0
13: 3 31 317 0
14: 3 32 321 1
15: 3 32 324 0
16: 4 41 411 1
17: 4 42 421 0
18: 4 42 423 0
19: 4 43 432 0
20: 4 44 442 0
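As a quick sanity check (my addition, not part of the answer above), the realized weekend share per country can be tabulated with a grouped mean of the 0/1 indicator. The `full.sample` below is a hypothetical stand-in with the same columns:

```r
library(data.table)
# hypothetical stand-in for full.sample: 5 rows per country,
# exactly 1 weekend diary each (round(.29 * 5) == 1)
full.sample <- data.table(
  country_id = rep(1:2, each = 5),
  weekend    = c(1, 0, 0, 0, 0,  1, 0, 0, 0, 0)
)
# the mean of a 0/1 column is the proportion of 1s
full.sample[, .(weekend_share = mean(weekend)), by = country_id]
```

Here each country's share is 1/5 = 20%, which is as close to 29% as one can get with 5 diaries per country.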