How to extract a random sample with multiple conditions that vary by group?
I have a cross-national dataset in which every respondent has at least one diary. The number of diaries per respondent, and the days on which they were completed, vary by country.
For example, in one country each respondent completed only 1 diary (half of the respondents completed it on a weekend, the other half on a weekday). In another country, every respondent completed 2 diaries (one weekend, one weekday), and in yet another everyone completed 7 diaries (one for each day of the week). There are also surveys where some respondents returned 2 diaries and others returned 3, and ones where everyone returned 4. The data look like this:
country_id <- rep(1:4, c(8, 8, 14, 10))
diarist_id <- c(11:18, rep(21:24, each = 2),
                rep(31:32, each = 7),
                rep(41:44, c(3, 3, 2, 2)))
diary_id <- c(111:118, 211, 212, 221, 222, 231, 232, 241, 242,
              311:317, 321:327, 411, 412, 413,
              421, 422, 423, 431, 432, 441, 442)
weekend <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
             0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
             0, 1, 0, 1, 0, 1, 0, 1, 0)
dat <- data.frame(country_id, diarist_id, diary_id, weekend)
I want to draw a random "one person, one diary" sample from each country, but at the country level I need roughly 29% of the diaries to be weekend diaries. How can I draw such a conditional random sample by group?
I think this will do what you want. I've chosen to split the sample for clarity; there may be a way to get what you want without doing so, but it didn't occur to me.
I'll be using data.table:
set.seed(100)
library(data.table)
setDT(dat) # turn dat into a data.table (by reference)
country_n <- 5 # how many observations you'd like per country
# split the data by weekend status
weekend.dat <- dat[weekend == TRUE]
# we have to take care that there are actually enough
#  weekend observations in each country, so we take the
#  minimum of 29% of country_n (rounded) and the total
#  number of weekend observations in that country
weekend.sample <-
  weekend.dat[weekend.dat[, .I[sample(.N, min(round(.29 * country_n), .N))],
                          by = country_id]$V1]
# repeat for the weekday sample, except take 71% this time
weekday.dat <- dat[weekend == FALSE]
weekday.sample <-
  weekday.dat[weekday.dat[, .I[sample(.N, min(round(.71 * country_n), .N))],
                          by = country_id]$V1]
# combine; setkey orders the data (as well as other
#  things that may be useful later on)
full.sample <- setkey(rbindlist(list(weekend.sample, weekday.sample)),
                      country_id, diarist_id, diary_id)
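A note on the subsetting idiom used above, in case it's unfamiliar: inside `j`, `.I` holds the original row numbers of the table, so `dt[dt[, .I[sample(.N, k)], by = g]$V1]` first collects k randomly chosen row indices per group (returned in the auto-named column `V1`) and then subsets the full table by them. A minimal sketch with toy data (the names here are illustrative, not from the question):

```r
library(data.table)
set.seed(1)
toy <- data.table(g = rep(1:2, each = 4), x = 1:8)
# .I[sample(.N, 2)]: within each group, pick 2 of that
# group's original row numbers at random; the grouped
# result stores them in the column V1
idx <- toy[, .I[sample(.N, 2)], by = g]$V1
toy[idx] # 2 randomly chosen rows per group
```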
Here's the sample this generates for my given random seed:
> full.sample
country_id diarist_id diary_id weekend
1: 1 12 112 0
2: 1 13 113 1
3: 1 14 114 0
4: 1 16 116 0
5: 1 18 118 0
6: 2 21 212 0
7: 2 22 221 1
8: 2 22 222 0
9: 2 23 232 0
10: 2 24 242 0
11: 3 31 315 0
12: 3 31 316 0
13: 3 31 317 0
14: 3 32 321 1
15: 3 32 324 0
16: 4 41 411 1
17: 4 42 421 0
18: 4 42 423 0
19: 4 43 432 0
20: 4 44 442 0
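As a quick sanity check (my addition, not part of the answer above), the realized weekend share per country can be tabulated with a grouped mean of the 0/1 indicator. The `full.sample` below is a hypothetical stand-in with the same columns:

```r
library(data.table)
# hypothetical stand-in for full.sample: 5 rows per country,
# exactly 1 weekend diary each (round(.29 * 5) == 1)
full.sample <- data.table(
  country_id = rep(1:2, each = 5),
  weekend    = c(1, 0, 0, 0, 0,  1, 0, 0, 0, 0)
)
# the mean of a 0/1 column is the proportion of 1s
full.sample[, .(weekend_share = mean(weekend)), by = country_id]
```

Here each country's share is 1/5 = 20%, which is as close to 29% as one can get with 5 diaries per country.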