Resampling data to match a population profile on two demographic variables (sex and age) in R
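I'm struggling with what I imagine is a multi-level sampling procedure in R.
Suppose I have a dataset that was collected with a heavily biased sampling method, so the results obtained from the participants are biased. I want to adjust the dataset to match the population on two demographic variables (sex and age), both of which are coded as factors in the dataset. The image below depicts the situation.
I assume I need to do some kind of "loop" computation. For example, to adjust the sample size of the first age group (15-19), I need to define a new total such that the final sample meets the 50%/50% split. The same procedure is needed for every other age group.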
That's the most related topic I've found.
x<-structure(list(age_cat = c("25-29", "30-34", "25-29", "20-24",
"25-29", "20-24", "35-39", "30-34", "25-29", "30-34", "25-29",
"30-34", "35-39", "45-49", "40-45", "20-24", "20-24", "25-29",
"35-39", "35-39", "25-29", "20-24", "30-34", "30-34", "40-45",
"25-29", "25-29", "25-29", "20-24", "40-45", "20-24", "40-45",
"30-34", "25-29", "45-49", "30-34", "45-49", "40-45", "25-29",
"35-39", "40-45", "25-29", "45-49", "35-39", "45-49", "40-45",
"20-24", "45-49", "40-45", "25-29", "35-39", "30-34", "30-34",
"25-29", "20-24", "20-24", "40-45", "35-39", "25-29", "25-29",
"20-24", "40-45", "20-24", "20-24", "45-49", "20-24", "35-39",
"20-24", "35-39", "45-49", "15-19", "45-49", "35-39", "35-39",
"30-34", "35-39", "45-49", "35-39", "30-34", "20-24", "35-39",
"40-45", "40-45", "40-45", "30-34", "45-49", "20-24", "30-34",
"45-49", "35-39", "20-24", "20-24", "20-24", "45-49", "20-24",
"45-49", "35-39", "25-29", "40-45", "40-45", "25-29", "35-39",
"45-49", "30-34", "45-49", "45-49", "45-49", "15-19", "30-34",
"45-49", "30-34", "30-34", "35-39", "25-29", "40-45", "15-19",
"20-24", "20-24", "40-45", "40-45", "45-49", "45-49", "35-39",
"40-45", "30-34", "35-39", "35-39", "25-29", "25-29", "20-24",
"20-24", "40-45", "20-24", "35-39", "20-24", "20-24", "30-34",
"25-29", "45-49", "25-29", "35-39", "20-24", "35-39", "35-39",
"35-39", "40-45", "35-39", "35-39", "20-24", "30-34", "25-29",
"15-19", "30-34", "35-39", "15-19", "20-24", "20-24", "35-39",
"25-29", "25-29", "25-29", "25-29", "30-34", "40-45", "35-39",
"30-34", "35-39", "40-45", "25-29", "30-34", "25-29", "25-29",
"45-49", "30-34", "30-34", "25-29", "15-19", "25-29", "20-24",
"15-19", "20-24", "30-34", "20-24", "40-45", "25-29", "25-29",
"30-34", "30-34", "25-29", "20-24", "40-45", "45-49", "25-29",
"25-29", "40-45", "35-39", "25-29", "45-49", "35-39", "30-34",
"45-49", "30-34", "30-34", "45-49", "35-39", "20-24", "45-49",
"30-34", "25-29", "45-49", "45-49", "40-45", "25-29", "20-24",
"40-45", "30-34", "35-39", "30-34", "20-24", "35-39", "20-24",
"30-34", "20-24", "35-39", "35-39", "30-34", "45-49", "40-45",
"45-49", "25-29", "35-39", "40-45", "30-34", "35-39", "30-34",
"35-39", "20-24", "25-29", "35-39", "30-34", "30-34", "25-29",
"45-49", "45-49", "40-45", "40-45", "35-39", "30-34", "25-29",
"35-39", "20-24", "40-45", "20-24", "30-34", "40-45", "20-24",
"45-49", "20-24", "40-45", "25-29", "40-45", "25-29", "45-49",
"30-34", "30-34", "45-49", "40-45", "30-34", "30-34", "20-24",
"20-24", "35-39", "30-34", "15-19", "35-39", "25-29", "45-49",
"30-34", "25-29", "35-39", "15-19", "40-45", "45-49", "15-19",
"35-39", "45-49", "45-49", "25-29"), sex_cat = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L,
2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M",
"F"), class = "factor")), row.names = c(NA, -288L), class = c("tbl_df",
"tbl", "data.frame"))
Well, this was a fun one! Here is what I did:
library(tidyverse)
library(data.table)
library(splitstackshape)

# Add a row id and a sex label ("N_M"/"N_F") that matches the count columns created below
x <- x %>% mutate(id = row_number(),
                  sex_cats = paste("N", sex_cat, sep = "_"))
x_dt <- data.table(x)

# Cell counts, used only to check that the data were copied correctly
x_cts <- x %>% count(age_cat, sex_cat)

# Mock target shares: one row per age_cat x sex_cat; each age_cat pair sums to 1
x_raw <- data.frame(age_cat = rep(unique(x_cts$age_cat), each = 2),
                    sex_cat = rep(unique(x_cts$sex_cat), times = length(unique(x_cts$age_cat))),
                    percents = c(0.5, 0.5, 0.8, 0.2, 0.34, 0.66, 0.5, 0.5, 0.75, 0.25, 0.5, 0.5, 0.6, 0.4))

x_raw_wd <- x_raw %>% pivot_wider(names_from = sex_cat, values_from = percents, names_prefix = "per_")

# Desired sample size per age_cat (entered by hand); it must exist before the counts are computed
x_raw_wd$total_n <- c(6, 30, 30, 30, 20, 10, 20)
x_raw_wd <- x_raw_wd %>% mutate(N_M = round(per_M * total_n),
                                N_F = round(per_F * total_n))

# Back to long format: one row per age_cat x sex_cats, with the target count in `value`
x_raw_wd_fin <- x_raw_wd %>%
  select(age_cat, N_M, N_F) %>%
  pivot_longer(cols = starts_with("N_"), names_to = "sex_cats") %>%
  arrange(age_cat, sex_cats)
x_raw_wd_dt <- data.table(x_raw_wd_fin)

# Sample each KEY stratum down to its target size (named vector: KEY -> count)
x_dt[, KEY := paste(age_cat, sex_cats)]
stratified(x_dt, "KEY",
           size = with(x_raw_wd_dt, setNames(value, paste(age_cat, sex_cats))),
           keep.rownames = TRUE)
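Someone may well be better with data.table than I am, but what I did here was first create an id column and sex_cats. sex_cats is used later, so just keep it around for now. x_cts was created only to check that the data you posted were copied and pasted correctly.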
Then I created x_raw, a mock version of what you asked for: it holds one percents value for each sex_cat within each age_cat. Within each age_cat these must add up to 100%.
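Since the shares are typed in by hand, it may be worth checking the "adds up to 100%" condition before sampling. A minimal sketch in base R, assuming a small made-up table called targets (a hypothetical name, with illustrative shares) laid out like x_raw:

```r
# Hypothetical target table: one row per age_cat x sex_cat (values are illustrative)
targets <- data.frame(
  age_cat  = rep(c("15-19", "20-24", "25-29"), each = 2),
  sex_cat  = rep(c("M", "F"), times = 3),
  percents = c(0.5, 0.5, 0.8, 0.2, 0.34, 0.66)
)

# Sum of the shares within each age group
share_sums <- tapply(targets$percents, targets$age_cat, sum)

# Age groups whose shares do not add up to 1 (small tolerance for floating point)
bad_groups <- names(share_sums)[abs(share_sums - 1) > 1e-8]
bad_groups
```

An empty bad_groups means every age group is consistent; any name that shows up points at a pair of shares to fix.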
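Then I pivot_wider percents into wide format, one column per sex_cat. Next I mock up the number of samples you want from each age_cat: total_n is entered by hand, so feel free to change the number for any age_cat. From there we compute, for each sex_cat, the number of samples to draw (N_M and N_F) in x_raw_wd.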
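The arithmetic behind N_M and N_F can be looked at on its own. The sketch below reuses the shares and per-age totals from the code above as plain vectors (no tidyverse needed) and assumes, as in the mock data, that the female share is simply one minus the male share:

```r
# Shares and per-age totals copied from the answer's mock values
per_M   <- c(0.5, 0.8, 0.34, 0.5, 0.75, 0.5, 0.6)
per_F   <- 1 - per_M                    # assumption: only two sex categories
total_n <- c(6, 30, 30, 30, 20, 10, 20)

# Target counts per cell, rounded to whole participants
N_M <- round(per_M * total_n)
N_F <- round(per_F * total_n)

# Per-age check that rounding did not change the group total
data.frame(total_n, N_M, N_F, drift = N_M + N_F - total_n)
```

With these particular values the drift column is zero everywhere. With other inputs, rounding both cells independently can leave a group one off its total (for example, a 50/50 split of 5), so the check is worth keeping.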
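We then put it back into long format, because the stratified function in splitstackshape requires it. If you look at the names_to option, the column names it moves into rows are N_M or N_F, which differ from the values of sex_cat (sex_cat = 'M', 'F'). That is why we created sex_cats at the start.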
Finally, we pass everything to stratified. We create a KEY column to link x_raw_wd_fin$value, the number of samples needed for each age_cat and sex_cat, to the combination of age_cat and sex_cat for each observation in x.
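To make the KEY idea concrete, here is a rough base-R sketch of the same kind of draw: a named vector maps each KEY to a count, and that many rows are sampled from each stratum. This is not how stratified is implemented, only an illustration of the technique; toy, sizes, pick and toy_sample are made-up names and the data are invented.

```r
set.seed(1)

# Toy data (invented): 2 age groups x 2 sexes, 10 rows per cell
toy <- data.frame(
  id      = 1:40,
  age_cat = rep(c("15-19", "20-24"), each = 20),
  sex_cat = rep(c("M", "F"), times = 20)
)
toy$KEY <- paste(toy$age_cat, toy$sex_cat)

# Named vector: names are KEY values, values are the rows wanted per stratum
sizes <- c("15-19 M" = 3, "15-19 F" = 3, "20-24 M" = 8, "20-24 F" = 2)

# For each KEY, sample row positions without replacement,
# capped at the number of rows actually available in that stratum
pick <- unlist(lapply(names(sizes), function(k) {
  rows <- which(toy$KEY == k)
  n    <- min(sizes[[k]], length(rows))
  rows[sample.int(length(rows), n)]
}))

toy_sample <- toy[pick, ]
table(toy_sample$KEY)
```

The table at the end should show each KEY with exactly the count requested in sizes, which is the same check you can run on the real output.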
With my percentages, which were mostly made up for demonstration purposes, I needed 146 samples.
Here is my output:
age_cat sex_cat id paste("N", sex_cat) KEY sex_cats
1: 15-19 F 281 N F 15-19 N_F N_F
2: 15-19 F 155 N F 15-19 N_F N_F
3: 15-19 F 177 N F 15-19 N_F N_F
4: 15-19 M 108 N M 15-19 N_M N_M
5: 15-19 M 284 N M 15-19 N_M N_M
---
142: 45-49 M 105 N M 45-49 N_M N_M
143: 45-49 M 37 N M 45-49 N_M N_M
144: 45-49 M 207 N M 45-49 N_M N_M
145: 45-49 M 173 N M 45-49 N_M N_M
146: 45-49 M 103 N M 45-49 N_M N_M