How to join data with a weighted sampling?

Question

I would like to do a weighted join between two datasets:

library(tidyverse)
set.seed(1)

test.sample <- data.frame(zip=sample(1:3,50,replace = TRUE))

index.dat <- data.frame(zip=c(1,1,2,3,3,3),
                  fips=c("A1", "A2", "B", "C1", "C2","C3"),
                  prob=c(.75,.25,1,.7,.2,.1))

My expected output is a weighted sample from the index dataset:

results1 <- c(rep("A1",14),rep("A2",4),rep("B",19),rep("C1",9),rep("C2",3),"C3")

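Ultimately, I am trying to join zip codes that match multiple fips codes, drawing each fips code from a probability distribution based on population.

This comment describes the problem I am trying to overcome well:

Here is a potential solution I came up with, but given that I have billions of records, I need something more performant.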

test_function <- function(x) {
  index.dat %>%
    filter(zip == x) %>%
    sample_n(size = 1, weight = prob) %>%
    select(fips)
}

results2 <- lapply(test.sample$zip, function(x) test_function(x)) %>%
  unlist() %>%
  data.frame(fips = .)

> table(results1)
results1
A1 A2  B C1 C2 C3 
14  4 19  9  3  1 
> table(results2)
results2
A1 A2  B C1 C2 C3 
15  3 19  8  2  3

Answer 1

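You can split index.dat by zip, which gives you a list of data frames, one per zip code. If you subset this list using test.sample$zip, you get a list of 50 data frames, each one corresponding to the zip code of that row. You can then sample a fips code from each data frame, using its prob column as the weights.

In your case, it would look like this: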

sample_space <- split(index.dat, index.dat$zip)[test.sample$zip]

test.sample$fips <- sapply(sample_space, 
                           function(x) sample(x$fips, 1, prob = x$prob))

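Now test.sample$fips holds one randomly chosen fips code from the appropriate zip code, sampled according to the relative weights. If we make a table of test.sample$fips, we can see that the proportions are about right: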

table(test.sample$fips)

#> A1 A2  B C1 C2 
#> 13  5 19 10  3

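The 18 members of zip 1 have been assigned to A1 and A2 in an (almost) 75:25 split. As expected, all members of zip 2 got B, and the 13 members of zip 3 have been allocated appropriately (although, by chance, C3 was not selected because of its low probability).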
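One caveat, offered as a side note rather than as part of the approach above: subsetting the split list with a numeric vector indexes by position, which only lines up here because the example zips happen to be 1, 2 and 3. With realistic zip codes it is probably safer to index by name. A minimal sketch, where index.big and zips are hypothetical names with made-up values:

```r
# Hypothetical data with realistic-looking zip codes (not from the question)
index.big <- data.frame(zip  = c(10001, 10001, 94105),
                        fips = c("A1", "A2", "B"),
                        prob = c(.75, .25, 1))
zips <- c(94105, 10001, 94105)

# split() names each list element after its zip code
by_zip <- split(index.big, index.big$zip)

# Indexing by character looks groups up by name instead of by position
picked <- by_zip[as.character(zips)]
names(picked)
```

Each element of picked is then the data frame for the zip code of the corresponding row, which is what the sapply step needs.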

If test.sample had 5000 rows, we would find, thanks to the law of large numbers, that the proportions are much closer to the expected weights:

#>   A1   A2    B   C1   C2   C3 
#> 1257  419 1687 1153  325  159
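Since the question mentions billions of records, it may also be worth drawing all samples for a zip code in a single call rather than one call per row. The following is a sketch of that alternative, not the method used above; sample_fips_by_group is a hypothetical helper name, and it assumes the prob values are valid weights within each zip:

```r
# Example data, copied from the question
set.seed(1)
test.sample <- data.frame(zip = sample(1:3, 50, replace = TRUE))
index.dat <- data.frame(zip  = c(1, 1, 2, 3, 3, 3),
                        fips = c("A1", "A2", "B", "C1", "C2", "C3"),
                        prob = c(.75, .25, 1, .7, .2, .1))

# Hypothetical helper: one weighted draw per distinct zip, sized to its row count
sample_fips_by_group <- function(zips, index) {
  out <- rep(NA_character_, length(zips))
  # Row positions of `zips`, grouped by zip code (group names are character)
  pos_by_zip <- split(seq_along(zips), zips)
  for (z in names(pos_by_zip)) {
    rows <- index[as.character(index$zip) == z, , drop = FALSE]
    if (nrow(rows) == 0) next  # zip missing from the index: leave NA
    pos <- pos_by_zip[[z]]
    # Draw with replacement so every row gets an independent weighted pick
    picks <- sample.int(nrow(rows), length(pos), replace = TRUE, prob = rows$prob)
    out[pos] <- as.character(rows$fips)[picks]
  }
  out
}

test.sample$fips <- sample_fips_by_group(test.sample$zip, index.dat)
table(test.sample$fips)
```

The loop runs once per distinct zip code rather than once per row, so the number of sample.int calls depends on the number of zip codes and not on the size of test.sample. The counts will differ from the tables above because the random draws happen in a different order.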
