Simulating data with a Bayesian network in R using my own specification

Suppose I have a simple DAG representing a confounder X = smoking, a treatment T, and an outcome Y = death, such that:

T ~ X
Y~T+X

Is it possible to generate a synthetic dataset of 1 million observations that follows these specified conditional probabilities?

library(dplyr)  # provides %>% and arrange()

# Pr(smoking):
smoking <- data.frame(
  smoking = c(0, 1),
  proba = c(0.7, 0.3)
)

# Pr(treatment | smoking):
treatment <- expand.grid(
  smoking = c(0, 1),
  treatment = c(0, 1)
) %>% arrange(smoking, treatment)
treatment$proba <- c(0.8,  0.2, 0.45, 0.55)

# Pr(death | treatment, smoking):
death <- expand.grid(
  treatment = c(0, 1),
  smoking = c(0,1),
  dead = c(0,1)
) %>% 
  arrange(treatment, smoking, dead)
death$proba <- c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5)
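Before simulating from tables like these, it can help to confirm that each conditional distribution sums to 1. The following is only a minimal sketch in base R: `death_check` and `row_sums` are illustrative names that do not appear in the code above, and the table is rebuilt here so the snippet runs on its own.

```r
# Base-R rebuild of the death table; expand.grid varies its first column
# fastest, which reproduces the row order of arrange(treatment, smoking, dead).
death_check <- expand.grid(dead = c(0, 1), smoking = c(0, 1), treatment = c(0, 1))
death_check$proba <- c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5)

# Sum the probabilities within each (treatment, smoking) combination.
row_sums <- aggregate(proba ~ treatment + smoking, data = death_check, FUN = sum)

# Every conditional distribution should sum to 1 (up to floating-point error).
stopifnot(isTRUE(all.equal(row_sums$proba, rep(1, nrow(row_sums)))))
```

The same check applies to the treatment table by grouping on `smoking` alone.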

I can simulate this by hand here because it is a very basic DAG, but I was wondering whether it could be done in a more scalable way, using something like bnlearn.

Current solution:

db <- data.frame(
  smoking = rbinom(n = 1000000, size = 1, prob = 0.3)
  ) 

db$treatment[db$smoking == 0] <- rbinom(n = sum(db$smoking == 0), size = 1, prob = 0.2)
db$treatment[db$smoking == 1] <- rbinom(n = sum(db$smoking == 1), size = 1, prob = 0.55)

db$dead[db$treatment == 0 & db$smoking == 0] <- rbinom(
  n = sum(db$treatment == 0 & db$smoking == 0), 
  size = 1, prob = 0.1
  )

db$dead[db$treatment == 0 & db$smoking == 1] <- rbinom(
  n = sum(db$treatment == 0 & db$smoking == 1), 
  size = 1, prob = 0.8
  )

db$dead[db$treatment == 1 & db$smoking == 0] <- rbinom(
  n = sum(db$treatment == 1 & db$smoking == 0), 
  size = 1, prob = 0.11
  )

db$dead[db$treatment == 1 & db$smoking == 1] <- rbinom(
  n = sum(db$treatment == 1 & db$smoking == 1), 
  size = 1, prob = 0.5
  )
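The manual approach needs one assignment per combination of parent values, which grows quickly with the number of parents. One possible way to shorten it, shown here only as a sketch in base R, is to store the success probabilities in lookup tables and index them with the already-sampled parent values. The names `p_treatment`, `p_dead`, and `db2` are illustrative, and `n` is reduced to keep the sketch quick.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 1e5      # smaller than 1e6 so the sketch runs quickly

# P(smoking = 1)
p_smoking <- 0.3
# P(treatment = 1 | smoking), indexed by smoking + 1
p_treatment <- c(0.2, 0.55)
# P(dead = 1 | smoking, treatment): rows = smoking, columns = treatment
p_dead <- matrix(c(0.10, 0.80,    # treatment = 0: smoking = 0, 1
                   0.11, 0.50),   # treatment = 1: smoking = 0, 1
                 nrow = 2,
                 dimnames = list(smoking = c("0", "1"), treatment = c("0", "1")))

# Sample in topological order; rbinom() is vectorized over prob.
smoking   <- rbinom(n, size = 1, prob = p_smoking)
treatment <- rbinom(n, size = 1, prob = p_treatment[smoking + 1])
dead      <- rbinom(n, size = 1, prob = p_dead[cbind(smoking + 1, treatment + 1)])

db2 <- data.frame(smoking, treatment, dead)
```

Each node is drawn once, after its parents, so adding a node means adding one lookup table and one `rbinom()` call rather than one block per parent combination.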

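It would be easier to let an existing package do this for you, such as bnlearn. You can use custom.fit to specify the DAG and the CPTs, and then use rbn to draw samples from it.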

An example:

library(bnlearn)

# Specify DAG
net <- model2network("[treatment|smoking][smoking][death|treatment:smoking]")
graphviz.plot(net)  # optional; requires the Rgraphviz package

# Define CPTs
smoking <- matrix(c(0.7, 0.3), ncol = 2, dimnames = list(NULL, c("no", "yes")))
treatment <- matrix(c(0.8,  0.2, 0.45, 0.55), ncol = 2, dimnames = list(c("no", "yes"), c("no", "yes")))
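# Dimensions are named so that the parent order (smoking, then treatment) is explicit
death <- array(c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5), c(2,2,2), dimnames=list(death = c("no", "yes"), smoking = c("no", "yes"), treatment = c("no", "yes")))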

# Build BN
fit <- custom.fit(net, dist = list(smoking = smoking, treatment = treatment, death = death))

# Draw samples
set.seed(69395642)
samples <- rbn(fit, n=1e6)
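As a rough sanity check on the simulated data, the marginal probabilities implied by the CPTs can be worked out by hand from the DAG factorization and compared with the empirical frequencies, for example `prop.table(table(samples$treatment))` and `prop.table(table(samples$death))`. The derivation below uses the question's notation (X = smoking, T = treatment, Y = death) and only the probabilities specified there.

```latex
\begin{align*}
P(X, T, Y) &= P(X)\, P(T \mid X)\, P(Y \mid T, X) \\[4pt]
P(T = 1) &= 0.7 \cdot 0.2 + 0.3 \cdot 0.55 = 0.305 \\[4pt]
P(Y = 1) &= 0.7 \cdot 0.8 \cdot 0.1 + 0.7 \cdot 0.2 \cdot 0.11
          + 0.3 \cdot 0.45 \cdot 0.8 + 0.3 \cdot 0.55 \cdot 0.5 \\
         &= 0.056 + 0.0154 + 0.108 + 0.0825 = 0.2619
\end{align*}
```

With one million samples, the share of "yes" values for treatment and death should land close to 0.305 and 0.2619, respectively.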