R - fill missing information based on paired data

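I am trying to fill in missing cases based on the information of a related individual (the other member of the couple).

My data looks like this: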

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27 Not eligible
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes

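So hserial is the couple identifier. Row 8 has a missing case, Not eligible, but that information is available from the partner (row 7).

I am trying to find a neat way to fill in these missing values with the partner's information.

I was thinking of doing something like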
library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr

childsum = dta %>% group_by(hserial, sex, children) %>%
summarise(n = n()) %>% spread(sex, children)

which would give me

  hserial n Male       Female
1 1001041 1  Yes          Yes
2 1001061 1   No           No
3 1001091 1  Yes          Yes
4 1001151 1   No Not eligible
5 1001161 1  Yes          Yes

Then I could do something like
childsum$Male = ifelse(childsum$Male == 'Not eligible', childsum$Female, childsum$Male)
childsum$Female = ifelse(childsum$Female == 'Not eligible', childsum$Male, childsum$Female)

So for each missing value in Male, fill in the Female information, and vice versa. Then merge the result back, so as to get

   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27           No
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes

Any idea how to do this in a neat way?

dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061, 
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27, 
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L, 
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer", 
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial", 
"sex", "age", "children"), row.names = c(NA, -10L))

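Here's an approach which assumes that any couple (two rows sharing an hserial) should always have the same yes/no entry in children, unless both people have Not eligible entries. It therefore computes, for each couple, the setdiff of the available children information and "Not eligible". Where all (both) entries are "Not eligible", it returns NA, since I think that is the better way to handle missing values: as you know, there are many specialized functions you can use for NA that you can't use in the same way for Not eligible entries.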

dta %>% 
  group_by(hserial) %>% 
  mutate(children = if(all(children == "Not eligible")) NA_character_ else 
                       setdiff(children, "Not eligible"))
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
#   hserial    sex   age children
#     (dbl) (fctr) (dbl)    (chr)
#1  1001041   Male    30      Yes
#2  1001041 Female    32      Yes
#3  1001061   Male    22       No
#4  1001061 Female    21       No
#5  1001091   Male    38      Yes
#6  1001091 Female    37      Yes
#7  1001151   Male    31       No
#8  1001151 Female    27       No
#9  1001161   Male    33      Yes
#10 1001161 Female    35      Yes
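
Note that the approach above relies on the two partners never giving conflicting answers: if one says Yes and the other No, setdiff returns two values for a two-row group. As a rough sketch of a more defensive variant, here is a base R version that only overwrites the "Not eligible" rows and leaves a couple untouched when the known answers disagree. The helper name fill_from_partner and its missing_code argument are made up for illustration, and the toy data stores children as character rather than as a factor.

```r
# Toy data mirroring the question (children as character, not factor).
dta <- data.frame(
  hserial  = c(1001041, 1001041, 1001061, 1001061, 1001091,
               1001091, 1001151, 1001151, 1001161, 1001161),
  sex      = rep(c("Male", "Female"), times = 5),
  age      = c(30, 32, 22, 21, 38, 37, 31, 27, 33, 35),
  children = c("Yes", "Yes", "No", "No", "Yes",
               "Yes", "No", "Not eligible", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the gaps within one couple's answers.
fill_from_partner <- function(x, missing_code = "Not eligible") {
  gap   <- !is.na(x) & x == missing_code
  known <- unique(x[!gap & !is.na(x)])
  if (length(known) == 0) {
    # Nobody in the couple gave an answer: mark everything as NA.
    return(rep(NA_character_, length(x)))
  }
  if (length(known) == 1) {
    # Exactly one distinct answer: copy it into the gaps.
    x[gap] <- known
  }
  # Two or more distinct answers are ambiguous: leave x unchanged.
  x
}

# ave() applies the helper within each hserial group and keeps row order.
dta$children <- ave(dta$children, dta$hserial, FUN = fill_from_partner)
print(dta)
```

To use this on the dta from the question, convert the factor first with dta$children <- as.character(dta$children), since the helper assigns plain strings and NA_character_.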