R - fill missing information based on paired data
I am trying to fill in missing cases based on information from a related individual (the other member of a couple).
My data looks like this:
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 Not eligible
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
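So hserial is the couple identifier. Row 8 has a missing case, coded 'Not eligible', but the information is available from the partner (row 7).
I am trying to find a neat way to fill in these missing cases with the partner's information.
I was thinking of doing something like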
library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr
childsum = dta %>% group_by(hserial, sex, children) %>%
summarise(n = n()) %>% spread(sex, children)
which would give me
hserial n Male Female
1 1001041 1 Yes Yes
2 1001061 1 No No
3 1001091 1 Yes Yes
4 1001151 1 No Not eligible
5 1001161 1 Yes Yes
Then I could do something like
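# as.character() guards against ifelse() returning integer codes if the columns are factors
childsum$Male = ifelse(childsum$Male == 'Not eligible', as.character(childsum$Female), as.character(childsum$Male))
childsum$Female = ifelse(childsum$Female == 'Not eligible', as.character(childsum$Male), as.character(childsum$Female))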
So for each missing value in Male, fill in the Female information, and vice versa.
Then merge the result back so as to get
hserial sex age children
1 1001041 Male 30 Yes
2 1001041 Female 32 Yes
3 1001061 Male 22 No
4 1001061 Female 21 No
5 1001091 Male 38 Yes
6 1001091 Female 37 Yes
7 1001151 Male 31 No
8 1001151 Female 27 No
9 1001161 Male 33 Yes
10 1001161 Female 35 Yes
Any idea of a neat way to do this?
dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061,
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27,
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L,
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer",
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial",
"sex", "age", "children"), row.names = c(NA, -10L))
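Here's an approach that assumes the two members of any couple (two rows sharing an hserial) should always have the same Yes/No entry in children, unless both of them have a 'Not eligible' entry. For each couple it therefore computes the setdiff of the available children values and 'Not eligible'. When all (both) entries are "Not eligible", it returns NA, since I assume that is a better way of handling missing values (as you know, there are many specialized functions for NA that you can't use in the same way on 'Not eligible' entries).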
dta %>%
group_by(hserial) %>%
mutate(children = if(all(children == "Not eligible")) NA_character_ else
setdiff(children, "Not eligible"))
#Source: local data frame [10 x 4]
#Groups: hserial [5]
#
# hserial sex age children
# (dbl) (fctr) (dbl) (chr)
#1 1001041 Male 30 Yes
#2 1001041 Female 32 Yes
#3 1001061 Male 22 No
#4 1001061 Female 21 No
#5 1001091 Male 38 Yes
#6 1001091 Female 37 Yes
#7 1001151 Male 31 No
#8 1001151 Female 27 No
#9 1001161 Male 33 Yes
#10 1001161 Female 35 Yes
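The same idea can be sketched as a small helper, which may be useful if more than one code should count as missing or if partners might disagree. This is only an illustration under assumptions: fill_from_partner and its missing_codes argument are hypothetical names, the couples data frame is a cut-down stand-in for dta, and couple 1001171 is made up to show the case where both partners are 'Not eligible'. It uses base R's ave() so it runs without extra packages.

```r
# Hypothetical helper: within one couple, recode the missing codes to NA and,
# if exactly one distinct informative answer remains, copy it to the partner.
# If the partners disagree, or nobody answered, nothing is filled in.
fill_from_partner <- function(x, missing_codes = "Not eligible") {
  x <- as.character(x)
  x[x %in% missing_codes] <- NA_character_
  known <- unique(x[!is.na(x)])
  if (length(known) == 1) x[is.na(x)] <- known
  x
}

# Cut-down stand-in for dta; couple 1001171 is invented for illustration.
couples <- data.frame(
  hserial  = c(1001061, 1001061, 1001151, 1001151, 1001171, 1001171),
  sex      = rep(c("Male", "Female"), times = 3),
  children = c("No", "No", "No", "Not eligible", "Not eligible", "Not eligible"),
  stringsAsFactors = FALSE
)

# ave() applies the helper separately to each hserial group.
filled <- couples
filled$children <- ave(couples$children, couples$hserial, FUN = fill_from_partner)
print(filled)
```

With dplyr, the helper should slot into the pipeline above as mutate(children = fill_from_partner(children)) after group_by(hserial). Unlike the setdiff version, it leaves a couple untouched when the two partners give different answers instead of assuming they always agree.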