R - Select pairwise cases based on reduction of missing data
I'm trying to figure out how to subset my data frame based on the best combination of missing values.
My data looks like this:
Country.Name X2010.x X2011.x X2012.x X2010.y X2011.y X2012.y
20 Belarus 15080 16410 16800 27.72 26.46 NA
21 Belgium 38810 40210 39870 NA NA NA
22 Belize 7720 7940 8170 NA NA NA
23 Benin 1590 1640 1710 NA NA 43.53
24 Bermuda 69340 66640 66390 NA NA NA
25 Bhutan 6140 6680 6960 NA NA 38.73
...............................................................
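Each year's .x column has to be selected together with the same year's .y column. If either .x or .y is missing, I can't use that pair.
In the end I need a data set with no NA values. It doesn't matter which year is chosen for each country, as long as .x and .y come from the same year.
If I look at the distribution of missing values across the .x and .y columns, I can see that X2011 would be the best combination.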
colSums(is.na(data))
Country.Name X2010.x X2011.x X2012.x X2010.y X2011.y X2012.y
0 3 3 3 21 19 22
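But I think that is the best combination overall, not for each individual country.
I just need to keep as many countries as possible in the data.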
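To make the difference concrete, here is a rough sketch in base R on a four-country excerpt of the data above. The names toy, years and complete are made up for this illustration. It counts the complete .x/.y pairs per year, then the countries that have a complete pair in at least one year.

```r
# Four rows copied from the data set in this post (illustration only)
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia"),
  X2010.x = c(15080, 38810, 1590, 4950),
  X2011.x = c(16410, 40210, 1640, 5200),
  X2012.x = c(16800, 39870, 1710, 5400),
  X2010.y = c(27.72, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26),
  X2012.y = c(NA, NA, 43.53, 46.64)
)

years <- c("2010", "2011", "2012")

# Logical matrix: TRUE where both .x and .y are present for that year
complete <- sapply(years, function(yr) {
  !is.na(toy[[paste0("X", yr, ".x")]]) & !is.na(toy[[paste0("X", yr, ".y")]])
})
rownames(complete) <- toy$Country.Name

# Countries kept if one single year is used for everybody
pairs_per_year <- colSums(complete)

# Countries kept if each country may use its own year
countries_with_pair <- sum(rowSums(complete) > 0)
```

On this excerpt the best single year keeps two countries, while a per-country choice of year keeps three.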
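How can I select countries, depending on which values are missing for each one, so that as many as possible are kept?
Does that make sense?
Any suggestions?
A possible, but not optimal, result: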
Country.Name .x .y
20 Belarus 15080 27.72
31 Bulgaria 13950 35.78
35 Cambodia 2350 33.55
37 Canada 39200 33.68
45 China 9010 42.06
# Code used for the result above (2010 pair only)
data = select(data, Country.Name, X2010.x, X2010.y)
data = na.omit(data)
Data set:
data <- structure(list(Country.Name = c("Belarus", "Belgium", "Belize",
"Benin", "Bermuda", "Bhutan", "Bolivia", "Bosnia and Herzegovina",
"Botswana", "Brazil", "Brunei Darussalam", "Bulgaria", "Burkina Faso",
"Burundi", "Cabo Verde", "Cambodia", "Cameroon", "Canada", "Caribbean small states",
"Cayman Islands", "Central African Republic", "Central Europe and the Baltics",
"Chad", "Channel Islands", "Chile", "China"), X2010.x = c(15080,
38810, 7720, 1590, 69340, 6140, 4950, 8860, 12500, 13520, NA,
13950, 1390, 710, 5630, 2350, 2390, 39200, 13141.13583, NA, 880,
19213.13055, 1850, NA, 17010, 9010), X2011.x = c(16410, 40210,
7940, 1640, 66640, 6680, 5200, 9310, 13930, 14030, NA, 14790,
1430, 730, 5960, 2530, 2470, 40570, 12973.98051, NA, 910, 20391.27796,
1850, NA, 19040, 9940), X2012.x = c(16800, 39870, 8170, 1710,
66390, 6960, 5400, 9290, 14630, 14350, NA, 15250, 1550, 750,
6220, 2710, 2550, 41170, 13245.52928, NA, 950, 20765.62768, 1930,
NA, 20140, 10890), X2010.y = c(27.72, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 35.78, NA, NA, NA, 33.55, NA, 33.68, NA, NA,
NA, NA, NA, NA, NA, 42.06), X2011.y = c(26.46, NA, NA, NA, NA,
NA, 46.26, NA, NA, 53.09, NA, 34.28, NA, NA, NA, 31.82, NA, NA,
NA, NA, NA, NA, 43.3, NA, 50.84, NA), X2012.y = c(NA, NA, NA,
43.53, NA, 38.73, 46.64, NA, NA, 52.67, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Country.Name",
"X2010.x", "X2011.x", "X2012.x", "X2010.y", "X2011.y", "X2012.y"
), row.names = 20:45, class = "data.frame")
Here is a solution using dplyr and tidyr:
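library(dplyr)
library(tidyr)
data %>%
  gather(YearXY, Value, -Country.Name, na.rm = TRUE) %>%  # long format, NAs dropped
  separate(YearXY, c("Year", "XY")) %>%                   # "X2010.x" -> "X2010", "x"
  spread(XY, Value) %>%                                   # one column each for x and y
  filter(!is.na(x) & !is.na(y)) %>%                       # keep complete pairs only
  group_by(Country.Name) %>%
  slice(1)                                                # first complete year per country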
Note that it drops countries that have no year with both x and y.
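If it helps to see which countries get dropped, the sketch below runs the same idea on a four-country excerpt of the question's data and lists the ones left out. It is not the pipeline above: it assumes tidyr 1.0.0 or later and uses pivot_longer() in place of gather()/separate()/spread(). The names toy, first_pair and dropped are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Four rows copied from the question's data set (illustration only)
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia"),
  X2010.x = c(15080, 38810, 1590, 4950),
  X2011.x = c(16410, 40210, 1640, 5200),
  X2012.x = c(16800, 39870, 1710, 5400),
  X2010.y = c(27.72, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26),
  X2012.y = c(NA, NA, 43.53, 46.64)
)

first_pair <- toy %>%
  # "X2010.x" -> Year = "2010", value goes into column x (likewise for y)
  pivot_longer(-Country.Name,
               names_to = c("Year", ".value"),
               names_pattern = "X(\\d+)\\.(.)") %>%
  filter(!is.na(x), !is.na(y)) %>%   # complete pairs only
  group_by(Country.Name) %>%
  slice(1) %>%                       # earliest complete year per country
  ungroup()

# Countries with no complete pair in any year
dropped <- setdiff(toy$Country.Name, first_pair$Country.Name)
```

Here first_pair has one row each for Belarus, Benin and Bolivia, and dropped contains only Belgium, which has no .y value in any year.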
For a random year, replace slice(1) with:
mutate(Random = sample(n())) %>%
filter(Random == 1) %>%
select(-Random)
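As a possible shorter alternative, dplyr 1.0.0 and later provide slice_sample(), which draws rows within each group directly. The sketch below assumes that version, reuses a four-country excerpt of the question's data, and reshapes with pivot_longer() (tidyr 1.0.0 or later); toy and random_pair are names made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Four rows copied from the question's data set (illustration only)
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia"),
  X2010.x = c(15080, 38810, 1590, 4950),
  X2011.x = c(16410, 40210, 1640, 5200),
  X2012.x = c(16800, 39870, 1710, 5400),
  X2010.y = c(27.72, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26),
  X2012.y = c(NA, NA, 43.53, 46.64)
)

set.seed(1)  # only so the draw is repeatable

random_pair <- toy %>%
  pivot_longer(-Country.Name,
               names_to = c("Year", ".value"),
               names_pattern = "X(\\d+)\\.(.)") %>%
  filter(!is.na(x), !is.na(y)) %>%   # complete pairs only
  group_by(Country.Name) %>%
  slice_sample(n = 1) %>%            # one random complete year per country
  ungroup()
```

Countries with a single complete year, such as Belarus here, always return that year; only countries with several complete years vary between runs.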