R - Select pairwise cases based on reduction of missing data

I am trying to figure out how to subset my data set based on the best combination of missing values.

My data looks like this:

   Country.Name X2010.x X2011.x X2012.x X2010.y X2011.y X2012.y
20      Belarus   15080   16410   16800   27.72   26.46      NA
21      Belgium   38810   40210   39870      NA      NA      NA
22       Belize    7720    7940    8170      NA      NA      NA
23        Benin    1590    1640    1710      NA      NA   43.53
24      Bermuda   69340   66640   66390      NA      NA      NA
25       Bhutan    6140    6680    6960      NA      NA   38.73
 ...............................................................

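Each year's .x column has to be selected together with the same year's .y column. If either the .x or the .y value is missing, I cannot use that pair.

In the end, I need a data set with no NAs. It does not matter which year is chosen for each country, but .x and .y must come from the same year.

If I look at how the missing values are distributed across the .x and .y columns, X2011 looks like the best combination.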

colSums(is.na(data)) 
Country.Name      X2010.x      X2011.x      X2012.x      X2010.y      X2011.y      X2012.y 
       0            3            3            3           21           19           22 

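But I think that is the best combination overall, not the best one for each individual country.

I just need to keep as many countries as possible in the data.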
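
The difference between the two goals can be made concrete by counting complete .x/.y pairs per year and per country. The following is only a sketch in base R: `toy` is a five-row subset of the data shown in this post, and the names `toy`, `years`, and `pair_ok` are made up for illustration.

```r
# Toy subset: five rows taken from the data in this post
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia", "Brazil"),
  X2010.x = c(15080, 38810, 1590, 4950, 13520),
  X2011.x = c(16410, 40210, 1640, 5200, 14030),
  X2012.x = c(16800, 39870, 1710, 5400, 14350),
  X2010.y = c(27.72, NA, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26, 53.09),
  X2012.y = c(NA, NA, 43.53, 46.64, 52.67)
)

years <- c("X2010", "X2011", "X2012")

# One column per year: TRUE where both the .x and the .y value are present
pair_ok <- sapply(years, function(yr) {
  !is.na(toy[[paste0(yr, ".x")]]) & !is.na(toy[[paste0(yr, ".y")]])
})
rownames(pair_ok) <- toy$Country.Name

# Countries kept if one single year is chosen for everyone
colSums(pair_ok)

# Countries kept if each country may use any year with a complete pair
sum(rowSums(pair_ok) > 0)
```

On this subset, the best single year keeps three countries, while letting the year vary by country keeps four.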

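How can I keep the maximum number of countries, given each country's own pattern of missing values? Does my question make sense?

Any suggestions?

A possible, but not optimal, result: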

   Country.Name     .x     .y
20      Belarus   15080   27.72
31     Bulgaria   13950   35.78
35     Cambodia    2350   33.55
37       Canada   39200   33.68
45        China    9010   42.06

 # Non-optimal approach: keep only the 2010 pair, then drop incomplete rows
 library(dplyr)
 data = select(data, Country.Name, X2010.x, X2010.y)
 data = na.omit(data)

Data set:

data <- structure(list(Country.Name = c("Belarus", "Belgium", "Belize", 
  "Benin", "Bermuda", "Bhutan", "Bolivia", "Bosnia and Herzegovina", 
  "Botswana", "Brazil", "Brunei Darussalam", "Bulgaria", "Burkina Faso", 
  "Burundi", "Cabo Verde", "Cambodia", "Cameroon", "Canada", "Caribbean small states", 
  "Cayman Islands", "Central African Republic", "Central Europe and the Baltics", 
  "Chad", "Channel Islands", "Chile", "China"), X2010.x = c(15080, 
  38810, 7720, 1590, 69340, 6140, 4950, 8860, 12500, 13520, NA, 
  13950, 1390, 710, 5630, 2350, 2390, 39200, 13141.13583, NA, 880, 
  19213.13055, 1850, NA, 17010, 9010), X2011.x = c(16410, 40210, 
  7940, 1640, 66640, 6680, 5200, 9310, 13930, 14030, NA, 14790, 
  1430, 730, 5960, 2530, 2470, 40570, 12973.98051, NA, 910, 20391.27796, 
  1850, NA, 19040, 9940), X2012.x = c(16800, 39870, 8170, 1710, 
  66390, 6960, 5400, 9290, 14630, 14350, NA, 15250, 1550, 750, 
  6220, 2710, 2550, 41170, 13245.52928, NA, 950, 20765.62768, 1930, 
  NA, 20140, 10890), X2010.y = c(27.72, NA, NA, NA, NA, NA, NA, 
  NA, NA, NA, NA, 35.78, NA, NA, NA, 33.55, NA, 33.68, NA, NA, 
  NA, NA, NA, NA, NA, 42.06), X2011.y = c(26.46, NA, NA, NA, NA, 
  NA, 46.26, NA, NA, 53.09, NA, 34.28, NA, NA, NA, 31.82, NA, NA, 
  NA, NA, NA, NA, 43.3, NA, 50.84, NA), X2012.y = c(NA, NA, NA, 
  43.53, NA, 38.73, 46.64, NA, NA, 52.67, NA, NA, NA, NA, NA, NA, 
  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Country.Name", 
  "X2010.x", "X2011.x", "X2012.x", "X2010.y", "X2011.y", "X2012.y"
  ), row.names = 20:45, class = "data.frame")

Here is a solution using dplyr and tidyr:

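library(dplyr)
library(tidyr)

data %>%
  # long format, dropping missing values
  gather(YearXY, Value, -Country.Name, na.rm = TRUE) %>%
  # split e.g. "X2010.x" into Year = "X2010" and XY = "x"
  separate(YearXY, c("Year", "XY")) %>%
  # keep only the years where both x and y are present
  spread(XY, Value) %>% filter(!is.na(x) & !is.na(y)) %>%
  group_by(Country.Name) %>%
  slice(1)  # first complete year for each country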

Note that this omits any country that has no year in which both x and y are present.
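
To see which countries get omitted, the picked rows can be compared against the original country list. This is a sketch under assumptions: it uses a five-row subset of the question's data (`toy`), needs tidyr 1.0.0 or later for `pivot_longer()`, and the names `long`, `picked`, and `dropped` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy subset: five rows taken from the question's data
toy <- data.frame(
  Country.Name = c("Belarus", "Belgium", "Benin", "Bolivia", "Brazil"),
  X2010.x = c(15080, 38810, 1590, 4950, 13520),
  X2011.x = c(16410, 40210, 1640, 5200, 14030),
  X2012.x = c(16800, 39870, 1710, 5400, 14350),
  X2010.y = c(27.72, NA, NA, NA, NA),
  X2011.y = c(26.46, NA, NA, 46.26, 53.09),
  X2012.y = c(NA, NA, 43.53, 46.64, 52.67)
)

# Long format: one row per country and year, with columns x and y
long <- toy %>%
  pivot_longer(-Country.Name,
               names_to = c("Year", ".value"),
               names_sep = "\\.")

# First year with a complete pair for each country
picked <- long %>%
  filter(!is.na(x), !is.na(y)) %>%
  group_by(Country.Name) %>%
  slice(1) %>%
  ungroup()

# Countries that have no year with both x and y
dropped <- setdiff(toy$Country.Name, picked$Country.Name)
dropped
```

In this subset only Belgium is dropped, since it has no .y value in any year.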

To pick a random year instead, replace slice(1) with:

mutate(Random = sample(n())) %>%
  filter(Random == 1) %>%
  select(-Random)