Identifying matching observations in dyadic data in R

Question

Hi everyone,

I am struggling with the following problem. Currently, my dataset looks like this:

    
   living_in    from         Year    stock
   Austria      Australia    2014     2513
   Austria      Australia    2013     2000
   Germany      Austria      2010     6000
   Australia    Austria      2014     3000
   Austria      Australia    1993       NA

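Now I want to identify all observations that satisfy the following conditions:

They should be from the same year
They should involve the same country pair in that year
They should not contain NA

For example, I want to find all observations where both directions of a country pair, such as Austria-Australia and Australia-Austria, have a value in the same year. This is because some combinations in a given year have only one stock value in the dataset instead of two. I want to remove those.

What is the best way to proceed here? Many thanks!

P.S. There are about 14 country pairs in my dataset that need this kind of identification.

A useful output might look like this: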

    
   living_in    from         Year    stock    dummy
   Austria      Australia    2014     2513        1
   Austria      Australia    2013     2000        0
   Germany      Austria      2010     6000        0
   Australia    Austria      2014     3000        1
   Austria      Australia    1993       NA        0

Answer 1

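For each combination of countries, regardless of order (A-B is the same as B-A), assign 1 if the combination has more than one row for the same Year and all of its stock values are non-NA; otherwise assign 0.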

library(dplyr)

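df %>%
  # Build an order-independent pair key: (A, B) and (B, A) get the same col1/col2
  group_by(col1 = pmin(living_in, from), col2 = pmax(living_in, from), Year) %>%
  # dummy is 1 only if the pair-year has more than one row and no missing stock
  mutate(dummy = as.integer(n() > 1 && all(!is.na(stock)))) %>%
  ungroup() %>%
  # Drop the helper key columns
  select(-col1, -col2)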

#  living_in from       Year stock dummy
#  <chr>     <chr>     <int> <int> <int>
#1 Austria   Australia  2014  2513     1
#2 Austria   Australia  2013  2000     0
#3 Germany   Austria    2010  6000     0
#4 Australia Austria    2014  3000     1
#5 Austria   Australia  1993    NA     0
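
The grouping relies on pmin() and pmax() returning the alphabetically smaller and larger country name for each row, so both directions of a pair end up with the same key. A minimal base R sketch of that idea, using a made-up two-row example (the vectors and the pair_key name are illustrative, not part of the original data):

```r
# Hypothetical example: the same country pair recorded in both directions
living_in <- c("Austria", "Australia")
from      <- c("Australia", "Austria")

# pmin()/pmax() work element-wise; for character vectors they compare strings
key1 <- pmin(living_in, from)  # alphabetically smaller name in each row
key2 <- pmax(living_in, from)  # alphabetically larger name in each row

# Both rows now share one order-independent key
pair_key <- paste(key1, key2, sep = "-")
print(pair_key)
```

Both elements of pair_key should be identical, which is why grouping on the two helper columns treats A-B and B-A as one pair.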

Data

df <- structure(list(living_in = c("Austria", "Austria", "Germany", 
"Australia", "Austria"), from = c("Australia", "Australia", "Austria", 
"Austria", "Australia"), Year = c(2014L, 2013L, 2010L, 2014L, 
1993L), stock = c(2513L, 2000L, 6000L, 3000L, NA)), 
class = "data.frame", row.names = c(NA, -5L))
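
Since the question ultimately asks to remove the unmatched rows, here is a rough base R sketch that applies the same rule as a filter instead of creating a dummy column. It rebuilds the sample data so it runs on its own; the helper names key, n_rows, has_na and matched are illustrative assumptions, not part of the original answer.

```r
# Sample data, same values as in the question
df <- data.frame(
  living_in = c("Austria", "Austria", "Germany", "Australia", "Austria"),
  from      = c("Australia", "Australia", "Austria", "Austria", "Australia"),
  Year      = c(2014L, 2013L, 2010L, 2014L, 1993L),
  stock     = c(2513L, 2000L, 6000L, 3000L, NA),
  stringsAsFactors = FALSE
)

# Order-independent pair key combined with the year
key <- paste(pmin(df$living_in, df$from), pmax(df$living_in, df$from),
             df$Year, sep = "|")

# Number of rows in each pair-year group, repeated for every row of the group
n_rows <- ave(seq_along(key), key, FUN = length)

# TRUE for every row of a group in which at least one stock value is missing
has_na <- ave(as.integer(is.na(df$stock)), key, FUN = max) > 0

# Keep only pair-years with more than one row and no missing stock
matched <- df[n_rows > 1 & !has_na, ]
print(matched)
```

Note that, like the dummy in the answer, this only checks that a pair-year has more than one row. If the same direction could appear twice in one year, a stricter check (for example, requiring two distinct living_in values per group) may be needed.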
