Data matching / data selection with multiple conditions in a long-shaped database in R

Question

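I have been struggling with this problem for a while. It is a fairly complex data selection with several possible outcomes, and I cannot find the expression that gives me what I want. I am measuring the divorce rate in a population of birds.

Reproducible data: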

nest <- rep(1:10, 2)
year <- c(rep(2014, 10), rep(2015, 10))
pair <- c("TH4327_TH4317", "2", "TH8522_T75390", "4", "TJ1704_TJ1703", "TH4335_TH4333",
          "7", "8", "TH4337_TH4323", "T74703_TH1797",
          "TH4327_TH4317", "12", "TH8522_T75550", "14", "TJ1704_NA", "TH4335_TH4333",
          "17", "TH8715_TH8714", "TH4388_TH4323", "TE9639_TH9675")
test<- data.frame(nest, year, pair)
test$pair <- as.character(test$pair)
test$year <- as.character(test$year)

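The underscore separates the IDs of the 2 members of a pair. When there is no ID, an increasing number is used as a placeholder. The same nests are listed for each year. Across the 2 consecutive years we have 5 possible situations (the numbers are nest IDs):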

SAME PAIR 2014-2015: 1-6

EMPTY 2014-2015: 2-4-7

EMPTY 2014 but OCCUPIED 2015: 8

CHANGE OF PAIRS IN THE SAME NEST: 10

CHANGE OF ONE OF THE MEMBER OF THE PAIR: 3-9

UNKNOWN: 5

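The result I am after is:

Pairs that stayed together ("SAME PAIR 2014-2015"): 2
Pairs in which one member changed ("CHANGE OF ONE OF THE MEMBER OF THE PAIR"): 2

I figured out how to count the pairs that stayed together...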

same<-test$pair[test$year=="2014"] %in% test$pair[test$year=="2015"]
table(same)

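But I cannot get the information on the divorces.

I tried several commands, which and ifelse, without success.

If anything is unclear, I am happy to explain further. I know this is a rather confusing question.

Thank you very much, and all the best.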

玩得开心

Answer 1

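Here is an approach that uses merging. The strategy is as follows. First split the pairs into p1 and p2 (I did this with tidyr::separate). Then subset the data by year and merge the subsets using p1 as the unique identifier. This gives two different p2 columns, one for 2014 and one for 2015. It is then straightforward to test whether each pair stayed together or divorced.

If you have many years, this approach will need to be generalized. I am happy to provide such a generalization if needed.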

library(tidyr)
library(dplyr) #needed for %>%, filter() and select()

test <- 
test %>%
  filter(nchar(pair) > 3) %>% #getting rid of missing pairs
  separate(pair, c("p1", "p2"), "_") %>% #splitting pairs
  select(-nest) #getting rid of nest which is superfluous

test <- merge(test[test$year=="2014",], test[test$year=="2015",], by = "p1", all = TRUE)

#Same group across 2014 and 2015
na.omit(test[test$p2.x == test$p2.y, grep("p", names(test))])

#Different Group across 2014 and 2015
na.omit(test[test$p2.x != test$p2.y, grep("p", names(test))])
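
As a self-contained illustration of the merge-on-p1 strategy, here is a minimal base R sketch run on the question's example data. It is not the code above: the names `birds`, `both`, `same_pair` and `new_partner` are made up for this example, and it assumes that a literal "NA" partner (as in "TJ1704_NA") means the partner is unknown and should be left out of both counts.

```r
# Rebuild the question's example data (nest is left out, as in the answer).
pair <- c("TH4327_TH4317", "2", "TH8522_T75390", "4", "TJ1704_TJ1703",
          "TH4335_TH4333", "7", "8", "TH4337_TH4323", "T74703_TH1797",
          "TH4327_TH4317", "12", "TH8522_T75550", "14", "TJ1704_NA",
          "TH4335_TH4333", "17", "TH8715_TH8714", "TH4388_TH4323",
          "TE9639_TH9675")
birds <- data.frame(year = rep(c("2014", "2015"), each = 10), pair = pair,
                    stringsAsFactors = FALSE)

# Keep only rows holding a real pair (placeholder numbers have no underscore).
birds <- birds[grepl("_", birds$pair), ]

# Split each pair into its two members.
members <- do.call(rbind, strsplit(birds$pair, "_", fixed = TRUE))
birds$p1 <- members[, 1]
birds$p2 <- members[, 2]

# Assumption: a literal "NA" partner means the partner is unknown.
birds$p2[birds$p2 == "NA"] <- NA

# Inner merge on p1: one row per first-listed bird seen in both years.
both <- merge(birds[birds$year == "2014", c("p1", "p2")],
              birds[birds$year == "2015", c("p1", "p2")],
              by = "p1", suffixes = c("_2014", "_2015"))

# which() drops comparisons that are NA because a partner is unknown.
same_pair   <- both[which(both$p2_2014 == both$p2_2015), ]
new_partner <- both[which(both$p2_2014 != both$p2_2015), ]

nrow(same_pair)    # 2 (p1 = TH4327 and TH4335)
nrow(new_partner)  # 1 (p1 = TH8522)
```

On this data the sketch finds the two unchanged pairs but only one of the two partner changes: in nest 9 it is the first-listed bird that changed, so merging on p1 alone cannot match the two rows.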

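Update

To generalize the code to many years, use the following. It is a better approach than a loop. Also note that the code above needs the dplyr library, which I originally forgot to include. Be sure to install and load both dplyr and tidyr. These libraries are great for data manipulation. Here are some resources on tidyr and dplyr. Let me know if you have any other questions.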

library(tidyr)
library(dplyr)

test <- 
test %>%
  filter(nchar(test$pair) > 3) %>% #getting rid of missing pairs
  separate(pair, c("p1", "p2"), "_") %>% #splitting pairs
  select(-nest) #getting rid of nest which is superfluous 

test <- split(test, test$year) #split data into lists by year
#this Map call can be omitted. It simply ensures that your final data set looks nice.
test <- Map(function(d, n) {
  names(d)[grepl("p2", names(d))] <- paste("p2", n, sep = "_")
  d
}, d = test, n = names(test))
test <- Reduce(function(...) merge(..., by = "p1", all = TRUE), test)
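
For many years it can also help to compare each year only with the one that follows it. The sketch below is a different technique from the Reduce/merge call above, shown only as an illustration: the table `pairs_long` and its bird IDs are invented, `following`, `transitions` and `status` are hypothetical names, and year is assumed to be numeric (the question stores it as character, so it would need `as.numeric` first).

```r
# Hypothetical three-year table of (year, p1, p2); the IDs are made up.
pairs_long <- data.frame(
  year = c(2014, 2014, 2015, 2015, 2016, 2016),
  p1   = c("A1", "B1", "A1", "B1", "A1", "B1"),
  p2   = c("A2", "B2", "A2", "B3", "A4", "B3"),
  stringsAsFactors = FALSE
)

# Copy the table and shift each year back by one, so that the row for
# year t + 1 lines up with the row for year t of the same p1.
following <- pairs_long
following$year <- following$year - 1
names(following)[names(following) == "p2"] <- "p2_next"

# One row per p1 and per pair of consecutive years in which it was recorded.
transitions <- merge(pairs_long, following, by = c("year", "p1"))
transitions$status <- ifelse(transitions$p2 == transitions$p2_next,
                             "same pair", "new partner")
transitions
```

Each row of `transitions` describes what happened to one p1 bird between `year` and the next year, so the year of a partner change can be read off directly. Like the merge on p1, this only follows first-listed birds; the same steps with p1 and p2 swapped would cover the second member.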

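Without packages (i.e. in base R)

If you do not want to use the dplyr and tidyr packages, you can replace the first few lines of code (up to the call to split) with this base R approach: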

test <- test[nchar(test$pair) > 3, !names(test)%in%"nest"]

split_pair <- do.call(rbind, strsplit(test$pair, "_"))

test$p1 <- split_pair[, 1]
test$p2 <- split_pair[, 2]
test <- test[, !names(test)%in%"pair"]

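Final update... hopefully

The asker raised a good point in the comments below. Because I was using p1 as the unique identifier, there was no way to identify when a p2 bird changes partner. To overcome this, I did the following...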

test <- split(test, test$year) #split data into lists by year

#merge on both p1 and p2 to overcome the previous problem. Pairs are now unique identifiers
test <- Reduce(function(...) merge(..., by = c("p1", "p2"), all = TRUE), test)

#Stayed in same relationship
stay <- test$year.x == "2014" & test$year.y == "2015"
na.omit(test[stay, ])

#p1 changes couples between year.x and year.y
tp1 <- test[test$p1 %in% test[duplicated(test$p1), "p1"], c("p1", "p2", "year.x", "year.y")]
is_na <- (is.na(tp1$year.x) & is.na(tp1$year.y))
stay_tp1 <- tp1$year.x == "2014" & tp1$year.y == "2015"
stay_tp1[is.na(stay_tp1)] <- FALSE
tp1 <- tp1[!(stay_tp1 | is_na), ]

#A similar approach works for p2.  Notice it is probably best to do this in a function.  If you do use a function remember you will need to pass your variables as strings, unless you want to use NSE.

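The last piece of code may be a little confusing, so let me explain. To determine whether a bird changed partners, we look for duplicates, because a bird that moves from one pair to another appears twice. With many years of data, however, a bird can change pairs in any of those years. To pin down the year in which a bird changed, you will need the code above. I suggest writing a function to handle this, since quite a few inputs are involved.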
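
One possible shape for such a function, as a rough sketch rather than the answer's own code: `partner_changes` is a hypothetical helper that takes the column names as strings and, instead of `duplicated`, counts the distinct known partners of each bird. It assumes a long table with columns year, p1 and p2, and that a literal "NA" partner means the partner is unknown.

```r
# Hypothetical helper: rows for every bird in column `bird` that was
# recorded with more than one distinct known partner.
partner_changes <- function(d, bird, partner) {
  known <- d[!is.na(d[[partner]]), ]
  n_partners <- tapply(known[[partner]], known[[bird]],
                       function(x) length(unique(x)))
  changed <- names(n_partners)[n_partners > 1]
  out <- known[known[[bird]] %in% changed, c("year", bird, partner)]
  out[order(out[[bird]], out$year), ]
}

# The question's pairs, with placeholder numbers removed and members split.
pair <- c("TH4327_TH4317", "TH8522_T75390", "TJ1704_TJ1703", "TH4335_TH4333",
          "TH4337_TH4323", "T74703_TH1797",
          "TH4327_TH4317", "TH8522_T75550", "TJ1704_NA", "TH4335_TH4333",
          "TH8715_TH8714", "TH4388_TH4323", "TE9639_TH9675")
birds <- data.frame(year = c(rep("2014", 6), rep("2015", 7)),
                    stringsAsFactors = FALSE)
members <- do.call(rbind, strsplit(pair, "_", fixed = TRUE))
birds$p1 <- members[, 1]
birds$p2 <- members[, 2]
birds$p2[birds$p2 == "NA"] <- NA  # assumption: "NA" marks an unknown partner

by_p1 <- partner_changes(birds, "p1", "p2")  # TH8522 (nest 3)
by_p2 <- partner_changes(birds, "p2", "p1")  # TH4323 (nest 9)
```

On the example data the two calls return one bird each, which matches the two partner changes the question expects (nests 3 and 9). The year column of the result shows when each partner was recorded.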
