Diff on each subset of a data frame column
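I have a data frame with ID, year, and month columns. I need to group by year and month and get the unique IDs in each group. I then want to compare each group's unique IDs with those of the previous year/month group and count how many IDs were added and how many were removed.
It's a bit of a shot in the dark, but I tried the following, which doesn't work: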
connections <- df %>%
  group_by(year, month) %>%
  arrange(year, month) %>%
  diff_data(unique(as.vector(~ID)), lag(unique(as.vector(~ID))))
Sample data
df <- data.frame(ID    = c("A1", "A2", "A3", "A1", "A2", "A4", "A1", "A4", "A5"),
                 year  = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012),
                 month = c(1, 2, 3, 1, 2, 3, 1, 2, 3))
Desired Output
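One approach is to first run aggregate over year and month. The code then lists, for each month, the IDs added and deleted relative to the same month of the previous year, and takes length of each list to count how many were added and deleted.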
library(tidyverse)

df %>%
  # One row per (year, month) holding that period's unique IDs
  aggregate(ID ~ year + month, ., unique, drop = FALSE) %>%
  # Within each month, order by year so lag() is the same month one year earlier
  group_by(month) %>%
  arrange(year) %>%
  mutate(addedID       = mapply(setdiff, ID, lag(ID), SIMPLIFY = FALSE),
         num_addedID   = lapply(addedID, length),
         deletedID     = mapply(setdiff, lag(ID), ID, SIMPLIFY = FALSE),
         # lag() yields NA for the first year; drop it before counting
         num_deletedID = lapply(deletedID, function(x) length(na.omit(x)))) %>%
  ungroup() %>%
  arrange(month, year) %>%
  as.data.frame()
Output
  year month ID addedID num_addedID deletedID num_deletedID
1 2010     1 A1      A1           1        NA             0
2 2011     1 A1                   0                       0
3 2012     1 A1                   0                       0
4 2010     2 A3      A3           1        NA             0
5 2011     2 A2      A2           1        A3             1
6 2012     2 A4      A4           1        A2             1
7 2010     3 A3      A3           1        NA             0
8 2011     3 A4      A4           1        A3             1
9 2012     3 A5      A5           1        A4             1
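The NA shown under deletedID for the 2010 rows comes from lag(), which has no earlier year to return. A small sketch of why the code wraps the deleted count in na.omit (the variable names prev, curr, added, and deleted are only illustrative):

```r
prev <- NA_character_   # what lag(ID) yields for the first year of a month
curr <- "A1"

added   <- setdiff(curr, prev)   # "A1": counts as newly added
deleted <- setdiff(prev, curr)   # NA: an artifact of lag(), not a real ID

length(deleted)            # 1, which would overcount deletions
length(na.omit(deleted))   # 0, matching num_deletedID for the 2010 rows
```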
Data
df <- data.frame(ID    = c("A1", "A3", "A3", "A1", "A2", "A4", "A1", "A4", "A5"),
                 year  = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012),
                 month = c(1, 2, 3, 1, 2, 3, 1, 2, 3))
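If the comparison should instead run against the chronologically previous year/month group, and a group may hold several IDs, something like the following base R sketch might work. It is not the method used above: period_diff is a hypothetical helper, toy is made-up data with three IDs per month, and the sketch assumes a non-empty data frame and treats every ID in the first period as added.

```r
# Hypothetical helper: counts of IDs added/deleted versus the previous period.
period_diff <- function(df) {
  # Sort chronologically and build one key per (year, month) period
  df  <- df[order(df$year, df$month), ]
  key <- paste(df$year, df$month, sep = "-")
  key <- factor(key, levels = unique(key))   # keep chronological order

  # Unique IDs per period, as a list of character vectors
  ids <- lapply(split(df$ID, key), unique)

  # IDs of the preceding period; the first period has no predecessor
  prev <- c(list(character(0)), ids[-length(ids)])

  added   <- Map(setdiff, ids, prev)   # in current period, not in previous
  deleted <- Map(setdiff, prev, ids)   # in previous period, not in current

  data.frame(period      = names(ids),
             num_added   = unname(lengths(added)),
             num_deleted = unname(lengths(deleted)))
}

# Made-up data: three IDs in each of three consecutive months
toy <- data.frame(ID    = c("A1", "A2", "A3", "A1", "A2", "A4", "A1", "A4", "A5"),
                  year  = 2010,
                  month = rep(1:3, each = 3))

res <- period_diff(toy)
res
```

Here the first month reports three added IDs, and each later month reports one added and one deleted. Whether the first period should count as "all added" or as "no change" depends on what the counts are meant to show.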