Check sequences by id on a large data set in R
I need to check whether the year values in a large data set are consecutive within each id.
The data look like this:
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)  # ids, as shown in the printed data below
b <- c(2011,2012,2010, 2009:2011, 2013,2015,2017, 2010,2010, 2011)
dat <- data.frame(cbind(a,b))
dat
a b
1 1 2011
2 1 2012
3 1 2010
4 2 2009
5 2 2010
6 2 2011
7 3 2013
8 3 2015
9 3 2017
10 4 2010
11 4 2010
12 5 2011
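Here is the function I wrote. It works fine on a small data set. However, the real data set is very large, with 200k ids, and it takes a very long time. What can I do to make it faster?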
seqyears <- function(id, year, idlist) {
  year <- as.numeric(year)
  year_values <- year[id == idlist]
  year_sorted <- year_values[order(year_values)]
  year_diff <- diff(year_sorted)
  answer <- unique(year_diff)
  # length 0 means there is only one value, so no diff can be computed
  if (length(answer) == 0) return("single line")
  # exactly one distinct step, and that step is 1
  if (length(answer) == 1 && answer == 1) return("sequence ok")
  "check sequence"
}
Getting the vector of values:
unlist(lapply(c(1:5), FUN=seqyears, id=dat$a, year=dat$b))
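The function above filters the full id vector once per id (`id == idlist`), so the work grows with rows times ids. A minimal base R sketch, assuming the same three labels are wanted, splits the years by id a single time with `tapply`. The helper name `label_years` is hypothetical and not part of the original code.

```r
# Sample data from the question
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
b <- c(2011, 2012, 2010, 2009:2011, 2013, 2015, 2017, 2010, 2010, 2011)
dat <- data.frame(a, b)

# Hypothetical helper: label one id's years with the question's three labels
label_years <- function(z) {
  if (length(z) < 2) return("single line")  # nothing to diff
  if (all(diff(sort(z)) == 1)) "sequence ok" else "check sequence"
}

# tapply splits b by a once, instead of filtering the full vector per id
res <- tapply(dat$b, dat$a, label_years)
res
```

The result is a named character vector with one entry per id, in sorted id order.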
I think you can do this more simply with aggregate.
aggregate(dat$b, dat[,"a",drop=FALSE], function(z) any(diff(sort(z)) != 1))
# a x
# 1 1 FALSE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 TRUE
# 5 5 FALSE
If you need the result as that string, ifelse can do that for you:
aggregate(dat$b, dat[,"a",drop=FALSE],
function(z) ifelse(any(diff(sort(z)) != 1), "check sequence", "sequence ok"))
# a x
# 1 1 sequence ok
# 2 2 sequence ok
# 3 3 check sequence
# 4 4 check sequence
# 5 5 sequence ok
If there is a chance of repeated years (and that is acceptable), you can change the inner anonymous function from diff(sort(z)) to diff(sort(unique(z))).
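As a sketch of that variant on the sample data (assuming repeated years should not count as a break), the only change is the added `unique()` call:

```r
# Sample data from the question
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
b <- c(2011, 2012, 2010, 2009:2011, 2013, 2015, 2017, 2010, 2010, 2011)
dat <- data.frame(a, b)

# Drop duplicate years within each id before taking differences
res <- aggregate(dat$b, dat[, "a", drop = FALSE],
                 function(z) any(diff(sort(unique(z))) != 1))
res
```

With this change id 4 (2010, 2010) is no longer flagged, because only one distinct year remains and there is no difference left to test.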
Using dplyr:
library(dplyr)
dat %>%
  arrange(a, b) %>%
  group_by(a) %>%
  # lag() yields NA for the first row of each group, so drop NAs in any()
  summarise(x = case_when(any(b - lag(b) != 1, na.rm = TRUE) ~ 'check sequence',
                          TRUE ~ 'sequence ok'))
This might also work:
library(dplyr)
dat %>%
group_by(a) %>%
arrange(a,b) %>%
# note: any() marks a group "YES" as soon as one adjacent pair differs by 1
summarise(consecutive_sequence = ifelse(any(abs(b - lead(b)) == 1), "YES", NA))
Output:
a consecutive_sequence
<dbl> <chr>
1 1 YES
2 2 YES
3 3 NA
4 4 NA
5 5 NA
A data.table option:
library(data.table)
setorder(setDT(dat), a, b)[, .(x = c("check sequence", "sequence ok")[1 + all(diff(b) == 1)]), a]
gives
a x
1: 1 sequence ok
2: 2 sequence ok
3: 3 check sequence
4: 4 check sequence
5: 5 sequence ok
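With 200k ids, calling a function once per group can itself become the bottleneck. A possible alternative, shown here only as a sketch in base R and assuming at least two rows of data, sorts once and compares adjacent rows without any per-group call. The variable names (`same_id`, `gap`, `bad_ids`) are made up for this example.

```r
# Sample data from the question
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
b <- c(2011, 2012, 2010, 2009:2011, 2013, 2015, 2017, 2010, 2010, 2011)
dat <- data.frame(a, b)

o  <- order(dat$a, dat$b)   # sort by id, then by year
id <- dat$a[o]
yr <- dat$b[o]
n  <- length(id)

same_id <- id[-1] == id[-n]        # adjacent rows that belong to the same id
gap     <- same_id & diff(yr) != 1 # within-id step that is not exactly 1
bad_ids <- unique(id[-1][gap])     # ids with at least one such step

ids <- unique(id)
res <- data.frame(
  a = ids,
  x = ifelse(ids %in% bad_ids, "check sequence", "sequence ok")
)
res
```

Like the grouped versions, this labels an id with a single row as "sequence ok", since it has no within-id step to fail.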