Count Number of Rows in a Dataframe that Match Dynamic Conditions

I have a data frame like this:

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)
df
    ID year
1  AB1 2005
2  AB1 2008
3  CD2 2005
4  AB3 2010
5  KK4 2007
6  AB3 2009
7  AB3 2009
8  AB1 2007
9  AB1 2000
10 CD2 2010

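I want to add an xp column containing the number of rows that have the same ID and a year value less than that of the current row. I am looking for something like: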

df$xp <- nrow( ID == "ID in current row" & year < "year in current row" )

The result should be:

    ID year xp
1  AB1 2005  1
2  AB1 2008  3
3  CD2 2005  0
4  AB3 2010  2
5  KK4 2007  0
6  AB3 2009  0
7  AB3 2009  0
8  AB1 2007  2
9  AB1 2000  0
10 CD2 2010  1

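I'm sure there are more concise base R or data.table approaches, but here is one using dplyr and tidyr. What we really want here is a "non-equi join", which dplyr did not include at the time of writing (data.table and sqldf do), so I join every row to all rows with the same ID (a Cartesian join within each ID) and then filter, which becomes less efficient on large data.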

library(dplyr);library(tidyr)

left_join(                        # join...
  df,                             # each row of df...
  df %>%                          # with each matching row of a table where...
    left_join(df, by = "ID") %>%  # each row of df is joined to all the rows with same ID
    filter(year.y < year.x) %>%   # and we only keep preceding years
    count(ID, year = year.x),     # and we count how many there are per ID and year
  by = c("ID", "year")) %>%
  replace_na(list(n=0))           # and we replace the NA's with zeroes
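
As a hedged sketch of the non-equi join mentioned above: dplyr 1.1.0 and later provide join_by(), which accepts inequality conditions, so the filter step can move into the join itself. This assumes dplyr >= 1.1.0 is installed; the names earlier, prev_year and row are illustrative and not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# Right-hand table of candidate "earlier" rows (prev_year is a made-up name).
earlier <- df %>% select(ID, prev_year = year)

result <- df %>%
  mutate(row = row_number()) %>%                               # remember the original row order
  left_join(earlier, by = join_by(ID, year > prev_year)) %>%   # same ID, strictly earlier year
  group_by(row, ID, year) %>%
  summarise(xp = sum(!is.na(prev_year)), .groups = "drop") %>% # rows without a match count as 0
  select(-row)

result
```

Grouping by the row number rather than by ID and year keeps duplicated ID/year pairs (such as the two AB3 2009 rows) as separate rows.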

Here is a data.table solution:

library(data.table)
setDT(df)
df[, xp:=sapply(1:.N, \(x) sum(year < year[x])), by=ID][]
#>      ID year xp
#>  1: AB1 2005  1
#>  2: AB1 2008  3
#>  3: CD2 2005  0
#>  4: AB3 2010  2
#>  5: KK4 2007  0
#>  6: AB3 2009  0
#>  7: AB3 2009  0
#>  8: AB1 2007  2
#>  9: AB1 2000  0
#> 10: CD2 2010  1
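
The sapply() call above compares each year with every other year in its group, so the work grows quadratically with group size. A possible alternative, offered only as a sketch, uses the fact that the number of strictly smaller values equals the "min" rank minus one. It relies on data.table's frank(); the table name dt is illustrative.

```r
library(data.table)

dt <- data.table(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# With ties.method = "min", tied years share the lowest rank,
# so rank - 1 is the count of strictly smaller years within the ID.
dt[, xp := frank(year, ties.method = "min") - 1L, by = ID]

dt[]
```

Ties matter here: the two AB3 rows with year 2009 both get rank 1 and therefore xp = 0, which matches the expected output in the question.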

Here is an approach using dplyr and purrr:

library(dplyr)
library(purrr)

df %>%
    group_by(ID) %>%
    mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))

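purrr::map_int runs the anonymous function over every element of the year column. dplyr::cur_data() returns the current group's data as a data frame (in dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick()).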

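The pseudocode in the question translates almost directly into SQL. We perform a left self-join of df on the specified conditions, then group by row and count the non-null joined-to elements.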

library(sqldf)

sqldf("select a.*, count(b.ID) xp
  from df a 
  left join df b on a.ID = b.ID and b.year < a.year
  group by a.rowid")
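
For comparison, the same condition can be written in base R without any join. This is a minimal sketch rather than a tuned solution: it evaluates the question's pseudocode once per row, which is fine for small data but scales quadratically with the number of rows.

```r
df <- data.frame(
  ID   = c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2"),
  year = c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
)

# For each row i, count the rows with the same ID and a strictly smaller year.
df$xp <- sapply(seq_len(nrow(df)), function(i) {
  sum(df$ID == df$ID[i] & df$year < df$year[i])
})

df
```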