Count Number of Rows in a Dataframe that Match Dynamic Conditions
I have a data frame like this:
ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)
df
ID year
1 AB1 2005
2 AB1 2008
3 CD2 2005
4 AB3 2010
5 KK4 2007
6 AB3 2009
7 AB3 2009
8 AB1 2007
9 AB1 2000
10 CD2 2010
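I want to add an xp column that holds, for each row, the number of rows with the same ID and a strictly smaller year. I am looking for something like this: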
df$xp <- nrow( ID == "ID in current row" & year < "year in current row" )
The result should be:
ID year xp
1 AB1 2005 1
2 AB1 2008 3
3 CD2 2005 0
4 AB3 2010 2
5 KK4 2007 0
6 AB3 2009 0
7 AB3 2009 0
8 AB1 2007 2
9 AB1 2000 0
10 CD2 2010 1
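I'm sure there are more concise base R or data.table approaches, but here is one using dplyr and tidyr. It relies on a "non-equi join", which dplyr did not support at the time of writing (data.table and sqldf do), so I'm joining every row to all rows with the same ID and then filtering, which becomes less efficient for large data.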
library(dplyr); library(tidyr)
left_join(                                  # join...
  df,                                       # each row of df...
  df %>%                                    # with a table of counts, built as follows:
    distinct(ID, year) %>%                  # one row per ID/year, so repeated rows are not double counted
    left_join(df, by = "ID") %>%            # each ID/year is joined to all the rows with the same ID
    filter(year.y < year.x) %>%             # and we only keep preceding years
    count(ID, year = year.x, name = "xp"),  # and we count how many there are per ID and year
  by = c("ID", "year")) %>%
  replace_na(list(xp = 0L))                 # and we replace the NAs with zeroes
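For readers on a newer dplyr: as far as I know, dplyr 1.1.0 and later support inequality joins through join_by(), which would avoid the join-then-filter step. This is a minimal sketch under that assumption, not part of the original answer; the helper column row and the suffix _prev are names I made up for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides join_by()

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

res <- df %>%
  mutate(row = row_number()) %>%                # hypothetical helper: keeps repeated rows apart
  left_join(df,
            by = join_by(ID, year > year),      # same ID, and the right-hand year is strictly earlier
            suffix = c("", "_prev")) %>%        # right-hand year becomes year_prev
  group_by(row, ID, year) %>%
  summarise(xp = sum(!is.na(year_prev)),        # unmatched rows have NA here and count as 0
            .groups = "drop") %>%
  select(-row)

res
```

Rows without any earlier year survive the left join with a missing year_prev, which is why the count is taken over non-missing values rather than with n().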
Here is a data.table solution:
library(data.table)
setDT(df)
df[, xp := sapply(seq_len(.N), \(x) sum(year < year[x])), by = ID][]
#> ID year xp
#> 1: AB1 2005 1
#> 2: AB1 2008 3
#> 3: CD2 2005 0
#> 4: AB3 2010 2
#> 5: KK4 2007 0
#> 6: AB3 2009 0
#> 7: AB3 2009 0
#> 8: AB1 2007 2
#> 9: AB1 2000 0
#> 10: CD2 2010 1
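The sapply() call above compares every row with every other row of its group. If that gets slow for large groups, one possible alternative is to use ranks: with ties.method = "min", the rank of a value is one more than the number of strictly smaller values. A minimal base R sketch of that idea (my own variation, not from the answer above):

```r
ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

# Within each ID, "min" rank minus 1 = number of rows with a strictly smaller year.
# Tied years share the same rank, so repeated rows get the same count.
df$xp <- ave(df$year, df$ID,
             FUN = function(y) rank(y, ties.method = "min") - 1)

df
```

Note that ave() returns a numeric column here; wrap the result in as.integer() if an integer column is preferred.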
Here is an approach using dplyr and purrr:
library(dplyr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(xp = map_int(year, function(x) sum(cur_data()$year < x)))
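purrr::map_int runs the anonymous function on every element of the year column. dplyr::cur_data() returns the current group's data as a data frame. (In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick().)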
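Inside a grouped mutate(), a bare column name already refers to the current group's values, so the same idea can probably be written without cur_data() or purrr. A minimal sketch of that variation, using base vapply() in place of map_int():

```r
library(dplyr)

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

res <- df %>%
  group_by(ID) %>%
  # `year` inside the function is the year vector of the current group
  mutate(xp = vapply(year, function(x) sum(year < x), integer(1))) %>%
  ungroup()

res
```

vapply() with integer(1) plays the same role as map_int(): it fails loudly if the function ever returns something other than a single integer.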
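The pseudocode in the question translates almost directly into SQL. We perform a left self-join of df on the specified conditions, then group by row and count the non-null joined-to elements.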
library(sqldf)
sqldf("select a.*, count(b.ID) xp
from df a
left join df b on a.ID = b.ID and b.year < a.year
group by a.rowid")
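The same count can also be phrased as a correlated subquery, which reads even closer to the pseudocode in the question ("count the rows with this ID and a smaller year"). This is a sketch of an alternative formulation, not the query from the answer above:

```r
library(sqldf)

ID <- c("AB1","AB1","CD2","AB3","KK4","AB3","AB3","AB1","AB1","CD2")
year <- c(2005,2008,2005,2010,2007,2009,2009,2007,2000,2010)
df <- data.frame(ID, year)

# For each row a, count the rows b with the same ID and a strictly smaller year
res <- sqldf("select a.*,
                     (select count(*)
                        from df b
                       where b.ID = a.ID and b.year < a.year) as xp
                from df a
               order by a.rowid")

res
```

Because count(*) over an empty set is 0, rows without any earlier year get 0 without needing a left join.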