Subset dataframe after first encounter of a specific string
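I have a data frame in the following format, and I want to subset it so that, within each project, I keep only the activities that occur before the first funding activity: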
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
df<- data.frame(project,activity)
I expect the following output:
project activity
A kickoff
B kickoff
B kickoff
C kickoff
C delivery
Any suggestions?
You can try cumsum to keep track, within each project, of whether a row occurs before or after funding:
library(dplyr)
df %>%
  group_by(project) %>%
  mutate(before.funding = cumsum(activity == "funding") == 0) %>%
  ungroup() %>%
  filter(before.funding) %>%
  select(-before.funding)
# A tibble: 5 x 2
project activity
<fctr> <fctr>
1 A kickoff
2 B kickoff
3 B kickoff
4 C kickoff
5 C delivery
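To see what the cumsum flag looks like without dplyr, here is a minimal base R sketch of the same idea. It assumes the df from the question; the use of ave() and the column name n_funding_so_far are my own choices for illustration and are not part of the answer above.

```r
# Sample data from the question
project  <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
              'funding', 'kickoff', 'kickoff', 'delivery')
df <- data.frame(project, activity, stringsAsFactors = FALSE)

# Running count of "funding" rows seen so far, restarted for each project
# (n_funding_so_far is a hypothetical column name used only for illustration)
df$n_funding_so_far <- ave(as.integer(df$activity == "funding"),
                           df$project, FUN = cumsum)

# Rows before the first funding still have a count of zero
result <- df[df$n_funding_so_far == 0, c("project", "activity")]
print(result)
```

Project C never reaches a funding row, so its count stays at zero and all of its rows are kept.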
With dplyr:
df %>%
  group_by(project) %>%
  dplyr::filter(cummin(activity != "funding") == 1)
Yields:
# project activity
# <fctr> <fctr>
# 1 A kickoff
# 2 B kickoff
# 3 B kickoff
# 4 C kickoff
# 5 C delivery
With base R:
do.call(rbind, lapply(split(df, df$project), function(x) {
  x[cummin(x$activity != "funding") == 1, ]
}))
Yields:
# project activity
# A kickoff
# B kickoff
# B kickoff
# C kickoff
# C delivery
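The cummin approach works because the running minimum of a logical vector drops to 0 at the first FALSE and can never climb back up. A small sketch on the activities of project B from the question (the variable names are only for illustration):

```r
# Activities of project B from the question
activity_b <- c('kickoff', 'kickoff', 'funding', 'kickoff')

not_funding <- activity_b != "funding"   # TRUE TRUE FALSE TRUE
keep <- cummin(not_funding) == 1         # TRUE TRUE FALSE FALSE

# Only the rows before the first "funding" survive
print(activity_b[keep])
```

The last kickoff of project B is dropped even though it is not a funding row, because it comes after the first funding.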
Hope this helps.
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
project activity
1: A kickoff
2: B kickoff
3: B kickoff
4: C kickoff
5: C delivery
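Explanation
Within each project group, we look for the index of the first occurrence of "funding" in the activity column, together with the indices of all subsequent rows: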
df[, .I[.I >= first(.I[activity == 'funding'])], by = project]
project V1
1: A 2
2: A 3
3: B 6
4: B 7
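In data.table, .I is a special symbol that holds the row positions in df. The second subsetting, .I[.I >= first(.I[activity == 'funding'])], is required because which(.I >= first(.I[activity == 'funding'])) would only return the row positions within the group rather than within df.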
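The difference between the two kinds of row positions can be checked directly. This is a small sketch assuming the sample data from the question; the names dt, global_pos and group_pos are hypothetical and used only for illustration.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
project  <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
              'funding', 'kickoff', 'kickoff', 'delivery')
dt <- data.table(project, activity)

# Row positions in the whole table (what is needed for subsetting dt)
global_pos <- dt[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1

# Positions counted from the start of each group (not usable on dt directly)
group_pos <- dt[, which(.I >= first(.I[activity == 'funding'])), by = project]$V1

print(global_pos)
print(group_pos)
```

For project B, the first variant reports rows 6 and 7 of the table, while the which() variant reports positions 3 and 4 within the group.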
Now we have identified the rows that should not be shown. So we obtain the final result by excluding those row numbers:
df[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
If date information is available (and I bet there is a date column when dealing with projects and activities), we can do an anti non-equi join on the date column, as suggested by @Frank:
# create sample data with date column
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
date <- (as.Date ("2017-10-02") + c(1,4,7,2,5,8,11,3,6))
df <- data.frame(project,activity, date, stringsAsFactors = FALSE)
df <- df[order(df$date), ]
project activity date
1 A kickoff 2017-10-03
4 B kickoff 2017-10-04
8 C kickoff 2017-10-05
2 A funding 2017-10-06
5 B kickoff 2017-10-07
9 C delivery 2017-10-08
3 A delivery 2017-10-09
6 B funding 2017-10-10
7 B kickoff 2017-10-13
# anti non-equi join
setDT(df)[!df[activity == 'funding', first(date), by = project], on = .(project, date >= V1)]
project activity date
1: A kickoff 2017-10-03
2: B kickoff 2017-10-04
3: B kickoff 2017-10-07
4: C kickoff 2017-10-05
5: C delivery 2017-10-08
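For readers less familiar with non-equi joins, the logic of the anti join can be spelled out step by step in base R. This is only a sketch of the same rule (drop every row dated on or after the first funding date of its project), not a replacement for the data.table call; the names funding, first_funding, cutoff and keep are hypothetical.

```r
# Sample data with date column, as above
project  <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
              'funding', 'kickoff', 'kickoff', 'delivery')
date <- as.Date("2017-10-02") + c(1, 4, 7, 2, 5, 8, 11, 3, 6)
df <- data.frame(project, activity, date, stringsAsFactors = FALSE)

# First funding date per project, as a named numeric vector
# (projects without any funding row are simply absent)
funding <- df[df$activity == "funding", ]
first_funding <- vapply(split(as.numeric(funding$date), funding$project),
                        min, numeric(1))

# Look up the cutoff for every row; NA means the project never had funding
cutoff <- unname(first_funding[df$project])

# Keep rows dated strictly before the cutoff, or all rows if there is none
keep <- is.na(cutoff) | as.numeric(df$date) < cutoff
result <- df[keep, ]
result <- result[order(result$project, result$date), ]
print(result)
```

The comparison is strict because the join condition date >= V1 removes the funding row itself as well as everything after it.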
Some other alternatives with the data.table package:
1) With Reduce:
library(data.table)
setDT(df)[df[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]
2) With cummax:
library(data.table)
setDT(df)[df[, .I[!cummax(activity == 'funding')], project]$V1]
3) With pmax:
library(data.table)
setDT(df)[!df[, pmax(.I, .I[activity == 'funding']), by = project]$V1]