Extracting multiple rows for each ID based on a condition
I have a data frame with thousands of rows, but a sample is given below:
userid event
1 123 view
2 123 view
3 123 order
4 345 view
5 345 view
6 345 view
7 345 order
8 111 view
9 111 order
10 111 view
11 111 view
12 111 view
13 333 view
14 333 view
15 333 view
dput(data)
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")
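What I am trying to do is extract all rows for every userid that has the word "order" under event. The result should contain all rows for those userids, excluding userid = 333, since its eventaction contains no order entry.
The second task is to count the occurrences of "view" before each order entry. I would appreciate any help and pointers.
Thanks.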
We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'userid', and if there is any 'event' equal to 'order' within a 'userid', return the Subset of Data.table (.SD) for that group.
library(data.table)
setDT(data)[,if(any(event=="order")) .SD , by = userid]
Or using dplyr, we filter for any 'order' in 'event' after grouping by 'userid'.
library(dplyr)
data %>%
  group_by(userid) %>%
  filter(any(event == "order"))
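The grouped "any order" test used above can also be sketched without extra packages. This is only an illustration, not the answerer's method: it swaps in base R's ave() to broadcast a per-userid flag back to every row, and the data frame name mydat and the column has_order are assumptions introduced here.

```r
# Sample data rebuilt from the dput() output in the question.
mydat <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = factor(c("view", "view", "order", "view", "view", "view", "order",
                    "view", "order", "view", "view", "view",
                    "view", "view", "view"))
)

# 1 for "order" rows, 0 otherwise; max() within each userid is 1 exactly
# when that userid has at least one order. ave() repeats that group-level
# value on every row of the group.
has_order <- ave(as.integer(mydat$event == "order"), mydat$userid, FUN = max) == 1

# Keep all rows (views and orders) of userids that ordered at least once.
grouped_result <- mydat[has_order, ]
```

Under these assumptions grouped_result keeps the rows of userids 123, 345 and 111 and drops userid 333.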
Using base R, if you call your data.frame mydat:
myusers <- mydat[mydat$event == "order", "userid"]
mydat[mydat$userid %in% myusers,]
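For anyone who wants to run the two lines above end to end, here is a small self-contained sketch. The data is rebuilt from the question's dput() output; the unique() call and the name result are additions made here for illustration and are not needed for correctness.

```r
# Sample data rebuilt from the dput() output in the question.
mydat <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = factor(c("view", "view", "order", "view", "view", "view", "order",
                    "view", "order", "view", "view", "view",
                    "view", "view", "view"))
)

# userids that have at least one "order" row (unique() only removes repeats).
myusers <- unique(mydat[mydat$event == "order", "userid"])

# Every row whose userid appears in that list.
result <- mydat[mydat$userid %in% myusers, ]

# Quick checks: userid 333 never ordered, so it should be gone.
any(result$userid == 333)
nrow(result)
```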
You can do this:
df[df$userid %in% df[df$event=="order",]$userid,]
Or with subset:
subset(df, df$userid %in% subset(df, event=="order")$userid)
Or with the match function:
subset(df, match(df$userid, subset(df, event=="order")$userid, nomatch = 0)>0)
Or using the sqldf library:
library(sqldf)
sqldf("select * from df where df.userid in (select df.userid from df where df.event=='order')")
# userid event
# 1 123 view
# 2 123 view
# 3 123 order
# 4 345 view
# 5 345 view
# 6 345 view
# 7 345 order
# 8 111 view
# 9 111 order
# 10 111 view
# 11 111 view
# 12 111 view
For your second task, allowing for multiple orders per userid:
library(dplyr)
df %>%
  group_by(userid) %>%
  mutate(row_num = row_number()) %>%
  filter(event == "order") %>%
  mutate(num_views_before = c(first(row_num), diff(row_num)) - 1)
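Notes:
- We group_by userid.
- We add a column to track the row number within each group.
- We keep only the rows with "order".
- We use diff on the previously created row numbers to compute the number of views before each order.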
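The arithmetic in the last step can be checked by hand on a single group. The sketch below is illustrative only: it assumes one userid whose orders sit at within-group rows 2 and 5, and it uses row_num[1] in place of dplyr's first() so that it runs in base R.

```r
# Within-group positions of the "order" rows for one hypothetical userid.
row_num <- c(2L, 5L)

# Distance from the start of the group to the first order, then the
# distance from each order to the next one.
gaps <- c(row_num[1], diff(row_num))   # 2, 3

# Each gap includes the order row itself, so subtract 1 to count only
# the rows that come before it.
num_views_before <- gaps - 1           # 1, 2
```

This equals a count of views only when every row that is not an order is a view, which holds for the sample data.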
To test this, I modified your data by changing the event in row 12 to "order", so that userid=111 has two orders.
Modified data:
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), event = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"), row.names = c(NA,
-15L), class = "data.frame")
## userid event
##1 123 view
##2 123 view
##3 123 order
##4 345 view
##5 345 view
##6 345 view
##7 345 order
##8 111 view
##9 111 order
##10 111 view
##11 111 view
##12 111 order
##13 333 view
##14 333 view
##15 333 view
With this data, we get:
##Source: local data frame [4 x 4]
##Groups: userid [3]
##
## userid event row_num num_views_before
## <dbl> <fctr> <int> <dbl>
##1 123 order 3 2
##2 345 order 4 3
##3 111 order 2 1
##4 111 order 5 2
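The same counts can be reproduced without dplyr as a rough cross-check. This is a sketch under assumptions, not part of the answer above: the helper views_before_orders is a hypothetical name, the data is the modified sample (row 12 set to "order"), and every row that is not an order is treated as a view.

```r
# Modified sample data: row 12 is an "order", so userid 111 has two orders.
df <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = c("view", "view", "order", "view", "view", "view", "order",
             "view", "order", "view", "view", "order",
             "view", "view", "view"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one user's events, in their original order,
# count the rows that precede each order since the previous order.
views_before_orders <- function(ev) {
  pos <- which(ev == "order")
  if (length(pos) == 0) return(numeric(0))  # user never ordered
  c(pos[1], diff(pos)) - 1
}

# Apply the helper per userid; split() names the list by userid.
res <- lapply(split(df$event, df$userid), views_before_orders)
res
```

Here res[["111"]] is c(1, 2), res[["123"]] is 2, res[["345"]] is 3, and res[["333"]] is empty, matching the num_views_before column above.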