Extracting multiple rows for each ID based on a condition

I have a data frame with thousands of rows; a sample is shown below:

     userid     event
1     123        view
2     123        view
3     123       order
4     345        view
5     345        view
6     345        view
7     345       order
8     111        view
9     111       order
10    111        view
11    111        view
12    111        view
13    333        view
14    333        view
15    333        view

dput(data)

structure(list(userid = c(123, 123, 123, 345, 345, 345, 345, 
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order", 
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")

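What I am trying to do is extract all rows for every userid that has the word "order" under event. The result should contain all rows for those user IDs and exclude userid = 333, because its event column has no "order" entry.

The second task is to count the occurrences of "view" before the order entry. I would appreciate any help and pointers.

Thanks.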

We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'userid', and if any 'event' is "order" within a 'userid', return the Subset of Data.table (.SD).

library(data.table)
setDT(data)[,if(any(event=="order")) .SD , by = userid]

Or, using dplyr, we filter for any "order" in 'event' after grouping by 'userid'.

library(dplyr)
data %>%
    group_by(userid) %>%
    filter(any(event == "order"))
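For readers without dplyr, the grouped any() test can be sketched in base R with ave(), which applies a function within each group and recycles the result back onto every row of that group. This is only an illustration: the data frame is rebuilt from the question's sample with event as a character column, and the name has_order is a placeholder of my own.

```r
# Sample data rebuilt from the question (event kept as character for simplicity)
data <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = c("view", "view", "order", "view", "view", "view", "order",
             "view", "order", "view", "view", "view", "view", "view", "view"),
  stringsAsFactors = FALSE
)

# For every row: TRUE if that row's userid has at least one "order" event
has_order <- ave(data$event == "order", data$userid, FUN = any)

# Keep all rows of the users that placed an order (userid 333 drops out)
result <- data[has_order, ]
result
```

The logical vector has_order plays the same role as the condition inside filter(): it is evaluated once per group and then applied row by row.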

Using base R, if you call your data.frame mydat:

myusers <- mydat[mydat$event == "order", "userid"]
mydat[mydat$userid %in% myusers,]

You can do this:

df[df$userid %in% df[df$event=="order",]$userid,]

With subset:

subset(df, df$userid %in% subset(df, event=="order")$userid)

With the match function:

subset(df, match(df$userid, subset(df, event=="order")$userid, nomatch = 0)>0)

Or using the sqldf library:

library(sqldf)
sqldf("select * from df where df.userid in (select df.userid from df where df.event=='order')")

#    userid event
# 1     123  view
# 2     123  view
# 3     123 order
# 4     345  view
# 5     345  view
# 6     345  view
# 7     345 order
# 8     111  view
# 9     111 order
# 10    111  view
# 11    111  view
# 12    111  view
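As a sanity check, the base R variants above can be compared against each other on the sample data. The sketch below assumes nothing beyond base R; the object names (order_ids, by_index, by_subset, by_match) are mine and do not come from the answers.

```r
# Sample data rebuilt from the question (event kept as character for simplicity)
df <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = c("view", "view", "order", "view", "view", "view", "order",
             "view", "order", "view", "view", "view", "view", "view", "view"),
  stringsAsFactors = FALSE
)

# User IDs that have at least one "order" row
order_ids <- df[df$event == "order", "userid"]

# Three ways of keeping every row that belongs to those users
by_index  <- df[df$userid %in% order_ids, ]
by_subset <- subset(df, userid %in% order_ids)
by_match  <- subset(df, match(userid, order_ids, nomatch = 0) > 0)

# Each comparison should report that the results agree
all.equal(by_index, by_subset)
all.equal(by_index, by_match)
```

All three rely on the same idea: first collect the IDs that have an order, then keep the rows whose ID is in that set.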

For your second task, allowing for multiple orders per userid:

library(dplyr)
df %>% group_by(userid) %>% 
       mutate(row_num = row_number()) %>% 
       filter(event=="order") %>% 
       mutate(num_views_before=c(first(row_num),diff(row_num))-1)

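Notes:

  1. We group_by userid.
  2. We add a column that tracks the row number within each group.
  3. We keep only the rows with "order".
  4. We apply diff to the previously created row numbers to compute the number of views before each order.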
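To see why step 4 works, take a single user whose orders fall on within-group rows 2 and 5 (hypothetical positions chosen for illustration). The first order is preceded by every earlier row, and each later order by the rows since the previous order; subtracting 1 removes the order row itself from the count:

```r
# Hypothetical within-group row numbers of one user's "order" events
row_num <- c(2L, 5L)

# Gap to the previous order (or to the start of the group for the first order)
gaps <- c(row_num[1], diff(row_num))

# Subtract 1 so the order row itself is not counted as a view
num_views_before <- gaps - 1
num_views_before
# [1] 1 2
```

Note that this counts rows rather than "view" labels, so it relies on every non-order row being a view, which holds for the sample data.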

For testing, I modified your data by changing the event in row 12 to "order", so that userid=111 has two orders.

Modified data:

structure(list(userid = c(123, 123, 123, 345, 345, 345, 345, 
111, 111, 111, 111, 111, 333, 333, 333), event = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("order", 
"view"), class = "factor")), .Names = c("userid", "event"), row.names = c(NA, 
-15L), class = "data.frame")
##   userid event
##1     123  view
##2     123  view
##3     123 order
##4     345  view
##5     345  view
##6     345  view
##7     345 order
##8     111  view
##9     111 order
##10    111  view
##11    111  view
##12    111 order
##13    333  view
##14    333  view
##15    333  view

With this data, we get:

##Source: local data frame [4 x 4]
##Groups: userid [3]
##
##  userid  event row_num num_views_before
##   <dbl> <fctr>   <int>            <dbl>
##1    123  order       3                2
##2    345  order       4                3
##3    111  order       2                1
##4    111  order       5                2
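The table above can also be reproduced without dplyr. The following is a base R sketch of the same four steps, offered as a cross-check rather than as the answer's own method; it rebuilds the modified data with event as a character column, and the object name orders is a placeholder.

```r
# Modified data: row 12 is an "order", so userid 111 has two orders
df <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345,
             111, 111, 111, 111, 111, 333, 333, 333),
  event  = c("view", "view", "order", "view", "view", "view", "order",
             "view", "order", "view", "view", "order", "view", "view", "view"),
  stringsAsFactors = FALSE
)

# Steps 1-2: row number within each userid group
df$row_num <- ave(seq_along(df$userid), df$userid, FUN = seq_along)

# Step 3: keep only the "order" rows
orders <- df[df$event == "order", ]

# Step 4: rows since the previous order (or group start), minus the order row
orders$num_views_before <- ave(
  orders$row_num, orders$userid,
  FUN = function(r) c(r[1], diff(r)) - 1
)
orders
```

Here ave() stands in for group_by() plus mutate(): it evaluates the function once per userid and writes the results back in the original row order.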