Finding paired events with reshape
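I have a list of users and the items they purchased at particular times, and I want to generate a list of these pairs from the raw data. While I could, and probably will, write a small Python script to do it, I have a nagging feeling that the reshape (or, more likely, reshape2) package could do it in a few lines.
In code, I would like to convert the df data frame below into the resdf data frame: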
df <- data.frame(user=c("u1","u2","u1","u3","u2","u4","u5","u4"),
item=c("i1","i1","i2","i3","i2","i3","i3","i4"),
time=c(1,1,2,3,4,4,5,6))
> df
user item time
1 u1 i1 1
2 u2 i1 1
3 u1 i2 2
4 u3 i3 3
5 u2 i2 4
6 u4 i3 4
7 u5 i3 5
8 u4 i4 6
>
### some reshape code here
resdf <- data.frame(user=c("u1","u2","u4"),
item1=c("i1","i1","i3"),
item2=c("i2","i2","i4"),
time=c(1,1,4),
delt=c(1,3,2))
> resdf
user item1 item2 time delt
1 u1 i1 i2 1 1
2 u2 i1 i2 1 3
3 u4 i3 i4 4 2
Are there any reshape wizards who can help me with this?
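If you merge the rows with duplicated user values back onto the rows without duplicated values, you get the information you need, and a little rearranging then gives the desired layout: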
> merge(df[!duplicated(df$user), ], df[duplicated(df$user), ], by="user")
user item.x time.x item.y time.y
1 u1 i1 1 i2 2
2 u2 i1 1 i2 4
3 u4 i3 4 i4 6
> inter <- merge(df[!duplicated(df$user), ], df[duplicated(df$user), ], by="user")
> inter$delt <- inter$time.y-inter$time.x
> inter[ , c(1,2,4,3,6)]
user item.x item.y time.x delt
1 u1 i1 i2 1 1
2 u2 i1 i2 1 3
3 u4 i3 i4 4 2
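The merge steps above can also be sketched as a self-contained script that produces the column names of resdf directly. This is only a sketch under a few assumptions that are not in the original answer: the rows are sorted by user and time first so that the "first" row per user is the earliest purchase, the helper names firsts and laters are made up for illustration, and each user has at most two purchases (a user with three would yield two output rows).

```r
df <- data.frame(user = c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item = c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time = c(1,1,2,3,4,4,5,6),
                 stringsAsFactors = FALSE)

# Assumption: sort so the first row per user is the earliest purchase
df <- df[order(df$user, df$time), ]

firsts <- df[!duplicated(df$user), ]  # first purchase of every user
laters <- df[duplicated(df$user), ]   # any later purchase

# Inner join on user; suffixes give item1/time1 and item2/time2
inter <- merge(firsts, laters, by = "user", suffixes = c("1", "2"))

resdf <- data.frame(user  = inter$user,
                    item1 = inter$item1,
                    item2 = inter$item2,
                    time  = inter$time1,
                    delt  = inter$time2 - inter$time1,
                    stringsAsFactors = FALSE)
print(resdf)
```

Users with a single purchase (u3 and u5 here) drop out because merge performs an inner join by default.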
Here is my attempt using the data.table package (which also has a dcast function):
library(data.table)
setkey(setDT(df), user, time) # sorting by user and time so `head` and `diff` will work
df[, `:=`(indx = paste0("item", seq_len(.N)), # Creating all the needed variables
indx2 = .N,
time2 = head(time, 1),
delt = diff(time)),
user]
dcast(df[indx2 > 1L], # Decasting by the modified item column
user + time2 + delt ~ indx,
value.var = "item")
# user time2 delt item1 item2
# 1: u1 1 1 i1 i2
# 2: u2 1 3 i1 i2
# 3: u4 4 2 i3 i4
Here is a solution using dplyr:
library(dplyr)
df %>%
group_by(user) %>%
filter(n() == 2) %>%
arrange(time) %>%
summarise(
item1 = first(item),
item2 = last(item),
delt = last(time) - first(time),
time = first(time)
) %>%
select(user, item1, item2, time, delt)
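Since the question asked about reshaping specifically, here is a rough sketch of the same result using the reshape function from base R (stats), not the reshape or reshape2 packages named in the question. It assumes only users with exactly two purchases should be kept; the helper column seq and the intermediate names pairs and wide are made up for illustration.

```r
df <- data.frame(user = c("u1","u2","u1","u3","u2","u4","u5","u4"),
                 item = c("i1","i1","i2","i3","i2","i3","i3","i4"),
                 time = c(1,1,2,3,4,4,5,6),
                 stringsAsFactors = FALSE)

# Sort by user and time, then number each user's purchases 1, 2, ...
df <- df[order(df$user, df$time), ]
df$seq <- ave(df$time, df$user, FUN = seq_along)

# Assumption: keep only users with exactly two purchases
pairs <- df[ave(df$time, df$user, FUN = length) == 2, ]

# Long to wide: one row per user, columns item1, time1, item2, time2
wide <- reshape(pairs, idvar = "user", timevar = "seq",
                direction = "wide", sep = "")

resdf <- data.frame(user  = wide$user,
                    item1 = wide$item1,
                    item2 = wide$item2,
                    time  = wide$time1,
                    delt  = wide$time2 - wide$time1,
                    stringsAsFactors = FALSE)
print(resdf)
```

The numbering step is what makes the reshape possible: wide format needs a column saying which purchase is the first and which is the second, and the raw data does not have one.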