Aggregating sequential and grouped data in R

Question

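I have a dataset that looks like this toy example. The data describes the location a person moved to and the time elapsed since that move. For example, person 1 started out in a rural area, moved to a city 463 days ago (row 2), then moved from that city to a town 415 days ago (row 3), and so on.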

set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
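# Keep the sorted result: the answers below assume rows are ordered by user and descending time
df <- df[order(df[,"user"], -df[,"time"]), ]
df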

Data:

time destination user
 526       rural    1
 463        city    1
 415        town    1
 299        city    1
 179       rural    1
 938        town    2
 229        town    2
 118        city    2
 818        city    3
 195        city    3

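I would like to aggregate this data into the format below. That is, count each type of relocation within each user and sum the counts over all users into a matrix. How can I achieve this (preferably without writing a loop)?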

from  to     count
city  city   1
city  town   1
city  rural  1
town  city   2
town  town   1
town  rural  0
rural city   1
rural town   0
rural rural  0

Answer 1

Here is a data.table option:

library(data.table)

setDT(df)[
    ,
    setNames(
        rev(data.frame(embed(as.character(destination), 2))),
        c("from", "to")
    ), user
][, count := .N, .(from, to)][]

which gives

   user  from    to count
1:    1 rural  city     1
2:    1  city  town     1
3:    1  town  city     2
4:    1  city rural     1
5:    2  town  town     1
6:    2  town  city     2
7:    3  city  city     1
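
To make the role of embed() in the snippet above easier to see, here is a minimal base R sketch on a single sequence. The vector x is just user 1's destinations copied from the question, and the names pairs and moves are made up for this illustration. It builds the from/to columns by indexing the matrix directly instead of using rev(), which is a small substitution and not the answer's exact code.

```r
# One user's destinations in chronological order (user 1 from the question)
x <- c("rural", "city", "town", "city", "rural")

# embed(x, 2) returns a matrix with one row per consecutive pair:
# column 1 holds the later value, column 2 the value just before it
pairs <- embed(x, 2)

# Put the earlier value first to get (from, to) pairs
moves <- data.frame(from = pairs[, 2], to = pairs[, 1],
                    stringsAsFactors = FALSE)
print(moves)
```

Note that embed(x, 2) needs at least two elements, so a user with a single row would have to be filtered out first.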

Answer 2

One possible approach based on the data.table package:

library(data.table)

cases <- unique(df$destination)

setDT(df)[, .(from=destination, to=shift(destination, -1)), by=user
          ][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]


#      from     to count
#    <char> <char> <int>
# 1:   city   city     1
# 2:   city  rural     1
# 3:   city   town     1
# 4:  rural   city     1
# 5:  rural  rural     0
# 6:  rural   town     0
# 7:   town   city     2
# 8:   town  rural     0
# 9:   town   town     1
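
The two building blocks used above can be looked at in isolation. This is only a small sketch, assuming data.table is installed; x, nxt, and grid are names invented for the illustration. It uses type = "lead", which should be equivalent to the negative offset in the answer.

```r
library(data.table)

# Destinations of one user in chronological order (user 2 from the question)
x <- c("town", "town", "city")

# Shift the vector one step forward: each position gets the next destination,
# and the last position has no successor, so it becomes NA
nxt <- shift(x, 1, type = "lead")
print(nxt)

# CJ() builds every combination of its arguments, which is what lets the
# join report pairs that never occur with a count of 0
grid <- CJ(from = c("city", "town"), to = c("city", "town"))
print(grid)
```

Because the final row of each user has NA in the to column, it matches nothing in the CJ() grid and drops out of the counts.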

Answer 3

Here is a tidyverse solution:

library(dplyr)
library(purrr)

df %>%
  group_split(user) %>%
  map_dfr(~ bind_cols(as.character(.x[["destination"]][-nrow(.x)]), 
                  as.character(.x[["destination"]][-1])) %>%
        set_names("from", "to")) %>%
  group_by(from, to) %>%
  count()

# A tibble: 6 x 3
# Groups:   from, to [6]
  from  to        n
  <chr> <chr> <int>
1 city  city      1
2 city  rural     1
3 city  town      1
4 rural city      1
5 town  city      2
6 town  town      1

Answer 4

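Here is a dplyr-only solution:

Use the lead function to identify from and to, and combine them with paste0 into a helper column.
Remove the NA rows produced by lead.
Use add_count to add the n column.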

library(dplyr)

df %>% 
  group_by(user) %>% 
  rename(from = destination) %>% 
  mutate(to = lead(from), .before=3) %>% 
  mutate(helper = paste0(from, to)) %>% 
  filter(!is.na(to)) %>% 
  group_by(helper) %>% 
  add_count(helper, from, to) %>% 
  ungroup() %>% 
  select(user, from, to, n)

Output:

   user from  to        n
  <int> <fct> <fct> <int>
1     1 rural city      1
2     1 city  town      1
3     1 town  city      2
4     1 city  rural     1
5     2 town  town      1
6     2 town  city      2
7     3 city  city      1
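
If the matrix layout from the question is wanted, including the combinations that never occur, a base R sketch along the following lines might do. It is not taken from any of the answers. The data frame is retyped from the printed data in the question (already sorted by user and descending time), and places, by_user, trans, and counts are names chosen for this example.

```r
# Data as printed in the question, already sorted by user and descending time
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

places <- c("city", "town", "rural")

# Split destinations by user, then pair each location with the next one
by_user <- split(df$destination, df$user)
from <- unlist(lapply(by_user, function(x) head(x, -1)), use.names = FALSE)
to   <- unlist(lapply(by_user, function(x) tail(x, -1)), use.names = FALSE)

# Fixed factor levels make table() keep combinations with zero occurrences
trans <- table(from = factor(from, levels = places),
               to   = factor(to, levels = places))
print(trans)

# Long format with one row per (from, to) pair
counts <- as.data.frame(trans, responseName = "count")
print(counts)
```

Pairing within each user via split() keeps a move from being counted across the boundary between two users, and no explicit loop is needed.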
