Aggregating sequential and grouped data in R
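I have a dataset that looks like this toy example. The data describe the location a person moved to and the time since the move took place. For example, person 1 started out in a rural area but moved to a city 463 days ago (second row), moved from that city to a town 415 days ago (third row), and so on.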
set.seed(123)
df <- as.data.frame(sample.int(1000, 10))
colnames(df) <- "time"
df$destination <- as.factor(sample(c("city", "town", "rural"), size = 10, replace = TRUE, prob = c(.50, .25, .25)))
df$user <- sample.int(3, 10, replace = TRUE)
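# Sort by user, then by descending time, and keep the result so that later code sees the rows in order
df <- df[order(df[,"user"], -df[,"time"]), ]
df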
Data:
time destination user
526 rural 1
463 city 1
415 town 1
299 city 1
179 rural 1
938 town 2
229 town 2
118 city 2
818 city 3
195 city 3
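I would like to aggregate this data into the format below. That is, count the types of relocation for each user and sum them into one matrix. How can I achieve this (preferably without writing loops)?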
from to count
city city 1
city town 1
city rural 1
town city 2
town town 1
town rural 0
rural city 1
rural town 0
rural rural 0
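Here is a data.table option:
library(data.table)
# Pair each destination with the next one within each user, then count each (from, to) pair
setDT(df)[
  ,
  setNames(
    rev(data.frame(embed(as.character(destination), 2))),
    c("from", "to")
  ), user
][, count := .N, .(from, to)][]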
This gives:
user from to count
1: 1 rural city 1
2: 1 city town 1
3: 1 town city 2
4: 1 city rural 1
5: 2 town town 1
6: 2 town city 2
7: 3 city city 1
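The answer above relies on how base R's embed() lays out consecutive values, which may not be obvious at first sight. A minimal sketch on a three-element vector (the names moves and pairs are illustrative, not part of the answer):

```r
# Illustrative vector of consecutive destinations for a single user
moves <- c("rural", "city", "town")

# embed(x, 2) returns one row per consecutive pair, with the later value
# in column 1 and the earlier value in column 2
embed(moves, 2)
#      [,1]   [,2]
# [1,] "city" "rural"
# [2,] "town" "city"

# rev() on the data frame swaps the two columns, so that the earlier value
# comes first and can be labelled "from"
pairs <- setNames(rev(data.frame(embed(moves, 2))), c("from", "to"))
pairs
```

Note that embed() stops with an error when the vector is shorter than the requested dimension, so a user with only one recorded row would need to be handled separately.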
One possible approach based on the data.table package:
library(data.table)
cases <- unique(df$destination)
setDT(df)[, .(from=destination, to=shift(destination, -1)), by=user
][CJ(from=cases, to=cases), .(count=.N), by=.EACHI, on=c("from", "to")]
# from to count
# <char> <char> <int>
# 1: city city 1
# 2: city rural 1
# 3: city town 1
# 4: rural city 1
# 5: rural rural 0
# 6: rural town 0
# 7: town city 2
# 8: town rural 0
# 9: town town 1
Here is a tidyverse solution:
library(dplyr)
library(purrr)
df %>%
  group_split(user) %>%
  map_dfr(~ bind_cols(as.character(.x[["destination"]][-nrow(.x)]),
                      as.character(.x[["destination"]][-1])) %>%
            set_names("from", "to")) %>%
  group_by(from, to) %>%
  count()
# A tibble: 6 x 3
# Groups: from, to [6]
from to n
<chr> <chr> <int>
1 city city 1
2 city rural 1
3 city town 1
4 rural city 1
5 town city 2
6 town town 1
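The result above lists only the six transitions that actually occur, while the requested output also contains the combinations with a count of 0. A minimal sketch of one way to add them, assuming the tidyr package is installed and using its complete() function. The data frame retypes the rows printed in the question, and the names places and transitions are illustrative:

```r
library(dplyr)
library(tidyr)

# Rows as printed in the question (sorted by user, then by descending time)
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

# All locations that should appear on either side of a move
places <- c("city", "town", "rural")

transitions <- df %>%
  group_by(user) %>%
  # Pair each destination with the next one for the same user
  mutate(from = destination, to = lead(destination)) %>%
  ungroup() %>%
  # The last row of each user has no following move
  filter(!is.na(to)) %>%
  count(from, to, name = "count") %>%
  # Add the from/to combinations that never occur, with a count of 0
  complete(from = places, to = places, fill = list(count = 0L))

transitions
```

This should return nine rows, one per from/to combination, with the observed counts unchanged.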
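Here is a dplyr-only solution:
- Use the lead function to identify from and to, and combine them with paste0 into a helper column.
- Remove the NA values introduced by lead.
- Use add_count to add the n column.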
df %>%
  group_by(user) %>%
  rename(from = destination) %>%
  mutate(to = lead(from), .before=3) %>%
  mutate(helper = paste0(from, to)) %>%
  filter(!is.na(to)) %>%
  group_by(helper) %>%
  add_count(helper, from, to) %>%
  ungroup() %>%
  select(user, from, to, n)
Output:
user from to n
<int> <fct> <fct> <int>
1 1 rural city 1
2 1 city town 1
3 1 town city 2
4 1 city rural 1
5 2 town town 1
6 2 town city 2
7 3 city city 1
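For comparison, the same counts can be sketched in base R without any packages. This is not taken from any of the answers above. It assumes the rows are already sorted by user and by descending time, retypes the data printed in the question, and uses illustrative names (places, pairs, tab, result). Fixing the factor levels makes table() report the unused combinations as 0:

```r
# Rows as printed in the question (sorted by user, then by descending time)
df <- data.frame(
  time = c(526, 463, 415, 299, 179, 938, 229, 118, 818, 195),
  destination = c("rural", "city", "town", "city", "rural",
                  "town", "town", "city", "city", "city"),
  user = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)
)

places <- c("city", "town", "rural")

# For each user, pair every destination with the one that follows it
pairs <- do.call(rbind, lapply(split(df$destination, df$user), function(d) {
  data.frame(from = head(d, -1), to = tail(d, -1))
}))

# Cross-tabulate with fixed levels so that absent combinations count as 0
tab <- table(from = factor(pairs$from, levels = places),
             to   = factor(pairs$to,   levels = places))
tab

# Long format with one row per from/to combination
result <- as.data.frame(tab, responseName = "count")
result
```

Here tab is the 3 x 3 transition matrix and result is its long form with from, to and count columns.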