Summarize data within multiple groups of a time series
I have a series of bird observations made at different sites and times. The data frame looks like this:
birdID site ts
1 A 2013-04-15 09:29
1 A 2013-04-19 01:22
1 A 2013-04-20 23:13
1 A 2013-04-22 00:03
1 B 2013-04-22 14:02
1 B 2013-04-22 17:02
1 C 2013-04-22 14:04
1 C 2013-04-22 15:18
1 C 2013-04-23 00:54
1 A 2013-04-23 01:20
1 A 2013-04-24 23:07
1 A 2013-04-30 23:47
1 B 2013-04-30 03:51
1 B 2013-04-30 04:26
2 C 2013-04-30 04:29
2 C 2013-04-30 18:49
2 A 2013-05-01 01:03
2 A 2013-05-01 23:15
2 A 2013-05-02 00:09
2 C 2013-05-03 07:57
2 C 2013-05-04 07:21
2 C 2013-05-05 02:54
2 A 2013-05-05 03:27
2 A 2013-05-14 00:16
2 D 2013-05-14 10:00
2 D 2013-05-14 15:00
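For anyone who wants to run the code in this thread, the table above can be rebuilt roughly as follows. This is only a sketch: the name `df` and the column names come from the question, while the integer `birdID`, character `site`, and the UTC time zone are assumptions.

```r
# Rebuild the example data shown above.
# Assumptions: birdID is integer, site is character, timestamps are UTC.
df <- data.frame(
  birdID = rep(c(1L, 2L), times = c(14L, 12L)),
  site = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A", "B", "B",
           "C", "C", "A", "A", "A", "C", "C", "C", "A", "A", "D", "D"),
  ts = as.POSIXct(
    c("2013-04-15 09:29", "2013-04-19 01:22", "2013-04-20 23:13", "2013-04-22 00:03",
      "2013-04-22 14:02", "2013-04-22 17:02", "2013-04-22 14:04", "2013-04-22 15:18",
      "2013-04-23 00:54", "2013-04-23 01:20", "2013-04-24 23:07", "2013-04-30 23:47",
      "2013-04-30 03:51", "2013-04-30 04:26", "2013-04-30 04:29", "2013-04-30 18:49",
      "2013-05-01 01:03", "2013-05-01 23:15", "2013-05-02 00:09", "2013-05-03 07:57",
      "2013-05-04 07:21", "2013-05-05 02:54", "2013-05-05 03:27", "2013-05-14 00:16",
      "2013-05-14 10:00", "2013-05-14 15:00"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
```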
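I want to summarize the data so that it shows each bird's first and last detection at each site, and the duration at each site, while preserving information about repeat visits to a site (i.e., if a bird moves between sites A > B > C > A > B, I want each visit to sites A and B shown separately rather than lumped together).
I would like to produce output like this, which keeps the start (min_ts), end (max_ts), and duration (days) of each visit: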
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-22 00:03 6.6
1 B 2013-04-22 14:02 2013-04-22 17:02 0.1
1 C 2013-04-22 14:04 2013-04-23 00:54 0.5
1 A 2013-04-23 01:20 2013-04-30 23:47 7.9
1 B 2013-04-30 03:51 2013-04-30 04:26 0.02
2 C 2013-04-30 4:29 2013-04-30 18:49 0.6
2 A 2013-05-01 01:03 2013-05-02 00:09 0.96
2 C 2013-05-03 07:57 2013-05-05 02:54 1.8
2 A 2013-05-05 03:27 2013-05-14 00:16 8.8
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
I tried this code, which produces the correct variables but lumps all the information for a site together and does not preserve repeat visits:
df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
1 A 2013-04-15 09:29 2013-04-30 23:47 15.6
1 B 2013-04-22 14:02 2013-04-30 4:26 7.6
1 C 2013-04-22 14:04 2013-04-23 0:54 0.5
2 C 2013-04-30 04:29 2013-05-05 2:54 4.9
2 A 2013-05-01 01:03 2013-05-14 0:16 12.9
2 D 2013-05-14 10:00 2013-05-14 15:00 0.2
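I realize that grouping by site is the problem, but if I remove it as a grouping variable, the data are summarized without any site information. I tried the following. It does not run, but I feel it is close to a solution: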
df <- df %>%
  group_by(birdID) %>%
  summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_),
            max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_),
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d))
One possibility is:
library(dplyr)

df %>%
  # rle() numbers each run of consecutive identical sites, so repeat visits get distinct ids
  group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
birdID site min_ts max_ts days
<int> <chr> <dttm> <dttm> <drtn>
1 1 A 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
2 1 B 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
3 1 C 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
4 1 A 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
5 1 B 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
6 2 C 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
7 2 A 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
8 2 C 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
9 2 A 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10 2 D 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
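This creates an rleid()-like grouping variable that numbers each run of consecutive identical sites, and then computes the time difference within each run.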
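To see what the rle()-based expression does, here is a small base R sketch on a toy vector (the values are illustrative, not taken from the question's data):

```r
# Toy vector of consecutive site detections (illustrative values only)
site <- c("A", "A", "B", "C", "C", "A")

# rle() stores the value and the length of each run of identical elements
runs <- rle(site)
runs$values   # "A" "B" "C" "A"
runs$lengths  # 2 1 2 1

# Number the runs 1..n, then repeat each number for the length of its run
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id        # 1 1 2 3 3 4
```

The second run of "A" gets its own id (4), which is what keeps repeat visits apart. Note that rle() does not accept factors, so a factor site column would need as.character() first.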
Or use rleid() from data.table explicitly:
library(dplyr)
library(data.table)  # provides rleid()

df %>%
  group_by(birdID, site, rleid = rleid(site)) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)
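As a side note, rleid() accepts more than one vector, so the run id can be made to change whenever either the bird or the site changes. A minimal sketch on toy vectors (illustrative values, assuming data.table is installed):

```r
library(data.table)

# Toy vectors (illustrative values only)
birdID <- c(1, 1, 1, 2, 2)
site   <- c("A", "A", "B", "B", "A")

# Run id based on site alone: rows 3 and 4 share an id even though the bird changes
by_site <- rleid(site)               # 1 1 2 2 3

# Run id based on both vectors: a change in either one starts a new run
by_bird_site <- rleid(birdID, site)  # 1 1 2 3 4
```

In the pipeline above this distinction does not matter, because birdID is already a grouping variable.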
Another approach is to create the grouping variable with lag and cumsum.
library(dplyr)

df %>%
  # The counter increases by one each time site differs from the previous row
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)
# A tibble: 10 x 4
# birdID min_ts max_ts days
# <int> <dttm> <dttm> <drtn>
# 1 1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2 1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3 1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4 1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5 1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6 2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7 2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8 2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9 2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10 2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
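The output above no longer has a site column, because site is not among the grouping variables. One way to keep it, sketched here on toy data, is to add site to the grouping. The names `toy`, `visit`, and `visits` are made up for illustration, and computing the counter per bird with mutate() is a variation on the code above, not the answer's exact method.

```r
library(dplyr)

# Toy data (illustrative values, not the original observations)
toy <- tibble(
  birdID = c(1L, 1L, 1L, 1L, 2L, 2L),
  site   = c("A", "A", "B", "A", "A", "A"),
  ts     = as.POSIXct(c("2013-04-15 09:00", "2013-04-16 09:00",
                        "2013-04-17 09:00", "2013-04-18 09:00",
                        "2013-04-19 09:00", "2013-04-20 21:00"),
                      format = "%Y-%m-%d %H:%M", tz = "UTC")
)

visits <- toy %>%
  group_by(birdID) %>%
  # Within each bird, start a new visit whenever site differs from the previous row
  mutate(visit = cumsum(site != lag(site, default = first(site)))) %>%
  # site is constant within a visit, so grouping by it only carries the column through
  group_by(birdID, visit, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days"),
            .groups = "drop") %>%
  select(-visit)

visits$site              # "A" "B" "A" "A"
as.numeric(visits$days)  # 1.0 0.0 0.0 1.5
```

Placing visit before site in group_by() keeps the rows in visit order within each bird, so no extra arrange() is needed here.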