Compute the amount of time between timestamps depending on a group variable in dplyr
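I've been racking my brain over the following problem.
Suppose I have transcript data of speakers. Each row is a particular utterance by a speaker (given by speaker_id), along with the Timestamp at which they started speaking. Sometimes consecutive rows contain utterances from the same speaker, because they continued speaking after a pause.
I want to work out the total time each speaker spoke in the dataset, which requires computing the difference between the timestamp at which a speaker starts and the first timestamp of the next speaker. How can I do this?
Here is a sample of the dataset: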
file = read.table(text = "speaker_id,Timestamp
5,2022-03-30 03:00:00
5,2022-03-30 03:00:24
3,2022-03-30 03:00:52
3,2022-03-30 03:00:56
3,2022-03-30 03:00:58
5,2022-03-30 03:01:25
5,2022-03-30 03:02:15
3,2022-03-30 03:03:14
5,2022-03-30 03:03:36
3,2022-03-30 03:04:26
3,2022-03-30 03:06:02
3,2022-03-30 03:06:10
5,2022-03-30 03:06:28
5,2022-03-30 03:07:28
3,2022-03-30 03:08:56
5,2022-03-30 03:09:11
5,2022-03-30 03:10:02
5,2022-03-30 03:10:56
3,2022-03-30 03:11:53
5,2022-03-30 03:12:20", header = TRUE, sep = ",")
Any ideas?
One option could be:
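library(dplyr)
library(lubridate)  # for ymd_hms()
file %>%
  # Each row ends when the next row starts; the last row has no successor.
  mutate(Timestamp2 = lead(Timestamp, default = last(Timestamp))) %>%
  # rleid increases every time the speaker changes, so each group is one turn.
  group_by(speaker_id,
           rleid = cumsum(speaker_id != lag(speaker_id, default = first(speaker_id)))) %>%
  mutate(Timestamp_diff = last(ymd_hms(Timestamp2)) - first(ymd_hms(Timestamp)))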
speaker_id Timestamp Timestamp2 rleid Timestamp_diff
<int> <chr> <chr> <int> <drtn>
1 5 2022-03-30 03:00:00 2022-03-30 03:00:24 0 52 secs
2 5 2022-03-30 03:00:24 2022-03-30 03:00:52 0 52 secs
3 3 2022-03-30 03:00:52 2022-03-30 03:00:56 1 33 secs
4 3 2022-03-30 03:00:56 2022-03-30 03:00:58 1 33 secs
5 3 2022-03-30 03:00:58 2022-03-30 03:01:25 1 33 secs
6 5 2022-03-30 03:01:25 2022-03-30 03:02:15 2 109 secs
7 5 2022-03-30 03:02:15 2022-03-30 03:03:14 2 109 secs
8 3 2022-03-30 03:03:14 2022-03-30 03:03:36 3 22 secs
9 5 2022-03-30 03:03:36 2022-03-30 03:04:26 4 50 secs
10 3 2022-03-30 03:04:26 2022-03-30 03:06:02 5 122 secs
11 3 2022-03-30 03:06:02 2022-03-30 03:06:10 5 122 secs
12 3 2022-03-30 03:06:10 2022-03-30 03:06:28 5 122 secs
13 5 2022-03-30 03:06:28 2022-03-30 03:07:28 6 148 secs
14 5 2022-03-30 03:07:28 2022-03-30 03:08:56 6 148 secs
15 3 2022-03-30 03:08:56 2022-03-30 03:09:11 7 15 secs
16 5 2022-03-30 03:09:11 2022-03-30 03:10:02 8 162 secs
17 5 2022-03-30 03:10:02 2022-03-30 03:10:56 8 162 secs
18 5 2022-03-30 03:10:56 2022-03-30 03:11:53 8 162 secs
19 3 2022-03-30 03:11:53 2022-03-30 03:12:20 9 27 secs
20 5 2022-03-30 03:12:20 2022-03-30 03:12:20 10 0 secs
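The output above repeats the length of each turn on every row of that turn, so it still has to be collapsed to get one total per speaker. A minimal sketch of that last step, assuming the same sample data (rebuilt here as second offsets from the first timestamp so the block runs on its own); the names transcript, next_start, turn_id, turn_secs and total_secs are made up for illustration and are not part of the answer above:

```r
library(dplyr)

# Sample data from the question, expressed as offsets in seconds from 03:00:00.
transcript <- data.frame(
  speaker_id = c(5, 5, 3, 3, 3, 5, 5, 3, 5, 3, 3, 3, 5, 5, 3, 5, 5, 5, 3, 5),
  Timestamp = as.POSIXct("2022-03-30 03:00:00", tz = "UTC") +
    c(0, 24, 52, 56, 58, 85, 135, 194, 216, 266,
      362, 370, 388, 448, 536, 551, 602, 656, 713, 740)
)

totals <- transcript %>%
  mutate(
    # Start of the following row; the last row has no successor, so it ends where it starts.
    next_start = lead(Timestamp, default = last(Timestamp)),
    # Counter that increases whenever the speaker changes (one value per turn).
    turn_id = cumsum(speaker_id != lag(speaker_id, default = first(speaker_id)))
  ) %>%
  # One row per turn, with its length in seconds.
  group_by(speaker_id, turn_id) %>%
  summarise(
    turn_secs = as.numeric(difftime(last(next_start), first(Timestamp), units = "secs")),
    .groups = "drop"
  ) %>%
  # Add up the turns of each speaker.
  group_by(speaker_id) %>%
  summarise(total_secs = sum(turn_secs))

totals
```

On this sample the turn lengths match the Timestamp_diff column shown above, and they add up to 219 seconds for speaker 3 and 521 seconds for speaker 5. The final utterance contributes 0 seconds because nothing in the data says when it ended.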
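Here is an approach using {dplyr}:
- Convert the Timestamp column with as.POSIXct
- Compute the difference between the current timestamp and the next one using dplyr::lead
- Group by speaker_id
- Summarize the duration for each speaker. Use na.rm = TRUE because the duration of the last row will be missing/NA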
library(dplyr)
file %>%
  mutate(
    Timestamp = as.POSIXct(Timestamp),
    duration = lead(Timestamp) - Timestamp) %>%
  group_by(speaker_id) %>%
  summarize(total_duration = sum(duration, na.rm = TRUE))
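One thing that may be worth guarding against: subtracting two POSIXct vectors lets R pick the units automatically from the size of the differences, so a dataset with only long gaps could come back in minutes rather than seconds. A hedged variant of the same idea that pins the units with difftime() and checks the result against the overall span of the recording; the data frame transcript and the names per_speaker, total_secs and span_secs are assumptions for this sketch:

```r
library(dplyr)

# Sample data from the question, expressed as offsets in seconds from 03:00:00.
transcript <- data.frame(
  speaker_id = c(5, 5, 3, 3, 3, 5, 5, 3, 5, 3, 3, 3, 5, 5, 3, 5, 5, 5, 3, 5),
  Timestamp = as.POSIXct("2022-03-30 03:00:00", tz = "UTC") +
    c(0, 24, 52, 56, 58, 85, 135, 194, 216, 266,
      362, 370, 388, 448, 536, 551, 602, 656, 713, 740)
)

per_speaker <- transcript %>%
  # Gap to the next row, always in seconds; NA for the last row.
  mutate(duration = difftime(lead(Timestamp), Timestamp, units = "secs")) %>%
  group_by(speaker_id) %>%
  summarise(total_secs = sum(as.numeric(duration), na.rm = TRUE))

# Sanity check: every gap belongs to exactly one speaker, so the totals
# should add up to the time between the first and the last timestamp.
span_secs <- as.numeric(
  difftime(max(transcript$Timestamp), min(transcript$Timestamp), units = "secs")
)
stopifnot(sum(per_speaker$total_secs) == span_secs)

per_speaker
```

Because each row's gap is credited to the speaker of that row, summing row by row gives the same per-speaker totals as summing whole turns, without needing a run-length id.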