dplyr: Adding a new row at the end of each group, calculated on variables from the previous row

Question

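Key problem

I can fill a new row with values from the previous row, and I can assign constants to variables in the new row. What I cannot do is calculate values based on previous rows and assign them in the new row.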

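Background

I have real-life data from a PLC that I am preparing to convert into an event log for use with bupaR. The data below is limited and simplified, but contains information about resource, timestamp, state type and event_ID.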

Already achieved

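I added Error_ID, Error_startTS, Error_endTS and part of the life cycle, as described in another SO question.
An error is defined as any series of events that starts with State_type == "Error" and runs until an event other than "Error", "Comlink Down" or "Not active" is encountered.
An error number is assigned to all rows of the same "error trace" ("Error_ID").
The start time of the error (the timestamp of the first error row) is assigned ("Error_startTS").
The end time of the error is assigned: the timestamp of the first row after the error, in other words the timestamp of the event that ends the error ("Error_endTS").

A "Lifecycle_ID" is assigned to the rows of the error: "Start" or "Ongoing".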

Objective:

Now I want to insert a new row with Lifecycle_ID == "Complete" after the last row ("Start" or "Ongoing") of each "error trace".

Details

Solvable with fill(): copy from the last row

"Resource"
"Error_ID"
"Error_startTS"
"Error_endTS"

Solvable with add_row(): assign a constant

"Lifecycle_ID" should be "Complete"
"State_type" should be "Error"

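Problematic for me: assigning a value based on values in previous rows

The timestamp "Datetime_local" should get the value of "Error_endTS" within the group
"event_ID" should be incremented by 1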

Data

my_df <- structure( list(Resource = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("L54", "L60", "L66", "L68", "L70", "L76", "L78", "L95", "L96", "L97", "L98", "L99"), class = "factor"), Datetime_local = structure(c(1535952594, 1535952618, 1535952643, 1535952651, 1535952787, 1535952835, 1535952840, 1535952846, 1535952890, 1535952949, 1535952952, 1535952958, 1535953066), class = c("POSIXct", "POSIXt"), tzone = ""), State_type = structure(c(6L, 4L, 8L, 4L, 8L, 4L, 12L, 4L, 8L, 4L, 12L, 4L, 12L), .Label = c("Comlink Down", "Comlink Up", "Counter", "Error", "Message", "No part in", "No part out", "Not active", "Part changing", "Part in", "Part out", "Producing", "Waiting"), class = "factor"), event_ID = c("e00000000000072160", "e00000000000072270", "e00000000000072400", "e00000000000072430", "e00000000000072810", "e00000000000073110", "e00000000000073150", "e00000000000073170", "e00000000000073300", "e00000000000073520", "e00000000000073540", "e00000000000073570", "e00000000000074040"), Error_ID = c(0, 1, 1, 1, 1, 1, 0, 2, 2, 2, 0, 3, 0), Error_startTS = structure(c(NA, 1535952618, 1535952618, 1535952618, 1535952618, 1535952618, NA, 1535952846, 1535952846, 1535952846, NA, 1535952958, NA), class = c("POSIXct", "POSIXt"), tzone = ""), Error_endTS = structure(c(NA, 1535952840, 1535952840, 1535952840, 1535952840, 1535952840, NA, 1535952952, 1535952952, 1535952952, NA, 1535953066, NA), class = c("POSIXct", "POSIXt"), tzone = ""), Lifecycle_ID = c(NA, "Start", "Ongoing", "Ongoing", "Ongoing", "Ongoing", NA, "Start", "Ongoing", "Ongoing", NA, "Start", NA)), .Names = c("Resource", "Datetime_local", "State_type", "event_ID", "Error_ID", "Error_startTS", "Error_endTS", "Lifecycle_ID"), row.names = 160:172, class = "data.frame")

...which looks like this

#    Resource  Datetime_local       State_type  event_ID            Error_ID  Error_startTS        Error_endTS          Lifecycle_ID
160  L60       2018-09-03 07:29:54  No part in  e00000000000072160  0         <NA>                 <NA>                 <NA>
161  L60       2018-09-03 07:30:18  Error       e00000000000072270  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Start
162  L60       2018-09-03 07:30:43  Not active  e00000000000072400  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
163  L60       2018-09-03 07:30:51  Error       e00000000000072430  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
164  L60       2018-09-03 07:33:07  Not active  e00000000000072810  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
165  L60       2018-09-03 07:33:55  Error       e00000000000073110  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
166  L60       2018-09-03 07:34:00  Producing   e00000000000073150  0         <NA>                 <NA>                 <NA>
167  L60       2018-09-03 07:34:06  Error       e00000000000073170  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Start
168  L60       2018-09-03 07:34:50  Not active  e00000000000073300  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
169  L60       2018-09-03 07:35:49  Error       e00000000000073520  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
170  L60       2018-09-03 07:35:52  Producing   e00000000000073540  0         <NA>                 <NA>                 <NA>
171  L60       2018-09-03 07:35:58  Error       e00000000000073570  3         2018-09-03 07:35:58  2018-09-03 07:37:46  Start
172  L60       2018-09-03 07:37:46  Producing   e00000000000074040  0         <NA>                 <NA>                 <NA>

UDF

library(dplyr)
library(tidyr)   # fill()
library(tibble)  # add_row()

ErrorNumberAddLastRow <- function(df){
  df %>%
    mutate_if(is.factor, as.character) %>%
    group_by(Error_ID) %>%
    # append one row per group and set the constant columns
    do(add_row(., Lifecycle_ID = "Complete", State_type = "Error")) %>%
    ungroup() %>%
    # copy the remaining columns down from the previous row
    fill("Resource", "event_ID", "Error_ID", "Error_startTS", "Error_endTS") %>%
    # mutate(event_ID = event_ID+1) %>%        # error: non-numeric argument to binary operator.
    # mutate(Datetime_local = Error_endTS) %>% # assigns the same TS to the whole group
    arrange(event_ID) %>%
    filter( !(Error_ID==0 & Lifecycle_ID=="Complete") | is.na(Lifecycle_ID))
}

Calling the UDF

ErrorNumberAddLastRow(my_df)

gives this result

# A tibble: 16 x 8
    Resource  Datetime_local       State_type  event_ID            Error_ID  Error_startTS        Error_endTS          Lifecycle_ID
    <chr>     <dttm>               <chr>       <chr>               <dbl>     <dttm>               <dttm>               <chr>
 1  L60       2018-09-03 07:29:54  No part in  e00000000000072160  0         NA                   NA                   NA
 2  L60       2018-09-03 07:30:18  Error       e00000000000072270  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Start
 3  L60       2018-09-03 07:30:43  Not active  e00000000000072400  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
 4  L60       2018-09-03 07:30:51  Error       e00000000000072430  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
 5  L60       2018-09-03 07:33:07  Not active  e00000000000072810  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
 6  L60       2018-09-03 07:33:55  Error       e00000000000073110  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
 7  L60       NA                   Error       e00000000000073110  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Complete
 8  L60       2018-09-03 07:34:00  Producing   e00000000000073150  0         NA                   NA                   NA
 9  L60       2018-09-03 07:34:06  Error       e00000000000073170  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Start
10  L60       2018-09-03 07:34:50  Not active  e00000000000073300  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
11  L60       2018-09-03 07:35:49  Error       e00000000000073520  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
12  L60       NA                   Error       e00000000000073520  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Complete
13  L60       2018-09-03 07:35:52  Producing   e00000000000073540  0         NA                   NA                   NA
14  L60       2018-09-03 07:35:58  Error       e00000000000073570  3         2018-09-03 07:35:58  2018-09-03 07:37:46  Start
15  L60       NA                   Error       e00000000000073570  3         2018-09-03 07:35:58  2018-09-03 07:37:46  Complete
16  L60       2018-09-03 07:37:46  Producing   e00000000000074040  0         NA                   NA                   NA

Desired result

# # A tibble: 16 x 8
#     Resource  Datetime_local       State_type  event_ID            Error_ID  Error_startTS        Error_endTS          Lifecycle_ID
#     <chr>     <dttm>               <chr>       <chr>               <dbl>     <dttm>               <dttm>               <chr>
#  1  L60       2018-09-03 07:29:54  No part in  e00000000000072160  0         NA                   NA                   NA
#  2  L60       2018-09-03 07:30:18  Error       e00000000000072270  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Start
#  3  L60       2018-09-03 07:30:43  Not active  e00000000000072400  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
#  4  L60       2018-09-03 07:30:51  Error       e00000000000072430  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
#  5  L60       2018-09-03 07:33:07  Not active  e00000000000072810  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
#  6  L60       2018-09-03 07:33:55  Error       e00000000000073110  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Ongoing
#  7  L60       2018-09-03 07:34:00  Error       e00000000000073111  1         2018-09-03 07:30:18  2018-09-03 07:34:00  Complete
#  8  L60       2018-09-03 07:34:00  Producing   e00000000000073150  0         NA                   NA                   NA
#  9  L60       2018-09-03 07:34:06  Error       e00000000000073170  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Start
# 10  L60       2018-09-03 07:34:50  Not active  e00000000000073300  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
# 11  L60       2018-09-03 07:35:49  Error       e00000000000073520  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Ongoing
# 12  L60       2018-09-03 07:35:52  Error       e00000000000073521  2         2018-09-03 07:34:06  2018-09-03 07:35:52  Complete
# 13  L60       2018-09-03 07:35:52  Producing   e00000000000073540  0         NA                   NA                   NA
# 14  L60       2018-09-03 07:35:58  Error       e00000000000073570  3         2018-09-03 07:35:58  2018-09-03 07:37:46  Start
# 15  L60       2018-09-03 07:37:46  Error       e00000000000073571  3         2018-09-03 07:35:58  2018-09-03 07:37:46  Complete
# 16  L60       2018-09-03 07:37:46  Producing   e00000000000074040  0         NA                   NA                   NA

In detail

In rows 7, 12 and 15:

event_ID is incremented by 1

the group's "Error_endTS" is assigned to the Datetime_local timestamp

When you uncomment the mutate statements in the function

mutate(event_ID = event_ID+1) %>%

...an error occurs

Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.
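
The error appears because event_ID is a character string such as "e00000000000073110", so `+` has no numeric operand to work with. A minimal base R sketch of one possible workaround follows: strip the leading letter, add 1 to the numeric part, and pad it back to the original width. The helper name increment_event_id is hypothetical, and the sketch assumes every ID is one letter followed only by digits, all of the same length.

```r
# Hypothetical helper: bump the numeric part of an ID such as "e00000000000073110".
# Assumes one leading letter followed only by digits, and that all IDs share one width.
# Note: as.numeric() is exact only up to about 9e15, which is enough for the IDs shown here.
increment_event_id <- function(id) {
  prefix <- substr(id, 1, 1)                  # leading letter, e.g. "e"
  digits <- substr(id, 2, nchar(id))          # zero-padded numeric part
  bumped <- as.numeric(digits) + 1            # arithmetic works on the numeric part
  # Build a format such as "%s%017.0f" so the result keeps the original width
  fmt <- paste0("%s%0", nchar(digits[1]), ".0f")
  sprintf(fmt, prefix, bumped)
}

increment_event_id("e00000000000073110")
```

Because the width is taken from the input, a carry in the last digit (for example an ID ending in 9) does not change the length of the ID.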

mutate(Datetime_local = Error_endTS) %>%

...this assigns the same timestamp to the whole group
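
An unconditional mutate() overwrites Datetime_local on every row, not only on the newly added one. One possible fix, sketched here on a small made-up data frame (toy is hypothetical, not the PLC data), is to overwrite only the rows whose Lifecycle_ID is "Complete" by using dplyr::if_else().

```r
library(dplyr)

# Toy data (hypothetical): two existing rows plus one freshly added "Complete" row
toy <- tibble(
  Datetime_local = as.POSIXct(c("2018-09-03 07:30:18", "2018-09-03 07:30:43", NA), tz = "UTC"),
  Error_endTS    = as.POSIXct(rep("2018-09-03 07:34:00", 3), tz = "UTC"),
  Lifecycle_ID   = c("Start", "Ongoing", "Complete")
)

fixed <- toy %>%
  mutate(
    # %in% returns FALSE for NA, so rows without a Lifecycle_ID keep their own timestamp
    Datetime_local = if_else(Lifecycle_ID %in% "Complete", Error_endTS, Datetime_local)
  )

fixed
```

Using `%in%` instead of `==` is a deliberate choice here: rows outside an error trace have an NA Lifecycle_ID, and `==` would propagate that NA into the timestamp.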

Thank you for any help you can give me.

Answer 1

Here is an idea

library(tidyverse)
library(gsubfn)

my_df %>%
  split(.$Error_ID) %>%
  map_dfr(~ add_row(.x, 
                    Lifecycle_ID = "Complete", 
                    State_type = "Error", 
                    # Take the last event_ID in each group, find the last digit 
                    # in the string, convert it to numeric and add +1
                    event_ID = gsubfn("\\d{1}$", ~ as.numeric(x) + 1, last(.$event_ID)),
                    # Assign Datetime_local to the last Error_endTS in each group
                    Datetime_local = last(.$Error_endTS))) %>%
  fill("Resource", "Error_ID", "Error_startTS", "Error_endTS")
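
For comparison, a similar result could probably be reached without gsubfn by building the "Complete" rows from the last row of each error trace and binding them back onto the data. This is only a sketch under assumptions: the toy data frame is made up, the function name add_complete_rows is hypothetical, all event IDs are assumed to be one letter plus a fixed number of digits, and across() and slice_tail() require dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data (hypothetical), shaped like a small slice of my_df
toy <- tibble(
  Resource       = "L60",
  Datetime_local = as.POSIXct("2018-09-03 07:30:00", tz = "UTC") + c(0, 18, 43, 240),
  State_type     = c("No part in", "Error", "Not active", "Producing"),
  event_ID       = c("e0072160", "e0072270", "e0072400", "e0073150"),
  Error_ID       = c(0, 1, 1, 0),
  Error_endTS    = as.POSIXct(c(NA, "2018-09-03 07:34:00", "2018-09-03 07:34:00", NA), tz = "UTC"),
  Lifecycle_ID   = c(NA, "Start", "Ongoing", NA)
)

add_complete_rows <- function(df) {
  df <- df %>% mutate(across(where(is.factor), as.character))
  # Format such as "%s%07.0f": keeps the zero-padded width of the original IDs
  fmt <- paste0("%s%0", nchar(df$event_ID[1]) - 1, ".0f")

  complete_rows <- df %>%
    filter(Error_ID != 0) %>%        # rows outside an error trace get no "Complete" row
    group_by(Error_ID) %>%
    slice_tail(n = 1) %>%            # last row of each error trace
    ungroup() %>%
    mutate(
      Lifecycle_ID   = "Complete",
      State_type     = "Error",
      Datetime_local = Error_endTS,  # only the new rows are touched
      event_ID       = sprintf(fmt, substr(event_ID, 1, 1),
                               as.numeric(substr(event_ID, 2, nchar(event_ID))) + 1)
    )

  # Fixed-width IDs sort correctly as strings, which restores the event order
  bind_rows(df, complete_rows) %>% arrange(event_ID)
}

add_complete_rows(toy)
```

Since the new rows are copied from the last row of their group, Resource, Error_ID and the error timestamps come along without a separate fill() step.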


