Compare whether incident is new or already exists

Question

There is an algorithm that identifies issues/incidents in a network. Afterwards, it writes all cases to a database. Let's say it looks like this (simplified):

ID	Date	Case
A1	2022-01-01	1
B1	2022-01-01	2
C1	2022-01-01	3
A1	2022-01-02	NA
C1	2022-01-02	NA
A1	2022-01-03	NA
B1	2022-01-03	NA
C1	2022-01-03	NA

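Each row represents one incident.
Now I want to determine whether an incident already existed the last time we ran this script. To do that, it should take the current date and compare it with the last existing date in the table.
Note: the last date is not necessarily yesterday; it can be up to 7 days back.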

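So the logic would be:

Compare the second-highest Date value within this ID group with the second-highest Date of the complete df.

If they are the same, the case already exists, so take its last Case number. If it is new, create a new Case number (max(Case) + 1).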

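Update 11.05.2022 - 17:02:

It should respect existing Case values and not overwrite them. In other words, it should only fill the NAs. There will never be NAs in between: existing cases always have a number and new cases never do. 100%.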

Expected result:

ID	Date	Case	Comment
A1	2022-01-01	1
B1	2022-01-01	2
C1	2022-01-01	3
A1	2022-01-02	1
C1	2022-01-02	3
A1	2022-01-03	1
B1	2022-01-03	4	New case, as there wasn't B1 on 2022-01-02
C1	2022-01-03	3

I was able to determine the second-highest date:

> df[, nth(unique(Date),length(unique(Date))-1), ID]
   ID         V1
1: A1 2022-01-02 ## TRUE, as it's the second highest Date
2: B1 2022-01-01 ## FALSE, as it's not the second highest Date
3: C1 2022-01-02 ## TRUE, as it's the second highest Date
> df[, nth(unique(Date),length(unique(Date))-1)]
[1] "2022-01-02" ## Second highest Date in df

But now I'm struggling to create a new column based on this condition. Can someone help? A data.table solution is preferred, but dplyr would be great too.

MWE

library(data.table)

df = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
            Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
            Case = NA)


Goal = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
                Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
                Case=c(1,2,3,1,3,1,4,3))

Answer 1

How about this:

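# d: days since this ID's previous row (1 for the ID's first row)
df[order(Date), d:=c(1,diff(Date)), by = ID][
  # number each distinct (ID, d) run, then drop the helper column
  order(d,ID),case:=rleid(ID,d)][
    ,d:=NULL]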

Output:

   ID       Date case
1: A1 2022-01-01    1
2: B1 2022-01-01    2
3: C1 2022-01-01    3
4: A1 2022-01-02    1
5: C1 2022-01-02    3
6: A1 2022-01-03    1
7: B1 2022-01-03    4
8: C1 2022-01-03    3
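
To see why this works, it can help to look at the helper column d before it is dropped. The sketch below runs the two steps separately on the question's data. It assumes df has no Case column yet, which matches the output shown above.

```r
library(data.table)

df <- data.table(
  ID   = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01",
                   "2022-01-02", "2022-01-02",
                   "2022-01-03", "2022-01-03", "2022-01-03"))
)

# Step 1: per ID, days since that ID's previous row (1 for its first row).
# Only B1 on 2022-01-03 gets d = 2, because B1 was missing on 2022-01-02.
df[order(Date), d := c(1, diff(Date)), by = ID]
print(df)

# Step 2: sort by gap, then ID, and give each distinct (ID, d) run an id.
# The rows with d = 1 come first (A1, B1, C1), the d = 2 row comes last.
df[order(d, ID), case := rleid(ID, d)]
print(df)
```

The B1 row with d = 2 sorts after all d = 1 rows, so rleid() gives it the next free number.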

If you really want the comment column, you can extend the above like this:

df[order(Date), d:=c(1,diff(Date)), by = ID][
  order(d,ID),`:=`(
    case=rleid(ID,d),
    comment=fifelse(d!=1,paste0("New case, as there was no ", ID, " on ",Date-1),""))][
      ,d:=NULL][]

Output:

   ID       Date case                                    comment
1: A1 2022-01-01    1                                           
2: B1 2022-01-01    2                                           
3: C1 2022-01-01    3                                           
4: A1 2022-01-02    1                                           
5: C1 2022-01-02    3                                           
6: A1 2022-01-03    1                                           
7: B1 2022-01-03    4 New case, as there was no B1 on 2022-01-02
8: C1 2022-01-03    3
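
The code above treats a gap of exactly one calendar day as "same case" and numbers everything from scratch. The question also says that runs can be up to 7 days apart and that existing Case values must be kept. Below is a minimal sketch of one way to cover both, not a tested drop-in solution. The helper columns run and new_case and the variable offset are names made up here, and the input is assumed to already have case numbers on the first day, as in the question's first table.

```r
library(data.table)

# Assumed input: the first run already has case numbers, later rows are NA.
df <- data.table(
  ID   = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01",
                   "2022-01-02", "2022-01-02",
                   "2022-01-03", "2022-01-03", "2022-01-03")),
  Case = c(1L, 2L, 3L, NA, NA, NA, NA, NA)
)

setorder(df, Date, ID)

# Number the distinct run dates 1, 2, 3, ... so that "previous run"
# does not have to mean "yesterday".
df[, run := rleid(Date)]

# A row opens a new case unless the same ID was present in the run before.
df[, new_case := is.na(shift(run)) | run - shift(run) != 1L, by = ID]

# Number only new cases that have no Case yet, continuing after max(Case).
offset <- if (all(is.na(df$Case))) 0L else max(df$Case, na.rm = TRUE)
df[new_case & is.na(Case), Case := offset + seq_len(.N)]

# The remaining NAs continue the ID's previous case.
df[, Case := nafill(Case, type = "locf"), by = ID]

df[, c("run", "new_case") := NULL]
print(df)
```

Existing numbers are never overwritten, because only NA rows are assigned. If Case is NA everywhere, offset is 0 and the numbering starts at 1.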

r

dplyr

data.table