Compare whether an incident is new or already exists
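There is an algorithm that identifies issues/incidents in a network. Afterwards, it writes all cases into a database. Let's say it looks like this (simplified):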
ID | Date | Case |
---|---|---|
A1 | 2022-01-01 | 1 |
B1 | 2022-01-01 | 2 |
C1 | 2022-01-01 | 3 |
A1 | 2022-01-02 | NA |
C1 | 2022-01-02 | NA |
A1 | 2022-01-03 | NA |
B1 | 2022-01-03 | NA |
C1 | 2022-01-03 | NA |
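Each row represents one incident.
Now I want to determine whether an incident already existed the last time we ran this script. To do so, it should take the current date and compare it with the last existing date in the table.
Note: the last date is not necessarily yesterday; the gap can be up to 7 days.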
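So the logic would be:
- Compare the second-highest `Date` value within this `ID` group with the second-highest `Date` value in the complete df.
- If they are the same, the case already exists, so take the last `Case` number. If it is new, create a new `Case` number (`max(Case) + 1`).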
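Update 11.05.2022 - 17:02:
- It should respect the existing `Case` values rather than overwrite them. In other words, it should only overwrite/fill the `NA`s. There will never be `NA`s in between: existing cases always have a number and new cases never do. 100%.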
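The rule described above can be sketched roughly as follows. This is only a minimal sketch, not a tested production solution: `fill_cases` is a hypothetical helper name, and it assumes that "the last existing date" means the closest earlier `Date` present in the table (so a gap of several days between runs is fine) and that each `ID` appears at most once per date. Existing `Case` numbers are never overwritten; only `NA`s are filled.

```r
library(data.table)

# Hypothetical helper (not from the original post): fill only the NA entries
# of Case, walking through the dates from oldest to newest.
fill_cases <- function(dt) {
  out   <- copy(dt)                 # do not modify the caller's table
  id    <- out$ID
  date  <- out$Date
  case  <- as.integer(out$Case)     # also turns an all-NA logical column into integers
  dates <- sort(unique(date))
  for (k in seq_along(dates)) {
    todo <- which(date == dates[k] & is.na(case))
    # rows of the previous date that exists in the table (none for the first date)
    prev <- if (k > 1L) which(date == dates[k - 1L]) else integer(0)
    for (r in todo) {
      hit <- prev[id[prev] == id[r]]
      if (length(hit) > 0L && !is.na(case[hit[1L]])) {
        case[r] <- case[hit[1L]]    # ID was present last time: reuse its number
      } else {
        # new incident: max(Case) + 1, or 1 if no number exists yet
        case[r] <- if (all(is.na(case))) 1L else max(case, na.rm = TRUE) + 1L
      }
    }
  }
  set(out, j = "Case", value = case)
  out
}

# Sample table from the question (assumed here with the first three cases numbered)
example <- data.table(
  ID   = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02",
                   "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03")),
  Case = c(1, 2, 3, NA, NA, NA, NA, NA)
)
filled <- fill_cases(example)
print(filled)
```

On the sample table this fills the five `NA`s with 1, 3, 1, 4, 3, so B1 on 2022-01-03 gets the new number 4. The explicit loop is chosen for readability; for large tables a join-based version would likely be faster.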
Expected result:
ID | Date | Case | Comment |
---|---|---|---|
A1 | 2022-01-01 | 1 | |
B1 | 2022-01-01 | 2 | |
C1 | 2022-01-01 | 3 | |
A1 | 2022-01-02 | 1 | |
C1 | 2022-01-02 | 3 | |
A1 | 2022-01-03 | 1 | |
B1 | 2022-01-03 | 4 | New case, as there was no B1 on 2022-01-02 |
C1 | 2022-01-03 | 3 | |
I was able to determine the second-highest date:
> df[, nth(unique(Date),length(unique(Date))-1), ID]
ID V1
1: A1 2022-01-02 ## TRUE, as it's the second highest Date
2: B1 2022-01-01 ## FALSE, as it's not the second highest Date
3: C1 2022-01-02 ## TRUE, as it's the second highest Date
> df[, nth(unique(Date),length(unique(Date))-1)]
[1] "2022-01-02" ## Second highest Date in df
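`nth()` comes from dplyr, so the calls above need `library(dplyr)` in addition to data.table. As a possible alternative, the same comparison can be sketched with base R sorting alone and turned into a per-`ID` flag. `second_highest`, `prev_date`, and `existed` are illustrative names, not from the original post.

```r
library(data.table)

df <- data.table(
  ID   = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02",
                   "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03"))
)

# Hypothetical helper: second-highest distinct value (NA if there is only one)
second_highest <- function(x) sort(unique(x), decreasing = TRUE)[2]

overall <- second_highest(df$Date)   # second-highest Date in the whole table
flags <- df[, .(prev_date = second_highest(Date)), by = ID]
# TRUE when the ID's second-highest Date equals the overall one
flags[, existed := prev_date == overall]
print(flags)
```

For the sample data this flags A1 and C1 as `TRUE` and B1 as `FALSE`. An `ID` that occurs on a single date only gets `NA`, which would need separate handling.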
But now I am struggling to create a new column based on this condition. Can someone help? A data.table solution is preferred, but dplyr would be great too.
MWE
library(data.table)
df = data.table(ID = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
                Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02",
                                 "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03")),
                Case = NA)
Goal = data.table(ID = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
                  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02",
                                   "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03")),
                  Case = c(1, 2, 3, 1, 3, 1, 4, 3))
How about this:
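# d = gap in days to the previous row of the same ID (1 for its first row)
df[order(Date), d := c(1, diff(Date)), by = ID][
  # sort by d, then ID, and number each (ID, d) run
  order(d, ID), case := rleid(ID, d)][
  , d := NULL]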
Output:
ID Date case
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4
8: C1 2022-01-03 3
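To see why this works on the sample data, it may help to look at the intermediate column `d` before it is dropped. The snippet below only re-runs the first step of the code above on the MWE table (built here without the `Case` column, an assumption made for brevity).

```r
library(data.table)

df <- data.table(
  ID   = c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
  Date = as.Date(c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02",
                   "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03"))
)

# Step 1 only: gap in days to the previous row of the same ID (1 for the first row)
df[order(Date), d := c(1, diff(Date)), by = ID]
print(df)
# Only B1 on 2022-01-03 has d = 2. After ordering by d and ID, rleid(ID, d)
# numbers A1, B1, C1 (all with d = 1) as 1, 2, 3 and that B1 row as 4.
```

Note that this relies on a one-day gap meaning "same case". If runs can be up to 7 days apart, as mentioned in the question, the expected gap would probably have to be derived from the dates present in the table rather than fixed at 1; that extension is not shown here.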
If you really want the comment column, you can extend the above like this:
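# same gap column d as in the first snippet
df[order(Date), d := c(1, diff(Date)), by = ID][
  order(d, ID), `:=`(
    case = rleid(ID, d),
    # add a comment on rows whose gap to the previous row of the same ID is not 1 day
    comment = fifelse(d != 1, paste0("New case, as there was no ", ID, " on ", Date - 1), ""))][
  , d := NULL][]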
Output:
ID Date case comment
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4 New case, as there was no B1 on 2022-01-02
8: C1 2022-01-03 3