Using data.table to identify all event occurrences, picking only the first occurrence when they are in sequence
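I am trying to identify all occurrences of an event and, when the event repeats in sequence, pick only the first occurrence. I can flag the event and add a count, but I cannot reset the count after the event changes.
My data has about 1 million rows and 30-odd IDs. I have included only one ID here. The table has an ID, a date-time, and a status.
Status is the event and can take multiple values: A, B, C, ... The event I care about is B.
I want to add three columns:
Occurrence_B - flag indicating that the event is B
Count_B - count of consecutive occurrences of event = B, reset when the event changes
Include_B - flag showing whether this particular event is the first occurrence or a continued one
I will then subset the data on Include_B='new' to pick the first occurrence in each sequence.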
ID Date Status Occurrence_B Count_B Include_B
A 7/28/15 12:00 AM A 0 0 0
A 7/28/15 12:30 AM A 0 0 0
A 7/30/15 12:00 AM B 1 1 new
A 7/31/15 12:00 AM B 1 2 continued
A 7/31/15 11:00 AM B 1 3 continued
A 8/2/15 10:00 AM B 0 0 0
A 8/3/15 12:00 AM C 0 0 0
A 8/4/15 12:00 AM B 1 1 new
A 8/5/15 12:00 AM B 1 2 continued
A 8/6/15 12:00 AM A 1 0 continued
A 8/7/15 12:00 AM B 1 1 new
My sample code:
d1[, Occurrence_B:=Status %in% c('B')+0L]
d1[, Count_B := cumsum(Occurrence_B), by=.(ID,Status)]
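The problem is that I don't know how to reset Count_B once the event changes. I have been looking into it, but I am new to data.table, so any help is much appreciated.
Let me know if you have any questions.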
SK
You can try it like this:
# Create the Occurrence_B column and initialize Include_B as NA
(d1[, `:=` (Occurrence_B = as.integer(Status == "B"), Include_B = NA_character_)]
# Calculate Count_B, using rleid(Occurrence_B) as the grouping variable so that
# consecutive identical values fall into the same group
[, Count_B := cumsum(Occurrence_B), by = rleid(Occurrence_B)]
# Update Include_B in place based on Count_B: Count_B == 1 marks the first occurrence
# in a run, Count_B > 1 marks a continuation, and all other rows stay NA
[Count_B == 1, Include_B := "new"][Count_B > 1, Include_B := "continued"][])
# ID Date Status Occurrence_B Count_B Include_B
# 1: A 7/28/15 12:00 AM A 0 0 NA
# 2: A 7/28/15 12:30 AM A 0 0 NA
# 3: A 7/30/15 12:00 AM B 1 1 new
# 4: A 7/31/15 12:00 AM B 1 2 continued
# 5: A 7/31/15 11:00 AM B 1 3 continued
# 6: A 8/2/15 10:00 AM B 1 4 continued
# 7: A 8/3/15 12:00 AM C 0 0 NA
# 8: A 8/4/15 12:00 AM B 1 1 new
# 9: A 8/5/15 12:00 AM B 1 2 continued
#10: A 8/6/15 12:00 AM A 0 0 NA
#11: A 8/7/15 12:00 AM B 1 1 new
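Since the real data has 30-odd IDs, a run of B should probably not carry over from the last row of one ID into the first row of the next. Below is a minimal sketch of one way to handle that, by passing ID to rleid() as well. It assumes the rows are already sorted by ID and then by date-time. The second ID ("Z") and its Status values are made up for illustration, and the Date column is left out to keep the sample short.

```r
library(data.table)

# rleid() gives every run of consecutive equal values its own integer id
rleid(c("A", "A", "B", "B", "A"))
# [1] 1 1 2 2 3

# Hypothetical sample: ID "A" uses the Status values from the question,
# ID "Z" is invented to show what happens at an ID boundary
d1 <- data.table(
  ID     = rep(c("A", "Z"), times = c(11, 3)),
  Status = c("A", "A", "B", "B", "B", "B", "C", "B", "B", "A", "B",
             "B", "B", "A")
)

d1[, `:=`(Occurrence_B = as.integer(Status == "B"), Include_B = NA_character_)]

# A new run starts whenever ID or Occurrence_B changes, so the count
# restarts at the first row of each ID even if the previous ID ended on B
d1[, Count_B := cumsum(Occurrence_B), by = rleid(ID, Occurrence_B)]

d1[Count_B == 1, Include_B := "new"]
d1[Count_B > 1, Include_B := "continued"]

# Keep only the first B of every run (rows where Include_B is NA are dropped)
first_B <- d1[Include_B == "new"]
print(first_B)
```

In this sample the last row of ID "A" and the first row of ID "Z" are both B, and both are marked "new", so first_B has four rows. With by = rleid(Occurrence_B) alone they would be counted as one run.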