How to create a new column in dataframe A, based on multiple conditions in dataframe B

I think I can't see the forest for the trees, so I'd like to ask for help.

Data: see the dput() output at the end of the question.

- Dataframe dfA with the columns ID and ts; ts is POSIXct.

- Dataframe dfB with the columns ID (corresponding to the ID in dfA), start (POSIXct), end (POSIXct), and state_id.

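Task:

I want to create a new column in dfA with the values 1/0 based on a condition. The condition: if the IDs in dfA and dfB match, and the timestamp dfA$ts lies between dfB$start and dfB$end or equals either of them, then dfA$x should be set to 1; otherwise it should be 0.

I think the code should look something like this: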

dfA$x <- ifelse(dfA$ID == dfB$ID & dfA$ts >= dfB$start & dfA$ts <= dfB$end, 1, 0)

Thanks in advance for your help.

dput(dfB):

structure(list(ID = c(1151L, 1151L, 1150L, 1150L, 1150L, 1150L, 1152L, 1152L, 1152L, 1345L), start = structure(c(1443142500, 1443144600, 1442934900, 1442942400, 1442944800, 1442946300, 1443103500, 1443132600, 1443137400, 1443389400), class = c("POSIXct", "POSIXt" )), end = structure(c(1443143400, 1443145500, 1442935500, 1442943000, 1442945400, 1442950200, 1443106200, 1443134100, 1443140100, 1443392400 ), class = c("POSIXct", "POSIXt")), state_id = c(1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L)), row.names = c(NA, -10L), class = "data.frame")

dput(dfA):

structure(list(ID = c(1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1151L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1150L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1152L, 1345L, 1345L, 1345L, 1345L, 1345L, 1345L, 1345L, 1345L, 1345L, 1345L), ts = structure(c(1443141300, 1443141600, 1443141900, 1443142200, 1443142500, 1443142800, 1443143100, 1443143400, 1443143700, 1443144000, 1443144300, 1443144600, 1443144900, 1443145200, 1443145500, 1443145800, 1443146100, 1443146400, 1443146700, 1443147000, 1442934900, 1442935200, 1442935500, 1442935800, 1442936100, 1442936400, 1442936700, 1442937000, 1442937300, 1442937600, 1442937900, 1442938200, 1443103500, 1443103800, 1443104100, 1443104400, 1443104700, 1443105000, 1443105300, 1443105600, 1443105900, 1443106200, 1443106500, 1443106800, 1443107100, 1443107400, 1443107700, 1443108000, 1443369300, 1443369600, 1443369900, 1443370200, 1443370500, 1443370800, 1443371100, 1443371400, 1443371700, 1443372000), class = c("POSIXct", "POSIXt" ))), row.names = c(NA, -58L), class = "data.frame")

We can left_join dfA and dfB by 'ID', group_by each ID and ts, and check whether any of that group's intervals contains the timestamp.

library(dplyr)

dfA %>%
  left_join(dfB, by = 'ID') %>%
  group_by(ID, ts) %>% 
  summarise(x = +any(ts >= start & ts <= end))


#      ID ts                      x
#   <int> <dttm>              <int>
# 1  1150 2015-09-22 23:15:00     1
# 2  1150 2015-09-22 23:20:00     1
# 3  1150 2015-09-22 23:25:00     1
# 4  1150 2015-09-22 23:30:00     0
# 5  1150 2015-09-22 23:35:00     0
# 6  1150 2015-09-22 23:40:00     0
# 7  1150 2015-09-22 23:45:00     0
# 8  1150 2015-09-22 23:50:00     0
# 9  1150 2015-09-22 23:55:00     0
#10  1150 2015-09-23 00:00:00     0
# … with 48 more rows
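
One edge case worth noting: if an ID in dfA has no rows at all in dfB, the left join fills start and end with NA, so any() returns NA rather than 0. The sketch below is a variation on the code above, using small made-up data (not the dput() data from the question), and assumes that such IDs should simply get 0. Recent dplyr versions may also warn about a many-to-many join here, which is expected for this kind of data.

```r
library(dplyr)

# Hypothetical stand-in data: ID 3 has no intervals in dfB
dfA <- data.frame(ID = c(1L, 1L, 3L),
                  ts = as.POSIXct(c(100, 500, 100), origin = "1970-01-01", tz = "UTC"))
dfB <- data.frame(ID = c(1L, 1L),
                  start = as.POSIXct(c(50, 300), origin = "1970-01-01", tz = "UTC"),
                  end   = as.POSIXct(c(100, 400), origin = "1970-01-01", tz = "UTC"))

res <- dfA %>%
  left_join(dfB, by = "ID") %>%
  group_by(ID, ts) %>%
  # na.rm = TRUE makes an ID without any interval yield 0 instead of NA;
  # .groups = "drop" returns an ungrouped tibble
  summarise(x = +any(ts >= start & ts <= end, na.rm = TRUE), .groups = "drop")

res
```

The result has one row per (ID, ts) pair, so it lines up with dfA as long as dfA has no duplicated (ID, ts) rows.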

You can use data.table, doing the conditional replacement with :=:

data.table::setDT(dfA)
dfA[,value := 0L]
dfA[(get('ID') == dfB$ID) & (get('ts') >= dfB$start) & (get('ts') <= dfB$end), value := 1L]

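Note that I wrapped the dfA column names in get calls to avoid confusion with the dfB columns.

In this case dfA and dfB must have the same number of rows. If they do not, use a merge based on the ID column.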
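
The merge-based route mentioned above could look roughly like the following in base R. This is only a sketch under assumptions: the data is a small made-up stand-in for the question's dput() output, and row_id, m, hit and hits are helper names introduced here for illustration.

```r
# Hypothetical stand-in data; ID 3 has no intervals in dfB
dfA <- data.frame(ID = c(1L, 1L, 2L, 3L),
                  ts = as.POSIXct(c(100, 500, 150, 100), origin = "1970-01-01", tz = "UTC"))
dfB <- data.frame(ID = c(1L, 1L, 2L),
                  start = as.POSIXct(c(50, 300, 200), origin = "1970-01-01", tz = "UTC"),
                  end   = as.POSIXct(c(100, 400, 250), origin = "1970-01-01", tz = "UTC"))

# Tag each dfA row so matches can be mapped back after the merge
dfA$row_id <- seq_len(nrow(dfA))

# One row per (dfA row, dfB interval) pair sharing an ID;
# all.x = TRUE keeps dfA rows whose ID is absent from dfB
m <- merge(dfA, dfB, by = "ID", all.x = TRUE)

# TRUE where ts lies in the closed interval [start, end];
# rows without an interval (NA start) count as FALSE
m$hit <- !is.na(m$start) & m$ts >= m$start & m$ts <= m$end

# A dfA row gets 1 if any of its candidate intervals matched
hits <- tapply(m$hit, m$row_id, any)
dfA$x <- as.integer(hits[as.character(dfA$row_id)])
dfA$row_id <- NULL

dfA
```

Because the merge pairs every dfA row with every interval of the same ID, the intermediate table can get large when both data frames have many rows per ID.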

Here is an update join using a non-equi join:

library(data.table)

setDT(dfA); setDT(dfB)

dfA[dfB, match := 1, on=.(ID=ID, ts>=start, ts<=end)][, match:=ifelse(is.na(match), 0, match)]

      ID                  ts match
 1: 1151 2015-09-25 07:35:00     0
 2: 1151 2015-09-25 07:40:00     0
 3: 1151 2015-09-25 07:45:00     0
 4: 1151 2015-09-25 07:50:00     0
 5: 1151 2015-09-25 07:55:00     1
 6: 1151 2015-09-25 08:00:00     1
 7: 1151 2015-09-25 08:05:00     1
 8: 1151 2015-09-25 08:10:00     1
 9: 1151 2015-09-25 08:15:00     0
10: 1151 2015-09-25 08:20:00     0
...