Rolling Count of Events Over Time Series

Question

I'm trying to calculate a rolling count/sum of occurrences by group over a time frame.

I have a data frame with some sample data, like this:

dates = as.Date(c("2011-10-09",
        "2011-10-15",
        "2011-10-16", 
        "2011-10-18", 
        "2011-10-21", 
        "2011-10-22", 
        "2011-10-24"))

group1=c("A",
         "C",
         "A", 
         "A", 
         "L", 
         "F", 
         "A")
group2=c("D",
         "A",
         "B", 
         "H", 
         "A", 
         "A", 
         "E")

df1 <- data.frame(dates, group1, group2)

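I'm iterating over a separate data frame for each unique 'group', so as an example this is what the group for "A" looks like (it appears in every row, whether in group1 or group2).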

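For "A" (and then for every other group), I want to count the number of events that occur within a time frame: the 'date' of the event (i.e. the current row's date) plus the previous 4 days. I want the count to roll forward, so for example row 1 would have a count of 1, row 2 would also have a count of 1 (no events in the past 4 days apart from the current date), row 3 would have 2, row 4 would have 3, and so on.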

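For each row, I want to end up with a column that basically says: on this event date, X events occurred on the current date (as shown in the dates column) and in the previous 4 days.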

Answer 1

For this example, you can probably use sapply to go through each row and count the entries that fall on that day or up to 4 days earlier, like this:

df1$lastFour <-
  sapply(df1$dates, function(x){
    sum(df1$dates <= x & df1$dates >= x - 4)
  })

This gives df1 as:

       dates group1 group2 lastFour
1 2011-10-09      A      D        1
2 2011-10-15      C      A        1
3 2011-10-16      A      B        2
4 2011-10-18      A      H        3
5 2011-10-21      L      A        2
6 2011-10-22      F      A        3
7 2011-10-24      A      E        3
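
If the window length might change, the same check can be wrapped in a small function. The following is only a sketch: `count_in_window` and its `days_back` argument are hypothetical names introduced here for illustration, and the dates are the ones from the question.

```r
# Hypothetical helper (not part of the original answer): for every date d in
# the vector, count the entries whose date falls in [d - days_back, d].
count_in_window <- function(dates, days_back = 4) {
  vapply(seq_along(dates), function(i) {
    sum(dates <= dates[i] & dates >= dates[i] - days_back)
  }, integer(1))
}

dates <- as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                   "2011-10-21", "2011-10-22", "2011-10-24"))

lastFour <- count_in_window(dates)                # current day + previous 4 days
lastOne  <- count_in_window(dates, days_back = 1) # current day + previous day
lastFour
# 1 1 2 3 2 3 3
```

Like the sapply version, this compares every date against every other date, so it is simple rather than fast; that should be fine for small data but may be slow on very large sets.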

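If, as your question implies, your data come from a larger set and you want to run the analysis for each group (conceptually, I think the question is: how many events involved this group in the past four days, asked only on days when that group has an event), you can follow the steps below.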

First, here is some larger sample data, with groups labelled by the first 10 letters of the alphabet:

biggerData <-
  data.frame(
    dates = sample(seq(as.Date("2011-10-01")
                       , as.Date("2011-10-31")
                       , 1)
                   , 100, TRUE)
    , group1 = sample(LETTERS[1:10], 100, TRUE)
    , group2 = sample(LETTERS[1:10], 100, TRUE)
  )

Next, I extract all of the groups in the data (here I know them, but for your real data you may or may not already have that list of groups):

groupsInData <-
  sort(unique(c(as.character(biggerData$group1)
                , as.character(biggerData$group2))))

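Then I loop over that vector of group names, extract every event that has the group in either of the two group columns, add the same column as above, and save the separate data.frames in a list (naming them to make it easier to access/track them).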

sepGroupCounts <- lapply(groupsInData, function(thisGroup){
  dfTemp <- biggerData[biggerData$group1 == thisGroup | 
                         biggerData$group2 == thisGroup, ]

  dfTemp$lastFour <-
    sapply(dfTemp$dates, function(x){
      sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
    })
  return(dfTemp)

}) 

names(sepGroupCounts) <- groupsInData

This returns a data.frame like the one above for each group in the data.
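
Because the larger sample is random, the output is hard to check by eye. As a sketch, the same list-based idea can be run on the question's seven rows, where the expected counts are known; `perGroup` and `groups` are names made up for this illustration.

```r
# The question's sample data, rebuilt so this block runs on its own
df1 <- data.frame(
  dates  = as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                     "2011-10-21", "2011-10-22", "2011-10-24")),
  group1 = c("A", "C", "A", "A", "L", "F", "A"),
  group2 = c("D", "A", "B", "H", "A", "A", "E"),
  stringsAsFactors = FALSE
)

groups <- sort(unique(c(df1$group1, df1$group2)))

# One data.frame per group, named by group, each with its own rolling count
perGroup <- lapply(setNames(groups, groups), function(g) {
  sub <- df1[df1$group1 == g | df1$group2 == g, ]
  sub$lastFour <- vapply(seq_len(nrow(sub)), function(i) {
    sum(sub$dates <= sub$dates[i] & sub$dates >= sub$dates[i] - 4)
  }, integer(1))
  sub
})

perGroup[["A"]]$lastFour
# 1 1 2 3 2 3 3
perGroup[["C"]]$lastFour
# 1
```

Group "A" appears in every row, so its counts match the first result above, while "C" only has a single event.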

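And, I couldn't help myself, so here is a dplyr and tidyr solution as well. It isn't much different from the list-based solution above, except that it returns everything in a single data.frame (which may or may not be a good thing, particularly since each event ends up with two entries this way).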

First, for simplicity, I define a function to do the date checking. It could just as easily be used in the approach above.

myDateCheckFunction <- function(x){
  sapply(x, function(thisX){
    sum(x <= thisX & x >= thisX - 4 )
  })
}

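Next, I construct a set of logical tests to determine whether each group is present. These are used to generate a column for each group, giving TRUE/FALSE for present/absent in each event.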

# %>% and the verbs used below come from dplyr and tidyr
library(dplyr)
library(tidyr)

dotsConstruct <-
  paste0("group1 == '", groupsInData, "' | "
         , "group2 == '", groupsInData, "'") %>%
  setNames(groupsInData)
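
To make it concrete what these strings look like, here is a minimal sketch with two groups and a tiny made-up data frame (`toyGroups`, `toyTests`, and `toyData` are illustration names, not part of the answer). Each element is R code stored as text, so it has to be parsed before it can be evaluated against the data:

```r
toyGroups <- c("A", "B")

# One test per group, named by the group it checks for
toyTests <- setNames(
  paste0("group1 == '", toyGroups, "' | group2 == '", toyGroups, "'"),
  toyGroups
)
toyTests[["A"]]
# "group1 == 'A' | group2 == 'A'"

toyData <- data.frame(group1 = c("A", "C", "B"),
                      group2 = c("D", "A", "B"),
                      stringsAsFactors = FALSE)

# Parse the text into an expression and evaluate it using toyData's columns
presentA <- eval(parse(text = toyTests[["A"]]), envir = toyData)
presentA
# TRUE TRUE FALSE
```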

Finally, put it all together in a single piped call. Rather than describing it, I have commented each step.

withLastFour <-
  # Start with data
  biggerData %>%
  # Add a col for each group using Standard Evaluation
  # (note: mutate_() has been deprecated since dplyr 0.7.0)
  mutate_(.dots = dotsConstruct) %>%
  # convert to long form; one row per group per event
  gather(GroupAnalyzed, Present, -dates, -group1, -group2) %>%
  # Limit to only rows where the `GroupAnalyzed` is present
  filter(Present) %>%
  # Remove the `Present` column, as it is now all "TRUE"
  select(-Present) %>%
  # Group by the groups we are analyzing
  group_by(GroupAnalyzed) %>%
  # Add the column for count in the last four dates
  # `group_by` limits this to just counts within that group
  mutate(lastFour = myDateCheckFunction(dates)) %>%
  # Sort by group and date for prettier checking
  arrange(GroupAnalyzed, dates)

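The result is similar to the list output above, except that everything is in one data.frame, which can make it easier to analyze certain features. The top looks like this: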

       dates group1 group2 GroupAnalyzed lastFour
      <date> <fctr> <fctr>         <chr>    <int>
1 2011-10-01      B      A             A        1
2 2011-10-02      J      A             A        2
3 2011-10-05      C      A             A        5
4 2011-10-05      C      A             A        5
5 2011-10-05      G      A             A        5
6 2011-10-08      E      A             A        5

Note that my random sample had multiple events on Oct-05, which leads to the large counts here.
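
For readers on a more recent tidyverse, a similar long-format result can probably be reached without building the test strings at all. This is a sketch rather than a drop-in replacement: it assumes tidyr 1.0.0 or later for `pivot_longer`, uses the question's small data so the counts can be checked, and `countLastFour`, `slot`, and `longCounts` are names invented here. Unlike the pipeline above, it drops the original group1/group2 columns, and an event listing the same group in both columns would produce two rows for that group.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  dates  = as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                     "2011-10-21", "2011-10-22", "2011-10-24")),
  group1 = c("A", "C", "A", "A", "L", "F", "A"),
  group2 = c("D", "A", "B", "H", "A", "A", "E"),
  stringsAsFactors = FALSE
)

# Same window check as before: current day plus the previous 4 days
countLastFour <- function(x) {
  vapply(seq_along(x), function(i) sum(x <= x[i] & x >= x[i] - 4), integer(1))
}

longCounts <- df1 %>%
  # one row per event per group slot (group1 or group2)
  pivot_longer(c(group1, group2),
               names_to = "slot", values_to = "GroupAnalyzed") %>%
  # count within each group separately
  group_by(GroupAnalyzed) %>%
  mutate(lastFour = countLastFour(dates)) %>%
  ungroup() %>%
  arrange(GroupAnalyzed, dates)

longCounts$lastFour[longCounts$GroupAnalyzed == "A"]
# 1 1 2 3 2 3 3
```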

Answer 2

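I think, but am not sure, that you're looking for a way to count the occurrences of each event type (letter) on each date (row) and the preceding four days, whether or not those preceding days appear in your data. If that's right, here is one way to do it using dplyr (for general convenience), tidyr (to reshape the wide data so it is easier to count by date), and zoo (for its rollapply function).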

library(dplyr)
library(tidyr)
library(zoo)

df2 <- df1 %>%
  # make the wide data long so we can group and then count by date
  gather(key = group, value = event, group1:group2) %>%
  # group by date
  group_by(dates) %>%
  # count occurrences of the event of interest on each date
  summarise(sum.a = sum(event == "A")) %>%
  # join that set of counts to a complete date sequence
  left_join(data.frame(dates = seq(min(df1$dates), max(df1$dates), by = "day")), .,
            by = "dates") %>%
  # use rollapply to get sums of those counts across rolling windows that
  # are 5 days wide (the current day plus the previous 4) and right-aligned
  mutate(sum.a = rollapply(sum.a, width = 5, sum, na.rm = TRUE,
                           partial = TRUE, align = "right")) %>%
  # filter back to the original set of dates in df1
  filter(dates %in% df1$dates)

Result:

> df2
       dates sum.a
1 2011-10-09     1
2 2011-10-15     1
3 2011-10-16     2
4 2011-10-18     3
5 2011-10-21     2
6 2011-10-22     3
7 2011-10-24     3
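
One detail worth checking when adapting this: in rollapply, width counts the observations in the window, including the current one. On a series with one row per calendar day, a right-aligned window that covers the current day plus the previous k days therefore needs width = k + 1. A small sketch with made-up daily counts (`dailyCounts` and `rolled` are hypothetical names) shows the behaviour for a window of the current day plus the previous 2 days:

```r
library(zoo)

# Made-up counts for 7 consecutive days
dailyCounts <- c(1, 0, 0, 1, 1, 0, 1)

# Current day plus the previous 2 days => width = 3.
# partial = TRUE lets the first windows use fewer than 3 values.
rolled <- rollapply(dailyCounts, width = 3, FUN = sum,
                    partial = TRUE, align = "right")
rolled
# 1 1 1 1 2 2 2
```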

r

dplyr

rollapply

data.table