Find all transactions of the same material within a time window before each observation in R
I have a database of work orders in which materials are reserved for different plants.
For each observation, I need to find out how many times the same material was requested in a time window before that observation.
I tried the approach below, but for a data frame of 7 million rows it would take me 300 months.
result <- data.frame()
days.back <- 30  # time window for searching, in days

for (i in 1:nrow(work.orders)) {
  # Date, material and plant of the current observation
  wo.date_f <- work.orders$original_basic_start_date[i] %>% as_date()
  mat_f     <- work.orders$material[i]
  plant_f   <- work.orders$plant[i]

  # Count earlier orders of the same material and plant inside the window
  total_resb_found <- work.orders %>%
    filter(plant == plant_f,
           material == mat_f,
           as_date(original_basic_start_date) < wo.date_f,
           wo.date_f - as_date(original_basic_start_date) < days.back) %>%
    nrow()

  result[i, 1] <- mat_f
  result[i, 2] <- plant_f
  result[i, 3] <- total_resb_found
  print(paste0(i, " out of ", nrow(work.orders)))
}
So the approach is:
- Look up the date, material and plant of row 1.
- Filter the main data frame for the material and plant found in the previous step.
- Filter the data frame for dates before the date found in step 1 and no further back than the time window before that date.
- Count how many rows were found.
- Record the result in a data frame.
I know this is a brute-force approach, so there is definitely plenty of room for improvement, but I have not been able to come up with a good one.
Any ideas on a more efficient approach?
Thanks
Edit: added sample data with the desired result (as the Result column):
structure(list(material = c("000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010339762",
"000000000010199498", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010199498", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010199498",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010199498",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010199498", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010199498",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010199498",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010199498", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762", "000000000010199498", "000000000010199498",
"000000000010339762", "000000000010339762", "000000000010339762",
"000000000010339762"), original_basic_start_date = c("20201106",
"20200702", "20200618", "20200923", "20201205", "20201118", "20201231",
"20201022", "20200819", "20200823", "20201118", "20201106", "20201106",
"20200826", "20201123", "20201111", "20200912", "20200930", "20200916",
"20200717", "20200702", "20200929", "20201016", "20200624", "20201123",
"20201105", "20200620", "20200626", "20200608", "20200712", "20200616",
"20201209", "20200904", "20210130", "20201123", "20201117", "20200620",
"20210311", "20200812", "20201130", "20200816", "20200802", "20200805",
"20200705", "20200731", "20201123", "20200703", "20200926", "20200718",
"20201105", "20200618", "20200815", "20200609", "20200729", "20210112",
"20201121", "20210110", "20201008", "20200705", "20200902", "20200918",
"20200713", "20200828", "20200722", "20210126", "20200906", "20200902",
"20210106", "20201229", "20210117", "20210321", "20210204", "20201005",
"20210109", "20210106", "20210526", "20210203", "20201001", "20201206",
"20210221", "20210103", "20210112", "20200719", "20201019", "20200707",
"20210110", "20210219", "20200710", "20210331", "20210103", "20201222",
"20201027", "20210116", "20210130", "20200709", "20210120", "20210208",
"20210220", "20201106", "20210205", "20201223", "20210211", "20210303",
"20210428", "20200724", "20210203", "20200831", "20210207"),
plant = structure(c(16L, 34L, 16L, 16L, 21L, 20L, 10L, 15L,
25L, 37L, 20L, 37L, 37L, 10L, 17L, 25L, 32L, 39L, 25L, 16L,
37L, 17L, 19L, 25L, 27L, 16L, 39L, 34L, 17L, 19L, 17L, 25L,
16L, 17L, 16L, 16L, 36L, 17L, 16L, 27L, 16L, 37L, 16L, 16L,
17L, 16L, 39L, 32L, 19L, 27L, 19L, 16L, 16L, 19L, 20L, 15L,
39L, 16L, 16L, 16L, 16L, 16L, 16L, 25L, 16L, 16L, 19L, 25L,
19L, 27L, 17L, 33L, 32L, 33L, 15L, 25L, 25L, 16L, 19L, 21L,
16L, 20L, 16L, 16L, 37L, 16L, 16L, 37L, 25L, 27L, 37L, 32L,
37L, 37L, 19L, 21L, 20L, 17L, 27L, 47L, 16L, 16L, 25L, 25L,
16L, 32L, 16L, 20L), .Label = c("B201", "B21B", "B401", "B501",
"B504", "B521", "B701", "B71A", "B991", "C602", "C603", "C690",
"CS01", "CY01", "CY02", "CY04", "CY05", "CY06", "CY07", "CY08",
"CY09", "CY11", "CY12", "CY13", "CY16", "CY21", "CY30", "CY91",
"CY95", "D106", "D192", "FX01", "FX03", "FX05", "FX06", "FX07",
"FX10", "FX1A", "FX1C", "FX1E", "FX21", "FX5A", "FX5B", "FX5C",
"FX92", "FX94", "KB01", "PA02", "PA04", "PA05", "PA12", "PB1A",
"PB51", "PI01", "PI03", "PI08", "PI0A", "PI0B", "PI0F", "PN9A",
"PN9B", "PN9D", "PN9E", "PP9A", "PR90", "PR92", "PT01", "PT02",
"PT07", "PT08", "S501", "S502", "S503", "S504", "S505", "S507",
"S50I", "S516", "S517", "S593", "U201"), class = "factor"),
Result = c(23L, 1L, 1L, 18L, 0L, 0L, 0L, 0L, 2L, 4L, 0L,
5L, 5L, 0L, 4L, 4L, 0L, 2L, 3L, 5L, 0L, 3L, 6L, 0L, 2L, 22L,
0L, 0L, 0L, 2L, 1L, 5L, 15L, 5L, 25L, 24L, 0L, 7L, 9L, 3L,
11L, 3L, 8L, 2L, 2L, 25L, 1L, 1L, 3L, 0L, 0L, 10L, 0L, 4L,
2L, 1L, 3L, 20L, 2L, 14L, 17L, 4L, 12L, 1L, 30L, 16L, 5L,
6L, 8L, 5L, 8L, 0L, 2L, 0L, 2L, 11L, 7L, 19L, 7L, 2L, 28L,
2L, 6L, 21L, 1L, 29L, 32L, 2L, 9L, 4L, 7L, 3L, 0L, 8L, 1L,
1L, 5L, 6L, 1L, 0L, 27L, 31L, 8L, 10L, 7L, 4L, 13L, 4L)), .Names = c("material",
"original_basic_start_date", "plant", "Result"), row.names = c(NA,
-108L), class = "data.frame")
Here is an attempt:
library(data.table)
library(lubridate)  # for now()

plant.pool <- 1:4
material.pool <- 1:5
date.pool <- seq(as.Date('2010/01/01'), as.Date(now()), by="day")
n.out <- 7e6

## Generate 7 mill records
d <- data.table(
  plant = sample( plant.pool, size=n.out, replace=TRUE ),
  material = sample( material.pool, size=n.out, replace=TRUE ),
  original_basic_start_date = sample( date.pool, size=n.out, replace=TRUE ),
  id = 1:n.out
)[ order(original_basic_start_date) ]

window.size <- 30

## Do the actual rolling calculation, and time it, demonstrating the first 100k only
system.time( the.counts <- frollapply(
  d[ 1:1e5, .(id) ],
  window.size,
  function(x){
    d[ id %in% x, sum( plant==last(plant) & material==last(material) ) ]
  }
))
As you can see, I deliberately process and time only the first 100k records.
After a few runs on my aging workstation (8-9 years old, 8 cores), it takes about 17 seconds to handle those 100k records. That extrapolates to roughly 3.3 hours, which is a good deal faster than 300 months. Your computer is probably faster than mine too, which makes everything better. Once you have confirmed that yours runs at that speed or faster (time 100k rows like I did), you can kick it off and come back to check around the time it should be done.
data.table is usually where to go for big things that need to be faster. There may be more suitable functions out there (you might even have to look at python or C), though I did not find a solution among its optimized rolling functions (frollsum and friends) that can look at both the material and the plant column at the same time.
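In case it helps, here is an untested sketch of one way around that limitation: a data.table non-equi self-join that counts, for every row, the earlier rows with the same plant and material, working by calendar days rather than by row count. It assumes d and window.size as defined above; n_prior is just a name I made up.

# For each row of d, count earlier rows with the same plant & material
# whose date falls within [date - window.size, date)
lookup <- d[, .(plant, material,
                lo = original_basic_start_date - window.size,
                hi = original_basic_start_date)]
counts <- d[lookup,
            on = .(plant, material,
                   original_basic_start_date >= lo,
                   original_basic_start_date < hi),
            .(n_prior = sum(!is.na(x.id))),  # robust 0 when nothing matches
            by = .EACHI]
d[, n_prior := counts$n_prior]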
Edit:
You added the data after my first attempt, but I believe the principle of my approach still holds.
Edit 2:
Also, frollapply does not handle the first 1:window.size rows; you will have to take care of those yourself (which should be a fairly trivial thing, seeing as you now have 300 months minus a few hours on your hands).
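A hypothetical sketch of that cleanup, assuming d sorted by date as above: each of those first rows has fewer than window.size predecessors, so they can simply be counted directly.

# Count matching predecessors for rows 1..(window.size - 1); subtract
# 1L because row k matches itself inside the subset d[1:k]
head.counts <- sapply(seq_len(window.size - 1), function(k) {
  d[1:k][plant == d$plant[k] & material == d$material[k], .N] - 1L
})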
The problem with your current approach, the one that causes the time blow-up, is these lines of code:
result[i, 1] <- mat_f
result[i, 2] <- plant_f
result[i, 3] <- total_resb_found
Appending a row to a matrix or a data frame in R is very inefficient and requires a lot of extra work. It is quick while the row count is small, but once it passes a certain threshold the cost multiplies with every new row. So your estimate of 300 months may even be a little optimistic :)
This is personal experience from testing all sorts of things; I have not really quantified it or studied it thoroughly, but I am sure it is related to memory allocation, much like the performance difference between pre-allocating memory for a 100-element array and defining an empty array and adding one item at a time.
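A minimal sketch of that difference, with toy numbers of my own choosing:

# Growing a data.frame one row at a time: every assignment copies
# everything accumulated so far
grow <- function(n) {
  result <- data.frame()
  for (i in seq_len(n)) result[i, 1] <- i
  result
}

# Pre-allocating a plain vector and building the data.frame once
prealloc <- function(n) {
  x <- integer(n)
  for (i in seq_len(n)) x[i] <- i
  data.frame(x)
}

system.time(grow(2e4))      # takes seconds, and degrades as n grows
system.time(prealloc(2e4))  # near-instant by comparison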
Updated approach
After some stress testing and further clarification, here is my updated code for better performance:
library(doParallel)
library(dplyr)
library(lubridate)
library(purrr)

# Parallel on 4 cores
registerDoParallel(4)

# Create a df which is 70000 x your sample data (70000 x 108 = 7.5M records);
# df is the sample data frame from the question
work.orders <- purrr::map_dfr(seq_len(70000), ~df)

# To test the data more properly, a random date within a span of one and a
# half years is generated and assigned to original_basic_start_date
random_dates <- as.Date(rdunif(nrow(work.orders),
                               as.integer(as.Date("2020-01-01")),
                               as.integer(as.Date("2021-06-01"))),
                        origin = "1970-01-01")
work.orders$original_basic_start_date <- random_dates
work.orders <- work.orders %>% arrange(original_basic_start_date)

# The same material & plant on the same date returns the same result, so we
# only need to calculate once for those. Filter the original data down to
# records with no duplicated date for a given material & plant pair
work.orders_unique_date <- work.orders %>%
  group_by(material, plant) %>%
  filter(!duplicated(original_basic_start_date)) %>%
  ungroup() %>%
  arrange(original_basic_start_date)

# Get all the dates present in the data
all_date_in_data <- unique(work.orders_unique_date[["original_basic_start_date"]])

# Time window for searching
days.back <- 30

# Time the performance
system.time(
  # For each date, filter the original data frame down to dates within the
  # defined window, which allows faster calculation
  all_result <- bind_rows(foreach(i_date = all_date_in_data) %do% {
    message(sprintf("%s Process date: %s", Sys.time(), i_date))
    smaller_df <- work.orders %>%
      filter(original_basic_start_date < i_date) %>%
      filter(i_date - original_basic_start_date < days.back)
    unique_item_on_date <- work.orders_unique_date %>%
      filter(original_basic_start_date == i_date)
    # For one date, calculate the result for every material & plant pair
    # present on that date
    result <- bind_rows(foreach(i = 1:nrow(unique_item_on_date)) %dopar% {
      mat_f <- unique_item_on_date$material[i]
      plant_f <- unique_item_on_date$plant[i]
      # Use the smaller df that is already filtered to the date window
      total_resb_found <- smaller_df %>%
        filter(plant == plant_f, material == mat_f) %>%
        nrow()
      tibble(date = i_date, mat_f, plant_f, total_resb_found)
    })
    result
  })
)
Some of the message output from the process above. Each date takes at most around 2 seconds, so if your 7 million records span about 2 years, it should take around an hour or even less:
2021-03-12 08:58:41 Done Process date: 2020-08-15 in 1.83 seconds
2021-03-12 08:58:42 Done Process date: 2020-08-16 in 1.66 seconds
2021-03-12 08:58:44 Done Process date: 2020-08-17 in 1.93 seconds
2021-03-12 08:58:46 Done Process date: 2020-08-18 in 1.72 seconds
2021-03-12 08:58:48 Done Process date: 2020-08-19 in 1.77 seconds
2021-03-12 08:58:50 Done Process date: 2020-08-20 in 1.74 seconds
2021-03-12 08:58:51 Done Process date: 2020-08-21 in 1.78 seconds
2021-03-12 08:58:53 Done Process date: 2020-08-22 in 1.78 seconds
2021-03-12 08:58:55 Done Process date: 2020-08-23 in 1.73 seconds
2021-03-12 08:58:57 Done Process date: 2020-08-24 in 1.94 seconds
2021-03-12 08:58:59 Done Process date: 2020-08-25 in 1.78 seconds
2021-03-12 08:59:00 Done Process date: 2020-08-26 in 1.78 seconds
2021-03-12 08:59:02 Done Process date: 2020-08-27 in 1.81 seconds
2021-03-12 08:59:04 Done Process date: 2020-08-28 in 1.84 seconds
2021-03-12 08:59:06 Done Process date: 2020-08-29 in 1.83 seconds
2021-03-12 08:59:08 Done Process date: 2020-08-30 in 1.76 seconds
2021-03-12 08:59:10 Done Process date: 2020-08-31 in 1.94 seconds
2021-03-12 08:59:11 Done Process date: 2020-09-01 in 1.78 seconds
2021-03-12 08:59:13 Done Process date: 2020-09-02 in 1.86 seconds
2021-03-12 08:59:15 Done Process date: 2020-09-03 in 1.82 seconds
2021-03-12 08:59:17 Done Process date: 2020-09-04 in 1.80 seconds
2021-03-12 08:59:19 Done Process date: 2020-09-05 in 1.88 seconds
2021-03-12 08:59:20 Done Process date: 2020-09-06 in 1.78 seconds
2021-03-12 08:59:23 Done Process date: 2020-09-07 in 2.08 seconds
2021-03-12 08:59:24 Done Process date: 2020-09-08 in 1.89 seconds
2021-03-12 08:59:26 Done Process date: 2020-09-09 in 1.88 seconds
2021-03-12 08:59:28 Done Process date: 2020-09-10 in 1.86 seconds
2021-03-12 08:59:30 Done Process date: 2020-09-11 in 1.87 seconds
The timing for the first 100 days was 60 seconds:
user system elapsed
93.281 73.571 60.310
And here is a sample of the results:
> tail(all_result, 20)
# A tibble: 20 x 4
date mat_f plant_f total_resb_found
<date> <chr> <fct> <int>
1 2020-04-08 000000000010199498 C602 13318
2 2020-04-09 000000000010339762 FX01 66596
3 2020-04-09 000000000010339762 CY04 441597
4 2020-04-09 000000000010199498 CY16 160625
5 2020-04-09 000000000010199498 FX03 13350
6 2020-04-09 000000000010199498 FX10 13418
7 2020-04-09 000000000010339762 CY07 120541
8 2020-04-09 000000000010339762 CY30 80768
9 2020-04-09 000000000010339762 FX10 120076
10 2020-04-09 000000000010339762 FX03 13498
11 2020-04-09 000000000010199498 C602 13448
12 2020-04-09 000000000010339762 FX1C 53672
13 2020-04-09 000000000010339762 CY08 80234
14 2020-04-09 000000000010339762 CY05 120682
15 2020-04-09 000000000010339762 CY09 40493
16 2020-04-09 000000000010339762 FX07 13325
17 2020-04-09 000000000010339762 CY02 40204
18 2020-04-09 000000000010339762 FX05 26671
19 2020-04-09 000000000010339762 C602 13576
20 2020-04-09 000000000010339762 KB01 13331
[Update: adding the time for processing all 7.5M records]
user system elapsed
1665.785 1096.865 891.837