How to identify customers who placed their next order before delivery/receipt of an earlier order, in R

I have a large database containing two dates. As an example, take the 'Orders' sheet of the superstore data (http://www.tableau.com/sites/default/files/training/global_superstore.zip).

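One date is the order date and the other is the shipping/delivery date (assume it is the delivery date). I want the details of all orders from customers who placed a next order without waiting for the shipping/delivery of any one of their earlier orders.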

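For example, the customer with ID 'ZC-21910' placed the order with ID CA-2014-133928 on 12 June 2014, and it was shipped on 18 June 2014. However, the same customer placed the next order, with ID 'IT-2014-3511710', on 13 June 2014, i.e. before 18 June 2014 (the ship date of one of the earlier orders).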

I want to filter out all such orders (order IDs), preferably into a separate dataframe/table.

How can this be done in R?

Sample dataset

> dput(df)
structure(list(customer_id = c("A", "A", "A", "B", "B", "C", 
"C"), order_id = structure(1:7, .Label = c("1", "2", "3", "4", 
"5", "6", "7"), class = "factor"), order_date = structure(c(17897, 
17901, 17912, 17901, 17902, 17903, 17905), class = "Date"), ship_date = structure(c(17926, 
17906, 17914, 17904, 17904, 17904, 17906), class = "Date")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

Below is how I would build this workflow in R. Note that replicating this functionality in Tableau would be very difficult.

# Install packages if they are not already installed: necessary_packages => vector
necessary_packages <- c("readxl")

# Create a vector containing the names of any packages needing installation:
# new_packages => vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[, "Package"])]

# If the vector has more than 0 values, install the new packages
# (and their associated dependencies):
if(length(new_packages) > 0){install.packages(new_packages, dependencies = TRUE)}

# Initialise the packages in the session:
lapply(necessary_packages, require, character.only = TRUE)

# Store a scalar of the link to the data: durl => character scalar
durl <- "http://www.tableau.com/sites/default/files/training/global_superstore.zip"

# Store the path to the temporary directory: tmpdir_path => character scalar
tmpdir_path <- tempdir()

# Store a character scalar denoting the link to the zipped directory
# that is to be created: zip_path => character scalar
zip_path <- paste0(tmpdir_path, "/tableau.zip")

# Store a character scalar denoting the link to the unzipped directory
# that is to be created: unzip_path => character scalar
unzip_path <- paste0(tmpdir_path, "/global_superstore")

# Download the zip file: global_superstore.zip => stdout (zip_path)
download.file(durl, zip_path)

# Unzip the file into the unzip directory: tableau.zip => stdout (global_superstore)
unzip(zipfile = zip_path, exdir = unzip_path)

# Read in the excel file: df => data.frame
df <- read_xls(normalizePath(list.files(unzip_path, full.names = TRUE)))

# Regex the vector names to fit with R convention (the backslash must be
# escaped inside an R string): names(df) => character vector
names(df) <- gsub("\\W+", "_", tolower(trimws(names(df), "both")))

# Split the data.frame into a list by customer_id (split() builds the list
# itself, so no pre-allocation is needed): df_list => list
df_list <- with(df, split(df, customer_id))

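# Sort the data (by date) and test whether or not each customer waited for their 
# earlier orders before ordering again: orders_prior_to_delivery => data.frame
# Note: rows are compared as-is, so if the data holds one row per line item,
# de-duplicate to one row per order first.
orders_prior_to_delivery <- data.frame(do.call("rbind", Map(function(x){
  # Order the data.frame by order date: y => data.frame
  y <- x[order(x$order_date),]
  # Latest ship date among all earlier rows only, so that an order is never
  # compared with its own or a later ship date: prev_ship => numeric vector
  prev_ship <- c(-Inf, cummax(as.numeric(y$ship_date))[-nrow(y)])
  # Return only the observations where the customer didn't wait: => data.frame
  y[as.numeric(y$order_date) < prev_ship,]
}, 
df_list)), row.names = NULL, stringsAsFactors = FALSE)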

# Unique customers and orders that were ordered prior to shipping the 
# previous order: cust_orders_prior_to_delivery => data.frame
cust_orders_prior_to_delivery <- 
  unique(orders_prior_to_delivery[,c("order_id", "customer_id")])
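As a quick check of the idea on the sample data from the question, here is a minimal base R sketch. It assumes the sample columns `customer_id`, `order_id`, `order_date` and `ship_date`, with one row per order; `flag_early` is a hypothetical helper name, not part of any package. Each order date is compared with the latest ship date among that customer's earlier orders.

```r
# Rebuild the question's sample data (dates given as days since 1970-01-01)
df <- data.frame(
  customer_id = c("A", "A", "A", "B", "B", "C", "C"),
  order_id    = as.character(1:7),
  order_date  = as.Date(c(17897, 17901, 17912, 17901, 17902, 17903, 17905),
                        origin = "1970-01-01"),
  ship_date   = as.Date(c(17926, 17906, 17914, 17904, 17904, 17904, 17906),
                        origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the orders of one customer that were placed
# before the latest ship date of any of that customer's earlier orders
flag_early <- function(x) {
  y <- x[order(x$order_date), ]
  # Running maximum of ship dates, shifted by one row so that each order
  # only sees the orders placed before it
  prev_ship <- c(-Inf, cummax(as.numeric(y$ship_date))[-nrow(y)])
  y[as.numeric(y$order_date) < prev_ship, ]
}

# Apply per customer and bind the results into one data.frame
early <- do.call("rbind", lapply(split(df, df$customer_id), flag_early))
rownames(early) <- NULL

early$order_id
# "2" "3" "5"
```

Customer C does not appear because order 7 was placed after order 6 had shipped.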

Edit: my earlier answer did not correctly handle the case where order date == ship date.

I assume you have already loaded the data into an object named df. You can use the first part of @hello_friend's code to get it.

library(tidyverse)
df %>% 
  distinct(`Customer ID`, `Order ID`, `Order Date`, `Ship Date`) %>% 
  arrange(`Customer ID`, `Order Date`, `Ship Date`) %>% 
  mutate(sort_key = row_number()) %>% 
  pivot_longer(c(`Order Date`, `Ship Date`), names_to = "Activity", names_pattern = "(.*) Date", values_to = "Date") %>% 
  mutate(Activity = factor(Activity, ordered = TRUE, levels = c("Order", "Ship")), 
         Open = if_else(Activity == "Order", 1, -1)) %>% 
  group_by(`Customer ID`) %>% 
  arrange(Date, sort_key, Activity, .by_group = TRUE) %>% 
  mutate(Open = cumsum(Open)) %>% 
  ungroup %>% 
  filter(Open > 1, Activity == "Order") %>% 
  select(`Customer ID`, `Order ID`)

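First, take only the distinct orders and customer IDs; otherwise multiple items from the same order get mixed up and lead to incorrect results. Then pivot the data so that each order becomes two rows, each representing a different activity: ordering or shipping. We then create a running total of open orders. What you are looking for is when it becomes two or more.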

I use an ordered factor for Activity to make sure I always open an order before closing it. This matters when the order date and ship date are the same.

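I use a special sort_key column to make sure I close the old order before opening a new one, in case the customer orders something else on the same day. You may want the opposite logic.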
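The running-total idea can also be sketched in base R on the question's sample data. This is only an illustration under assumptions: it uses the sample column names (`customer_id`, `order_id`, `order_date`, `ship_date`) rather than the spreadsheet's, assumes one row per order, and encodes the activity as 1 (order) and 2 (ship) instead of an ordered factor.

```r
# Sample data from the question (dates given as days since 1970-01-01)
df <- data.frame(
  customer_id = c("A", "A", "A", "B", "B", "C", "C"),
  order_id    = as.character(1:7),
  order_date  = as.Date(c(17897, 17901, 17912, 17901, 17902, 17903, 17905),
                        origin = "1970-01-01"),
  ship_date   = as.Date(c(17926, 17906, 17914, 17904, 17904, 17904, 17906),
                        origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

# Sort orders and give each one a key used to break same-day ties
df <- df[order(df$customer_id, df$order_date, df$ship_date), ]
df$sort_key <- seq_len(nrow(df))

# One row per event: an order opens (+1), a shipment closes (-1)
ids <- df[c("customer_id", "order_id", "sort_key")]
events <- rbind(
  data.frame(ids, activity = 1L, date = df$order_date, change = 1L),
  data.frame(ids, activity = 2L, date = df$ship_date,  change = -1L)
)

# Same ordering as in the pipeline: date, then sort_key, then activity
events <- events[order(events$customer_id, events$date,
                       events$sort_key, events$activity), ]

# Running number of open orders per customer
events$open <- ave(events$change, events$customer_id, FUN = cumsum)

# Orders placed while another order was still open
overlapping <- events[events$open > 1 & events$activity == 1L,
                      c("customer_id", "order_id")]
overlapping$order_id
# "2" "3" "5"
```

Swapping `sort_key` and `activity` in the `order()` call would give the opposite same-day behaviour, where a new order placed on the day an older one ships still counts as overlapping.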

All of this assumes that a given customer ID and order ID appear only once in the data, which is actually not true in your dataset, as you can see:

df %>% group_by(`Customer ID`, `Order ID`) %>% filter(n_distinct(`Ship Date`)> 1) %>% select(1:9)
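If such duplicates need to be resolved before counting open orders, one possible approach is to collapse each order to a single row. The sketch below is an assumption rather than part of the answer: it uses made-up rows, snake_case column names, and the choice to treat an order as closed at its latest ship date.

```r
# Made-up rows: order "1" has two different ship dates
orders <- data.frame(
  customer_id = c("A", "A", "A", "B"),
  order_id    = c("1", "1", "2", "3"),
  order_date  = as.Date(c("2019-01-01", "2019-01-01", "2019-01-05", "2019-01-02")),
  ship_date   = as.Date(c("2019-01-03", "2019-01-04", "2019-01-06", "2019-01-03")),
  stringsAsFactors = FALSE
)

# Put the latest ship date first within each customer/order pair ...
orders <- orders[order(orders$customer_id, orders$order_id,
                       -as.numeric(orders$ship_date)), ]

# ... and keep only that first row per pair
collapsed <- orders[!duplicated(orders[c("customer_id", "order_id")]), ]
collapsed
```

Keeping the earliest ship date instead only requires dropping the minus sign in the `order()` call.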