Identify the date a value changes and summarize the data with sum() and diff() in R

Sample data:

    product_id <- c("1000", "1000", "1000", "1000", "1000", "1000", "1002", "1002", "1002", "1002", "1002", "1002")
    qty_ordered <- c(1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1)
    price <- c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49, 2.093, 2.093, 2.11, 2.11, 2.11, 2.97)
    date <- c("2/23/15", "2/23/15", "3/16/15", "3/16/15", "5/16/15", "6/18/15", "2/19/15", "3/19/15", "3/19/15", "3/19/15", "3/19/15", "4/19/15")
    sampleData <- data.frame(product_id, qty_ordered, price, date)

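I want to identify each time the price changes. I also want to sum qty_ordered between each pair of consecutive price-change dates. For example, for product_id == "1000", the price changed from $2.49 to $1.743 on 3/16/15. The total qty_ordered up to that change is 1+2+1=4, and the difference between those two earliest price-change dates, 2/23/15 to 3/16/15, is 21 days.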

So the new data frame should be:

product_id  sum_qty_ordered  price  date_diff
1000        4                2.490  21
1000        1                1.743  61
1000        2                2.490  33
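
The date_diff column can be checked with plain date arithmetic. The following is only a small sketch using the four distinct dates of product "1000" from the sample data; the variable names d and gaps are introduced here for illustration.

```r
# Parse the m/d/yy strings from the sample data (product "1000").
d <- as.Date(c("2/23/15", "3/16/15", "5/16/15", "6/18/15"), format = "%m/%d/%y")

# Days between consecutive dates; diff() on Date values returns a difftime in days.
gaps <- as.numeric(diff(d))
print(gaps)
```

The three gaps correspond to the three rows of the table above.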

Here is what I have tried:

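Note: a simple "dplyr::group_by" will not work for this case, because it ignores the effect of the dates. Grouping by product_id and price would merge the two separate runs at 2.49 for product "1000".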
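
To make that concrete, here is a minimal sketch of the problem. It uses base R's aggregate() as a stand-in for dplyr::group_by() followed by summarise(); that substitution is an assumption made only to keep the example free of dependencies, and the names df, naive and naive_1000 are introduced here for illustration.

```r
# Rebuild the relevant columns of the sample data.
df <- data.frame(
  product_id  = rep(c("1000", "1002"), each = 6),
  qty_ordered = c(1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1),
  price       = c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49,
                  2.093, 2.093, 2.11, 2.11, 2.11, 2.97),
  stringsAsFactors = FALSE
)

# Grouping by product and price alone ignores the order of the rows,
# so rows with the same price end up in one group even when another
# price sits between them in time.
naive <- aggregate(qty_ordered ~ product_id + price, data = df, FUN = sum)
naive_1000 <- naive[naive$product_id == "1000", ]
print(naive_1000)
```

For product "1000" this gives two rows instead of the three I want: the 2.49 entries before and after the drop to 1.743 are collapsed into one group with a total quantity of 6.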

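1) I found this code in "Determine when columns of a data.frame change value and return indices of the change". It flags each time the price changes, i.e. it identifies the first row (and therefore the first date) of each new price for each product.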

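# drop = FALSE keeps the price column as a data frame, so sapply() applies diff() to the whole column
IndexedChanged <- c(1, which(rowSums(sapply(sampleData[, 3, drop = FALSE], diff)) != 0) + 1)
sampleData[IndexedChanged, ]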

However, if I use that code, I am not sure how to calculate sum(qty_ordered) and the date difference for each entry.

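2) I tried to write a WHILE loop that temporarily stores each batch of product_id, price, and date range (i.e. a subset of the data frame with one product_id, one price, and all entries from the earliest date of that price to the last date before the price changes), and then summarizes that subset to get the sum (sum_qty_ordered) and the date difference. However, I always get confused by WHILE and FOR, so my code has some problems. Here is my code: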

Create an empty data frame for later data storage:

 NewData_Ready <- data.frame(
                     product_id = character(),
                     price = double(),
                     early_date = as.Date(character()),
                     last_date=as.Date(character()),
                     total_qty_demanded = double(),                          
                     stringsAsFactors=FALSE) 

Create a temporary table to store each batch of order entries that share a price:

 temp_dataset <- data.frame(
                     product_id = character(),
                     qty_ordered = double(),
                     price = double(),
                     date=as.Date(character()),                                  
                     stringsAsFactors=FALSE) 

The loop: this is messy and probably does not make sense, so I really do need help with this part.

for ( i in unique(sampleData$product_id)){
    #for each unique product_id in the dataset, we are gonna loop through it based on product_id
    #for first product_id which is "1000"
    temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
    #this dataset only has product of "1000" entries

    #starting a new for loop to loop through the entire entries for this product
    for ( p in 1:length(temp_table$product_id)){

        current_price <- temp_table$price[p] #assign current_price to the first price value
        #assign 2.49 to current price
        min_date <- temp_table$date[p] #assign the first date when the first price change
        #assign 2015-2-23 to min_date which is the earliest date when price is 2.49

        while (current_price == temp_table$price[p+1]){
        #while the next price is the same as the first price 
        #that is, if the second price of 2.49 is the same as the first price of 2.49, which is TRUE
        #then execute the following statement

            temp_dataset <- rbind(temp_dataset, temp_table[p,])
            #if the WHILE loop is TRUE, means every 2 entries have the same price
            #then combine each entry when price is the same in temp_table with the temp_dataset

            #if the WHILE loop is FALSE, means one entry's price is different from the next one
            #then stop the statement at the above, but do the following
            current_price <- temp_table$price[p+1]
            #this will reassign the current_price to the next price, and restart the WHILE loop

            by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
            NewRow <- dplyr::summarise(
                                early_date = min(date),
                                last_date = max(date),
                                total_qty_demanded = sum(qty_ordered))
            NewData_Ready <- rbind(NewData_Ready, NewRow)



        }
    }

}

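I have searched many related questions but have not found anything that addresses this problem. If you have any suggestions or advice on a solution, please let me know. Thank you very much for your time and help!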

Here is my R version:
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          3.1                         
year           2016                        
month          06                          
day            21                          
svn rev        70800                       
language       R                           
version.string R version 3.3.1 (2016-06-21)
nickname       Bug in Your Hair      

Using data.table:

library(data.table)
setDT(sampleData)

Some preprocessing:

sampleData[, firstdate := as.Date(date, "%m/%d/%y")]

Given the way you calculate the date differences, it is best to create a date range for each row:

sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
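
To show what the "lead" shift does, here is a small base R sketch for the rows of product "1000" only. It is an illustration rather than the data.table code itself: each row's lastdate is the next row's firstdate, and the final row falls back to its own date, which mirrors the fill step above.

```r
# firstdate values of product "1000", already in row order.
firstdate <- as.Date(c("2/23/15", "2/23/15", "3/16/15",
                       "3/16/15", "5/16/15", "6/18/15"), "%m/%d/%y")
n <- length(firstdate)

# Shift up by one position; the last row has no successor, so reuse its own date.
lastdate <- c(firstdate[-1], firstdate[n])
print(data.frame(firstdate, lastdate))
```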

Then create a new ID for each price change:

sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
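
As a quick illustration of how that expression works, here it is applied to the prices of product "1000" on their own (a sketch outside data.table; the intermediate name changed is introduced here for readability).

```r
price <- c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49)  # product "1000"

# TRUE wherever a price differs from the one before it.
changed <- diff(price) != 0

# Prepend 0 for the first row, then count the changes seen so far.
price_id <- cumsum(c(0, changed))
print(rbind(price, price_id))
```

The second run at 2.49 gets its own ID (2) instead of being merged with the first run (0), which is exactly what grouping by price alone cannot do. data.table's rleid() builds a comparable run ID, numbered from 1.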

Then compute the grouped summaries for each product and price run:

sampleData[,
           .(
             price = unique(price),
             sum_qty = sum(qty_ordered),
             date_diff = max(lastdate) - min(firstdate)
           ),
           by = .(
             product_id,
             price_id
           )
           ]

   product_id price_id price sum_qty date_diff
1:       1000        0 2.490       4   21 days
2:       1000        1 1.743       1   61 days
3:       1000        2 2.490       2   33 days
4:       1002        0 2.093       3   28 days
5:       1002        1 2.110       4   31 days
6:       1002        2 2.970       1    0 days

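I think the last price run for 1000 spans only 33 days, and the one before it 61 (not 60). If you count the first day as well, the values are 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1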
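
If data.table is not available, the same steps can be sketched in base R. This is an illustration under assumptions rather than the method above: summarise_runs() and its inclusive argument are hypothetical names introduced here, and the rows are assumed to be already sorted by date within each product.

```r
sampleData <- data.frame(
  product_id  = rep(c("1000", "1002"), each = 6),
  qty_ordered = c(1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1),
  price       = c(2.49, 2.49, 2.49, 1.743, 2.49, 2.49,
                  2.093, 2.093, 2.11, 2.11, 2.11, 2.97),
  date        = c("2/23/15", "2/23/15", "3/16/15", "3/16/15", "5/16/15", "6/18/15",
                  "2/19/15", "3/19/15", "3/19/15", "3/19/15", "3/19/15", "4/19/15"),
  stringsAsFactors = FALSE
)
sampleData$firstdate <- as.Date(sampleData$date, "%m/%d/%y")

# Hypothetical helper: summarise the price runs of ONE product.
# inclusive = TRUE counts the first day as well (the "+ 1" variant).
summarise_runs <- function(d, inclusive = FALSE) {
  n <- nrow(d)
  lastdate <- c(d$firstdate[-1], d$firstdate[n])   # lead shift with fill
  price_id <- cumsum(c(0, diff(d$price) != 0))     # run ID per price change
  runs <- lapply(split(seq_len(n), price_id), function(idx) {
    data.frame(
      product_id = d$product_id[idx[1]],
      price_id   = price_id[idx[1]],
      price      = d$price[idx[1]],
      sum_qty    = sum(d$qty_ordered[idx]),
      date_diff  = as.numeric(max(lastdate[idx]) - min(d$firstdate[idx])) + inclusive,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, runs)
}

# Apply the helper to each product and stack the results.
by_product <- split(sampleData, sampleData$product_id)
result <- do.call(rbind, lapply(by_product, summarise_runs))
rownames(result) <- NULL
print(result)

# Same summary, counting the first day of each run as well.
result_inclusive <- do.call(rbind, lapply(by_product, summarise_runs, inclusive = TRUE))
rownames(result_inclusive) <- NULL
```

Here date_diff is returned as a plain number of days rather than a difftime, which is a choice made for this sketch to keep the column easy to compare and print.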