Efficiently assigning a new column value based on multiple column conditions in R

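I have a data frame containing information on many seller IDs and the periods in which they made sales. I want to create a new column named Inactive that flags a seller who makes no sale in the following 6 periods.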

Here is the dput of a sample dataset:

structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)), .Names = c("SellerID", 
"Period"), row.names = c(NA, -13L), class = "data.frame")

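Here is the dput of my ideal result. Row 5 has Inactive = 1 because seller 1 made a sale in period 2 but his next sale was not until period 9 (row 10). He was therefore inactive for at least 6 periods, and we want to record that so we can predict when a seller becomes inactive: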

structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0, 
0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0)), .Names = c("SellerID", 
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")

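I tried to solve this with a nested for loop, but my dataset is large (about 200,000 rows) and the loop takes a long time to run. I also tried my approach on the sample dataset, but it does not seem to work. My approach is below: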

full.df$Inactive <- NA
for (i in 1:nrow(full.df)){
  temp = subset(full.df, SellerID = unique(full.df$SellerID[i]))
  for(j in 1:(nrow(temp) -1)){
    if(temp$Period[j+1] - temp$Period[j] <6)
      temp$Inactive[j] <-0
    else
      temp$Inactive[j] <-1
  }
  full.df[rownames(full.df) %in% rownames(temp), ]$Inactive <- temp$Inactive
}

On the dummy dataset, my approach puts a 0 in every row of "Inactive" except the last, which is NA. Here is the dput of the output I get:

structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA)), .Names = c("SellerID", 
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")

Using R --vanilla

# your input dataframe
d <- structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)), .Names = c("SellerID", 
"Period"), row.names = c(NA, -13L), class = "data.frame")

# your wanted output
o <- structure(list(SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 
5, 7), Period = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10), Inactive = c(0, 
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)), .Names = c("SellerID", 
"Period", "Inactive"), row.names = c(NA, -13L), class = "data.frame")

# 6-step solution, step by step, using vanilla R
# step 1. - add tmp key for final sorting
d$tmp.key <- seq_len(nrow(d))
# step 2. - split by individual seller id
d.tmp <- split(d,f=d$SellerID)
# step 3. - add inactive column to individual sellers
d.tmp <- lapply(d.tmp,
    function(x){
       # as.numeric below is optional;
       # the column may stay logical as well.
       # Rows should also be sorted by Period within each seller
       # (not done here; I am assuming they are already sorted).
       x$Inactive <- as.numeric(c(diff(x$Period) >= 6,FALSE))
       x
       })
# step 4. - assemble again individual sellers back into one data.frame
d <- do.call(rbind,d.tmp)
# step 5. - sort to original order using tmp.key
d <- d[order(d$tmp.key),c("SellerID","Period","Inactive")]
# step 6. - reset row names to match the row order
rownames(d) <- NULL

# here I am just comparing with your wanted output
identical(d, o)
# [1] TRUE

For a data.frame with 1,000,000 rows and a single seller, the run time is about 1 second on an ordinary PC.
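The split/lapply/rbind steps above rely on each seller's rows already being in Period order. As a hedged alternative, the sketch below applies the same rule (a gap of at least 6 to the seller's next sale) with one global sort and a single diff, then writes the flags back in the original row order. The names ord, s, same_seller and inactive are illustrative and not part of the answer above. It is a sketch under those assumptions, not a benchmarked replacement.

```r
# Sample input from the question
d <- data.frame(
  SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 5, 7),
  Period   = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)
)

# Sort once by seller, then by period; remember the permutation
ord <- order(d$SellerID, d$Period)
s <- d[ord, ]
n <- nrow(s)

# Gap from each row to the next row in sorted order (NA for the last row)
gap <- c(diff(s$Period), NA)

# TRUE only where the next sorted row belongs to the same seller,
# so a seller's last sale is never flagged
same_seller <- c(s$SellerID[-1] == s$SellerID[-n], FALSE)

# Write the flags back into the original row positions
inactive <- numeric(n)
inactive[ord] <- as.numeric(same_seller & gap >= 6)
d$Inactive <- inactive

d$Inactive
```

Because nothing is split or re-bound, no temporary key column is needed to restore the row order; the permutation in ord does that job.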

I am assuming one thing here: the maximum value of the Period variable is 12.

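The logic is as follows: order the data frame by seller and period. Then append 12 to the end of each seller's list of periods and take the differences. This will also classify seller 3, who is inactive over a 7-period range.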

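# df is the input data frame from the question (SellerID, Period)
df_s <- df[with(df, order(SellerID, Period)), ]
# split the sorted data so the unlisted gaps line up with the rows of df_s
g <- split(df_s$Period, df_s$SellerID)
l <- lapply(g, function(x) c(x, 12))
j <- lapply(l, diff)
u <- unlist(j, use.names = FALSE)
df_s$ind <- ifelse(u >= 7, 1, 0)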
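To make the two assumptions in this approach explicit, the sketch below wraps the same logic in a function and returns the flags in the original row order. The function name flag_inactive and the parameters horizon (the assumed maximum period, 12 here) and min_gap (7 here) are hypothetical names introduced for illustration only.

```r
# Hypothetical helper: flag_inactive(), horizon and min_gap are illustrative names
flag_inactive <- function(df, horizon = 12, min_gap = 7) {
  # Sort by seller, then period, and remember the permutation
  ord <- order(df$SellerID, df$Period)
  s <- df[ord, ]
  # Per seller: append the horizon and take consecutive differences,
  # which yields exactly one gap per row of s
  gaps <- unlist(
    lapply(split(s$Period, s$SellerID), function(p) diff(c(p, horizon))),
    use.names = FALSE
  )
  # Write the 0/1 flags back into the original row positions
  ind <- numeric(nrow(df))
  ind[ord] <- as.numeric(gaps >= min_gap)
  df$ind <- ind
  df
}

# Sample input from the question
df <- data.frame(
  SellerID = c(1, 7, 4, 3, 1, 7, 4, 2, 5, 1, 2, 5, 7),
  Period   = c(1, 1, 1, 2, 2, 3, 3, 5, 5, 9, 9, 10, 10)
)
res <- flag_inactive(df)
res$ind
# [1] 0 0 0 1 1 1 1 0 0 0 0 0 0
```

Unlike a rule based only on the gap to the next sale, appending the horizon also flags a seller's last recorded sale when it falls at least min_gap periods before the horizon, which is why sellers 3 and 4 are flagged here.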