Count number of 1's from right to left, stopping at the first 0

I want to count the number of 1s that occur from right to left across several columns, stopping at the first 0.

Example DF:

df <- data.frame(replicate(7, sample(0:1, 30, replace = TRUE)))
colnames(df) <- seq(1950, 2010, 10)

For example, I manually entered the desired result in a new column "condition":

Thanks in advance for your help.

We can loop over the rows and use rle

df$condition <- apply(df, 1, function(x) {x1 <- rle(x)
      x2 <- tail(x1$lengths, 1)[tail(x1$values, 1)==1]
      if(length(x2)==0) 0 else x2})
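
To see what rle contributes here, the following is a minimal sketch on a single made-up row. The helper name trailing_ones_rle is hypothetical (it is not part of the answer), and it assumes a non-empty vector containing only 0s and 1s. The last run reported by rle is the rightmost one, so its length is the answer whenever its value is 1.

```r
# Hypothetical helper: length of the run of 1s at the right end of a vector.
# Assumes x is non-empty and contains only 0s and 1s.
trailing_ones_rle <- function(x) {
  r <- rle(x)                  # run-length encoding, read left to right
  n <- length(r$lengths)       # index of the last (rightmost) run
  if (r$values[n] == 1) r$lengths[n] else 0L
}

x <- c(1, 0, 1, 1, 0, 1, 1)    # made-up row, for illustration only
rle(x)$lengths                 # 1 1 2 1 2
rle(x)$values                  # 1 0 1 0 1
trailing_ones_rle(x)           # 2
trailing_ones_rle(c(1, 1, 0))  # 0, because the row ends in 0
```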

Another option is str_extract

library(stringr)
v1 <- str_extract(do.call(paste0, df), "1+$")
df$condition <- ifelse(is.na(v1), 0, nchar(v1))

Or, slightly more efficiently, with stringi

library(stringi)
v1 <- stri_count(stri_extract(do.call(paste0, df), regex = "1+$"), regex = ".")
v1[is.na(v1)] <- 0
df$condition <- v1

Or use a more compact option

stri_count(do.call(paste0, df), regex = '(?=1+$)')
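
The string-based options above all paste each row into one string and then look at the run of 1s anchored at the end of that string. As a rough illustration of the same idea, here is a base-R variant (an assumption for illustration, not one of the options above) on a made-up data frame named toy: strip the trailing run of 1s with sub and compare string lengths.

```r
# Made-up data, for illustration only; column names are hypothetical.
toy <- data.frame(y1 = c(1, 1, 1),
                  y2 = c(0, 1, 1),
                  y3 = c(1, 0, 1),
                  y4 = c(1, 0, 1))

s <- do.call(paste0, toy)        # one string per row: "1011" "1100" "1111"

# Remove the run of 1s at the end of each string; the number of
# characters removed is the count of trailing 1s.
trailing <- nchar(s) - nchar(sub("1+$", "", s))
trailing                         # 2 0 4
```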

[Edit: works now]

Try this

df$condition <-  apply(df,1,function(x){x<- rev(x);m <- match(0,x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])})

We match the first 0 in the reversed row and then sum up to that element. If there is no zero, we sum the whole row.
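
The steps can be traced on a single made-up row. This is only a sketch, assuming 0/1 data; the function name trailing_ones_match is hypothetical and simply wraps the same logic as the one-liner.

```r
# Hypothetical wrapper around the match-based logic; assumes 0/1 values.
trailing_ones_match <- function(x) {
  y <- rev(x)                  # read the row from right to left
  m <- match(0, y)             # position of the first 0, or NA if none
  if (is.na(m)) sum(y) else sum(y[1:m])
}

x <- c(1, 0, 1, 1, 0, 1, 1)    # made-up row, for illustration only
rev(x)                         # 1 1 0 1 1 0 1
match(0, rev(x))               # 3
trailing_ones_match(x)         # 2 (the 0 at position 3 adds nothing to the sum)
trailing_ones_match(c(1, 1, 1))  # 3, no zero so the whole row is summed
```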


Here is a benchmark of all the solutions:

library(stringr)
library(microbenchmark)
microbenchmark(
Moody_Mudskipper =  apply(df,1,function(x){x<- rev(x);m <- match(0,x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])}),
akrun =  apply(df, 1, function(x) {x1 <- rle(x)
                                          x2 <- tail(x1$lengths, 1)[tail(x1$values, 1)==1]
                                          if(length(x2)==0) 0 else x2}),
akrun2 = str_count(do.call(paste0, df), "[1]+$"),
roland = apply(df, 1, function(x) {y <- rev(x);sum(y * cumprod(y != 0L))}),
David_Arenburg  = ncol(df) - max.col(-df, ties = "last"),
times = 10)

# Unit: microseconds
#                     expr      min       lq      mean   median       uq      max neval
#         Moody_Mudskipper 1437.948 1480.417 1677.1929 1536.159 1597.209 3009.320    10
#                    akrun 6985.174 7121.078 7718.2696 7691.053 7856.862 9289.146    10
#                   akrun2 1101.731 1188.793 1290.8971 1226.486 1343.099 1790.091    10
#                   akrun3  693.315  791.703  830.3507  820.371  884.782 1030.240    10
#                   roland 1197.995 1270.901 1708.5143 1332.305 1727.802 4568.660    10
#           David_Arenburg 2845.459 3060.638 3406.3747 3167.519 3495.950 5408.494    10
# David_Arenburg_corrected 3243.964 3341.644 3757.6330 3384.645 4195.635 4943.099    10

With a larger example, David's solution is indeed the fastest, as noted in the comments on the accepted solution:

df <- data.frame(replicate(7, sample(0:1, 1000, replace = TRUE)))

# Unit: milliseconds
#                     expr        min         lq       mean     median         uq        max neval
#         Moody_Mudskipper  31.324456  32.155089  34.168533  32.827345  33.848560  44.952570    10
#                    akrun 225.592061 229.055097 238.307506 234.761584 241.266853 271.000470    10
#                   akrun2  28.779824  29.261499  33.316700  30.118144  38.026145  46.711869    10
#                   akrun3  14.184466  14.334879  15.528201  14.633227  17.237317  18.763742    10
#                   roland  27.946005  28.341680  29.328530  28.497224  29.760516  33.692485    10
#           David_Arenburg   3.149823   3.282187   3.630118   3.455427   3.727762   5.240031    10
# David_Arenburg_corrected   3.464098   3.534527   4.103335   3.833937   4.187141   6.165159    10

df$condition <- apply(df, 1, function(x) {
  y <- rev(x)
  sum(cumprod(y))
})
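
Why this works can be seen on a single made-up row: the cumulative product of the reversed row stays 1 until the first 0 and is 0 from then on, so its sum is the number of 1s before that 0. The sketch below assumes the data are strictly 0/1; with other values the product would no longer be an indicator.

```r
x <- c(1, 0, 1, 1, 0, 1, 1)    # made-up row, for illustration only
y <- rev(x)                    # 1 1 0 1 1 0 1
cp <- cumprod(y)               # 1 1 0 0 0 0 0
sum(cp)                        # 2
sum(cumprod(rev(c(1, 1, 1))))  # 3, a row with no zero counts every column
```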

Here is a fully vectorized attempt

indx <- rowSums(df) == ncol(df) # Per Jaaps comment
df$condition <- ncol(df) - max.col(-df, ties = "last")
df$condition[indx] <- ncol(df) - 1

This basically finds the first zero from the right and counts the columns to its right (which, in binary data, are all 1s)

Edit

Had to add handling for the special case where a row is all 1s
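
Both the main idea and the all-1s special case can be traced on a small made-up matrix. This is a sketch under the assumption of strictly 0/1 data, and the names m, last_zero and res are hypothetical. Negating the matrix makes every 0 the row maximum, so max.col with ties.method = "last" returns the column of the rightmost zero. In a row with no zero every entry ties, the last column is returned, and the count would wrongly come out as 0.

```r
# Made-up 0/1 matrix, for illustration only.
m <- rbind(c(1, 0, 1, 1),
           c(1, 1, 0, 0),
           c(1, 1, 1, 1))

# Column index of the rightmost 0 in each row (last column if no 0 exists).
last_zero <- max.col(-m, ties.method = "last")   # 2 4 4

res <- ncol(m) - last_zero                       # 2 0 0, third row is wrong
res[rowSums(m) == ncol(m)] <- ncol(m)            # fix rows that are all 1s
res                                              # 2 0 4
```

Here the correction uses ncol(m) directly because no extra column has been added to m. In the data frame version the condition column has already been appended by the time the correction runs, which is why that code subtracts 1.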