R中的洗牌向量,但相同的元素应该有最小距离

Shuffle Vector in R, But Identical Elements Should Have Minimum Distance

我想要randomize/shuffle一个向量。一些矢量元素是相同的。洗牌后,相同元素的最小距离应为三(即其他两个元素应位于相同元素之间)。

考虑 R 中的以下示例向量:

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"

如果我使用 sample 函数打乱我的向量,一些相同的元素可能靠得太近。例如,如果我使用以下 R 代码,元素“C”将紧接着出现在位置 5 和 6 处:

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"

如何确保相同元素的最小距离为三?

所以基本上我们需要有条件地从 x 向量中抽取一个在 min.dist-1 运行中未被选择的元素。使用 purrr 的 reduce 我们可以实现这个:

min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"

捆绑在一个函数中

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
 [1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
 [1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
 [1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
 [1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"

@27ϕ9 评论后:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))

A B C D E 
3 3 3 3 3 
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument

更新

经过反复试验,考虑到并非总是有足够的元素来 space 出 min.dist 这一事实,我想出了一个解决方案,这段代码解释得最多来自上面的:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), function(.x, ...){
        # whether the value is in the tail of the aggregated vector
        in.tail <- x %in% tail(.x, min.dist)
        # whether a value still hasn't reached the max frequency
        freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
        # whether a value isn't in the aggregated vector
        yet <- !x %in% .x
        # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
         c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
    }, .init=sample(x,1))
}

现在 运行 table(shuffle(rep(LETTERS[1:5], 3),2)) 对于所有变量总是 return 3 我们可以肯定地说在向量中变量是 spaced最小距离为 2。保证没有元素重复的唯一方法是使用 min.dist=length(unique(x))-1,否则会出现最大 r < min.dist 元素与其最后一次出现的距离不 min.dist 的情况,如果存在这样的元素,它们将位于结果向量的 length(x) + 1 - 1:min.dist 子集中。

只是为了完全确定使用循环来检查输出向量的尾部是否具有唯一值:(删除打印语句我只是为了演示目的使用它)

shuffler <- function(x, min.dist=2){
    while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
    l
}

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
 [1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"

A B C D E 
3 3 3 3 3 

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
 [1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"

A B C D E 
3 3 3 3 3 

更新:

shuffler <- function(x, min.dist=2){
    while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
    l
}

这个新版本对向量尾部的元素是否min.dist进行了严格的测试,之前的版本适用于min.dist=2,但是这个新版本做了更好的测试。

我希望这个答案对你有用。它是用 base R 完成的,但它有效。如果你想逐行检查,我离开打印:

x <- rep(LETTERS[1:5], 3)  # Create example vector


shuffle <- function(x, min_dist=3){
  #init variables   
  result<-c() # result vector
  count<-0
  vec_use<-x
  vec_keep<-c()
  for(i in 1:length(x)){
#    print(paste0("iteration =", i))
    if (count>min_dist){
      valback<-vec_keep[1]
#      print(paste0("value to be returned:",  valback))
      ntimes_valback<-(table(vec_keep)[valback])
      vec_use<- c(vec_use,rep(valback,ntimes_valback))
#      print(paste0("vec_use after giving back valbak =", valback))
#      print(paste0(vec_use,","))
      vec_keep <- vec_keep[!vec_keep %in% valback]
#      print(paste0("vec_keep after removing valback =", valback))
#      print(paste0(vec_keep,","))
    }
    val<-sample(vec_use,1)
#    print(paste0("val = ",val))#remove value
    vec_keep<- c(vec_keep,x[x %in% val])
    vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
#    print(paste0("vec_keep ="))
#    print(paste0(vec_keep,","))
    vec_use <- vec_use[!vec_use %in% val]
#    print(paste0("vec_use ="))
#    print(paste0(vec_use,","))
    result[i]<-val
    count<-count+1
    }
  return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"

如果您的数据很大,那么依靠概率来完成这类任务可能会(方式)更快。

这是一个例子:

prob_shuffler = function(x, min.dist = 2){
    n = length(x)
    res = sample(x)
    OK = FALSE
    
    # We loop until we have a solution
    while(!OK){
        OK = TRUE
        for(i in 1:min.dist){
            # We check if identical elements are 'i' steps away
            pblm = res[1:(n-i)] == res[-(1:i)]
            if(any(pblm)){
                if(sum(pblm) >= (n - i)/2){
                    # back to square 1
                    res = sample(x)
                } else {
                    # we pair each identical element with 
                    # an extra one
                    extra = sample(which(!pblm), sum(pblm))
                    id_reshuffle = c(which(pblm), extra)
                    res[id_reshuffle] = sample(res[id_reshuffle])
                }

                # We recheck from the beginning
                OK = FALSE
                break
            }
        }
    }

    res
}

尽管 while 循环看起来很可怕,但实际上收敛速度很快。当然,两个字符在 min.dist 远的概率越低,收敛越快。

@Abdessabour Mtk 和@Carles Sans Fuentes 的当前解决方案有效,但根据输入数据的大小,速度很快会变得非常慢。这是一个基准:

library(microbenchmark)

x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520

microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10

microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10

microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10

我们可以注意到两点:a) 在所有逻辑中,prob_shuffler 的速度取决于 min.dist 而其他方法则没有那么多,b) prob_shuffler 大约是 100 - 仅 520 次观察(并且它可以缩放)更快。

当然,如果 min.dist 处有两个相同字符的概率非常高,那么递归方法应该更快。但在大多数实际情况下,概率法更快。