R: Count all combinations in a list of strings (specific order)

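I am trying to count all sequences in a large character vector whose strings are delimited by ">", but only the combinations whose elements directly follow one another.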

For example, given the character vector:

[1]Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
[2]Referral>Referral>Referral

I can run the following lines to retrieve all combinations of 2 consecutive elements:

split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )

Returns:

[[1]]

 [1] "Social>PaidSearch"           "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
 [6] "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"       "PaidSearch>PaidSearch"      
[11] "PaidSearch>OrganicSearch"    "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"

[[2]]

[1] "Referral>Referral" "Referral>Referral"

These are all possible 2-element sequences in my data, split in order.

I now want to get all possible results for 3 elements.

For example:

"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"

I tried using

unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))

but it returns all combinations, including ones whose elements do not directly follow one another.

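I also do not want it to return combinations that span rows, such as the last value of the first row followed by the first value of the second row.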

Use the stringr package (or regular expressions in general).

library(stringr)
str_extract_all(p, "(\\w+)>(\\w+)>(\\w+)")

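Note that str_extract_all returns only non-overlapping matches. A version that also returns the overlapping ones follows, though the code could probably be simplified.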

str_extract_all_overlap <- function(x) {
  # Backslashes must be doubled inside R string literals
  pattern <- "(\\w+)>(\\w+)>(\\w+)"
  extractions <- character()
  x_curr <- x
  extr <- str_extract(x_curr, pattern)
  i <- 1
  while (!is.na(extr)) {
    extractions[i] <- extr
    # Drop the first element so the next match starts one element later
    x_curr <- str_replace(x_curr, "\\w+", replacement = "")
    extr <- str_extract(x_curr, pattern)
    i <- i + 1
  }
  return(extractions)
}

lapply(p, str_extract_all_overlap)

Let's start by creating some data:

set.seed(1)

data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), replace = TRUE))
data <- sapply(data, paste, collapse = ">")

data
#> [1] "B>B>C>A"           "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"

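To reason about the problem, it makes sense to treat this data as a list of the vectors obtained by splitting each element on the separator >:
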
strsplit(data, ">")
#> [[1]]
#> [1] "B" "B" "C" "A"
#> 
#> [[2]]
#> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
#> 
#> [[3]]
#> [1] "C" "C" "B" "C" "C" "A"

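Now, the core of the problem is finding all consecutive sequences of a given length within a single vector. Once we can do that, applying it to the list of data we have is straightforward, and converting back to the delimited format is simple too.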

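With that goal in mind, we can create a function to extract the sequences. Here we simply iterate over each starting position and collect all sequences of the given length into a list: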

seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}
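
As a quick sanity check, here is a small sketch of what the function returns for a single vector, including the case where the vector is shorter than the requested length. The function is repeated so the snippet runs on its own; the names x, pairs, triples and too_long are only illustrative.

```r
# Same definition as above, repeated so this snippet is self-contained
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

x <- c("B", "B", "C", "A")

pairs    <- seqs(x, 2)  # list of 3 windows: (B, B), (B, C), (C, A)
triples  <- seqs(x, 3)  # list of 2 windows: (B, B, C), (B, C, A)
too_long <- seqs(x, 5)  # NULL, because x has fewer than 5 elements
```

Each window is still a character vector at this point, so nothing has been pasted back together yet.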

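We can now apply the function to the data, after splitting the delimited strings into vectors, to get the result. We also need an extra sapply with paste to convert the data back into the delimited format we started with: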

lapply(strsplit(data, ">"), function(x) {
  sapply(seqs(x, 3), paste, collapse = ">")
})
#> [[1]]
#> [1] "B>B>C" "B>C>A"
#> 
#> [[2]]
#> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Going further, to get sequences of several lengths at once, we can add another layer of iteration:

lapply(strsplit(data, ">"), function(x) {
  unlist(sapply(c(2, 3), function(n) {
    sapply(seqs(x, n), paste, collapse = ">")
  }))
})
#> [[1]]
#> [1] "B>B"   "B>C"   "C>A"   "B>B>C" "B>C>A"
#> 
#> [[2]]
#>  [1] "C>B"   "B>B"   "B>A"   "A>A"   "A>A"   "A>C"   "C>B"   "B>C"  
#>  [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#> 
#> [[3]]
#> [1] "C>C"   "C>B"   "B>C"   "C>C"   "C>A"   "C>C>B" "C>B>C" "B>C>C" "C>C>A"

Created on 2018-05-21 by the reprex package (v0.2.0).
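
Since the question asks for counts, one way to finish (a sketch, not part of the answer above) is to flatten the per-row results and tabulate them with table(). The data vector is hard-coded here so the snippet does not depend on the random seed, and windows is a hypothetical helper name introduced only for this example.

```r
data <- c("B>B>C>A", "C>B>B>A>A>A>C>B>C", "C>C>B>C>C>A")

# Hypothetical helper: all runs of n consecutive elements of one
# delimited string, pasted back together with ">"
windows <- function(s, n) {
  x <- strsplit(s, ">", fixed = TRUE)[[1]]
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = ">"),
         character(1))
}

# Windows are built per row, so no sequence spans two rows
all_triples <- unlist(lapply(data, windows, n = 3))

# Frequency of each 3-element sequence, most common first
counts <- sort(table(all_triples), decreasing = TRUE)
counts
```

Because the splitting happens row by row before anything is flattened, the last value of one row is never combined with the first value of the next.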

You can also adjust the paste command in the second sapply to:

paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2) , sep = ">")

Your complete code should now look like this:

split_fn <- sapply(p , strsplit , split = ">", USE.NAMES = FALSE)

split_fn <- sapply(split_fn, function(x) paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2), sep = ">") )

Result:

> split_fn
[[1]]
 [1] "Social>PaidSearch>PaidSearch"              "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [4] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
 [7] "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"          "PaidSearch>PaidSearch>PaidSearch"         
[10] "PaidSearch>PaidSearch>OrganicSearch"       "PaidSearch>OrganicSearch>OrganicSearch"    "OrganicSearch>OrganicSearch>OrganicSearch"

[[2]]
[1] "Referral>Referral>Referral"
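
If you need lengths other than 3, the same shifting idea can be wrapped in a small helper. This is only a sketch: ngrams is a made-up name, and p below is a shortened stand-in for the real data. It also guards against rows with fewer than n elements, where paste() on zero-length inputs would otherwise return a bare ">>" instead of nothing.

```r
# Hypothetical helper generalising the head()/tail() shifting to any n
ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  m <- length(x) - n + 1
  # n shifted copies of x, each of length m: x[1:m], x[2:(m + 1)], ...
  shifted <- lapply(seq_len(n), function(k) x[k:(k + m - 1)])
  do.call(paste, c(shifted, sep = ">"))
}

# Shortened illustrative input, not the full vector from the question
p <- c("Social>PaidSearch>PaidSearch>OrganicSearch",
       "Referral>Referral>Referral",
       "Direct>Direct")

result <- lapply(strsplit(p, ">", fixed = TRUE), ngrams, n = 3)

# What the unguarded paste() gives for a row that is too short
unguarded <- paste(character(0), character(0), character(0), sep = ">")
```

Using lapply rather than sapply here keeps the result a list even when every row happens to yield the same number of sequences.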