R: Count all combinations in a list of strings (specific order)
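I am trying to count all the sequences in a large character vector whose elements are delimited by ">", but only the combinations of elements that sit directly next to each other.
For example, given the character vector: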
[1]Social>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>PaidSearch>OrganicSearch>OrganicSearch>OrganicSearch
[2]Referral>Referral>Referral
I can run the following lines to retrieve all 2-element combinations:
split_fn <- sapply(p , strsplit , split = ">", perl=TRUE)
split_fn <- sapply(split_fn, function(x) paste(head(x,-1) , tail(x,-1) , sep = ">") )
Returns:
[[1]]
[1] "Social>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
[6] "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch" "PaidSearch>PaidSearch"
[11] "PaidSearch>OrganicSearch" "OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch"
[[2]]
[1] "Referral>Referral" "Referral>Referral"
These are all the possible 2-element sequences in my data (split in order).
I now want to get all the possible 3-element results.
For example:
"Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"..."Referral>Referral>Referral"
I tried using
unlist(lapply(strsplit(p, split = ">"), function(i) combn(sort(i), 3, paste, collapse='>')))
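but it returns all combinations, including ones whose elements do not directly follow each other.
I also do not want it to return combinations that join the last value of the first row with the first value of the second row, and so on.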
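Use the stringr package (or regular expressions in general). Note that str_extract_all() returns only non-overlapping matches.
library(stringr)
str_extract_all(p, "(\\w+)>(\\w+)>(\\w+)")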
With overlapping matches (the code could be simplified):
str_extract_all_overlap <- function(x) {
  extractions <- character()
  x_curr <- x
  # First match of three tokens joined by ">"
  extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
  i <- 1
  while (!is.na(extr)) {
    extractions[i] <- extr
    # Drop the leading token so the next match starts one token later
    x_curr <- str_replace(x_curr, "\\w+", replacement = "")
    extr <- str_extract(x_curr, "(\\w+)>(\\w+)>(\\w+)")
    i <- i + 1
  }
  return(extractions)
}
lapply(p, str_extract_all_overlap)
Let's start by creating some data:
set.seed(1)
data <- lapply(1:3, function(i) sample(LETTERS[1:3], rpois(1, 6), replace = TRUE))
data <- sapply(data, paste, collapse = ">")
data
#> [1] "B>B>C>A" "C>B>B>A>A>A>C>B>C" "C>C>B>C>C>A"
Thinking about the problem, it makes sense to treat this data as a list of vectors obtained by splitting each element on the delimiter >:
strsplit(data, ">")
#> [[1]]
#> [1] "B" "B" "C" "A"
#>
#> [[2]]
#> [1] "C" "B" "B" "A" "A" "A" "C" "B" "C"
#>
#> [[3]]
#> [1] "C" "C" "B" "C" "C" "A"
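Now, the core of the problem is finding all consecutive sequences of a given length in a single vector. Once we can do that, applying it to the list of data we have is simple, and converting back to the delimited format is simple too.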
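With that goal in mind, we can create a function that extracts the sequences; here we simply iterate over each starting position and collect all sequences of the given length into a list:
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}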
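As a quick check of the helper on a single vector, here is a minimal sketch. It repeats the seqs() definition from above so that it runs on its own. The input is simply the first split element of the example data, and the variable names (pairs, triples, too_short) are only illustrative.

```r
# Same helper as above, repeated so this snippet is self-contained
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

x <- c("B", "B", "C", "A")

pairs <- seqs(x)         # default window length of 2: three windows
triples <- seqs(x, 3)    # list(c("B", "B", "C"), c("B", "C", "A"))
too_short <- seqs(x, 5)  # NULL, because x has fewer than 5 elements
```

Each window is kept as a character vector, so nothing is pasted together yet.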
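We can now apply the function to the data, after splitting the delimited strings into vectors, to get the result. We also need an extra sapply and paste to convert the data back into the delimited format we started with:
lapply(strsplit(data, ">"), function(x) {
  sapply(seqs(x, 3), paste, collapse = ">")
})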
#> [[1]]
#> [1] "B>B>C" "B>C>A"
#>
#> [[2]]
#> [1] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#>
#> [[3]]
#> [1] "C>C>B" "C>B>C" "B>C>C" "C>C>A"
Going further, to get sequences of several lengths at once, we can add one more layer of iteration:
lapply(strsplit(data, ">"), function(x) {
  unlist(sapply(c(2, 3), function(n) {
    sapply(seqs(x, n), paste, collapse = ">")
  }))
})
#> [[1]]
#> [1] "B>B" "B>C" "C>A" "B>B>C" "B>C>A"
#>
#> [[2]]
#> [1] "C>B" "B>B" "B>A" "A>A" "A>A" "A>C" "C>B" "B>C"
#> [9] "C>B>B" "B>B>A" "B>A>A" "A>A>A" "A>A>C" "A>C>B" "C>B>C"
#>
#> [[3]]
#> [1] "C>C" "C>B" "B>C" "C>C" "C>A" "C>C>B" "C>B>C" "B>C>C" "C>C>A"
Created on 2018-05-21 by the reprex package (v0.2.0).
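Since the question title asks for counts, one possible last step is to tally the extracted sequences with table(). This is only a sketch. It hard-codes the three example strings shown above rather than regenerating them, repeats seqs() so it runs on its own, and uses vapply() in place of sapply() as a personal choice so the result is always a character vector.

```r
# Same helper as above, repeated so this snippet is self-contained
seqs <- function(x, length = 2) {
  if (length(x) < length)
    return(NULL)
  k <- length - 1
  lapply(seq_len(length(x) - k), function(i) x[i:(i + k)])
}

# Example strings copied from the output shown earlier
data <- c("B>B>C>A", "C>B>B>A>A>A>C>B>C", "C>C>B>C>C>A")

# Adjacent triples per string; windows never cross from one string to the next
triples <- lapply(strsplit(data, ">", fixed = TRUE), function(x) {
  vapply(seqs(x, 3), paste, character(1), collapse = ">")
})

# Tally across all strings, most frequent first
counts <- sort(table(unlist(triples)), decreasing = TRUE)
counts
```

With this data, "C>B>C" is the only triple that occurs twice; every other triple occurs once.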
You can also adjust the paste command in the second sapply to:
paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2) , sep = ">")
Your complete code should now look like this:
split_fn <- sapply(p , strsplit , split = ">", USE.NAMES = FALSE)
split_fn <- sapply(split_fn, function(x) paste(head(x,-2), head(tail(x,-1),-1), tail(x,-2), sep = ">") )
Result:
> split_fn
[[1]]
[1] "Social>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
[4] "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
[7] "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch" "PaidSearch>PaidSearch>PaidSearch"
[10] "PaidSearch>PaidSearch>OrganicSearch" "PaidSearch>OrganicSearch>OrganicSearch" "OrganicSearch>OrganicSearch>OrganicSearch"
[[2]]
[1] "Referral>Referral>Referral"
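The head()/tail() pattern above is written out by hand for a length of 3. A possible generalisation to any window length is sketched below. The function name adjacent_ngrams and its arguments are hypothetical, not part of the original answer. It builds one shifted copy of the vector per position and pastes them element-wise.

```r
# Hypothetical helper: all runs of n adjacent elements of x, joined by sep
adjacent_ngrams <- function(x, n = 3, sep = ">") {
  m <- length(x) - n + 1                 # number of windows
  if (m < 1) return(character(0))        # input shorter than the window
  # cols[[j]] holds the j-th element of every window
  cols <- lapply(seq_len(n), function(j) x[j:(j + m - 1)])
  do.call(paste, c(cols, sep = sep))     # element-wise paste of the shifted copies
}

p <- c("Social>PaidSearch>PaidSearch>OrganicSearch",
       "Referral>Referral>Referral")

# One character vector per input string
result <- lapply(strsplit(p, ">", fixed = TRUE), adjacent_ngrams, n = 3)
result
```

Because each string is processed separately inside lapply(), no sequence spans the end of one string and the start of the next.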