R - Check and count repeated substrings and generate a simplified string [Value (repetitions)]
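I'm using the RGA library to pull some Multi-Channel Funnel reports from Google Analytics. One of the dimensions represents the user's channel path, taking traffic sources into account. If, for instance, user A visited our site 7 times last month, channel.path shows something like this: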
Organic Search > Direct > Direct > Referral > Direct > Direct > Direct
I'm trying to simplify this output so that it shows something like this instead:
Organic Search > Direct (x2) > Referral > Direct (x3)
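which is easier to read and mimics the way Google displays channel.path in its front end. This output becomes even more necessary as the number of sessions per user grows, because some channel.paths contain more than 30 consecutive Direct sessions (for example, people who visit our site every day to read the news), which could be reduced to a single Direct (x30).
I assume the first step is to create a list of substrings from each channel.path: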
# Create 3 dummy strings that emulate possible channel.path
arr <- c("Organic Search > Direct > Direct", "Direct > Direct > Direct", "Referral")
# Split the dummy strings into substrings
arrSubStrings <- strsplit(arr, " > ")
which produces the following list:
> arrSubStrings
[[1]]
[1] "Organic Search" "Direct" "Direct"
[[2]]
[1] "Direct" "Direct" "Direct"
[[3]]
[1] "Referral"
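From here, the idea would be to compare each substring with the previous one to check for repetition, store a "counter" between substrings, and use paste to join everything back into a single string. Do you know which package or function I should use to achieve this?
This looks a bit complicated, but the logic is quite simple. After splitting with cSplit from my "splitstackshape" package, it uses rle within "data.table". I've also loaded "dplyr" to make the chained steps a little easier to follow: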
library(splitstackshape)
library(data.table)  ## attached explicitly in case splitstackshape only imports it
library(dplyr)       ## provides the %>% pipe

data.table(ID = 1:length(arr), arr = arr) %>%   ## Create a data.table of arr
  cSplit("arr", ">", "long") %>%                ## Split into a long form
  .[, rle(as.character(arr)), by = .(ID)] %>%   ## Calculate the run lengths
  .[, paste(values,                             ## Paste values and lengths
            sprintf(" (x%s)", lengths),         ## ... after formatting lengths
            collapse = " > ", sep = ""),        ## ... collapsed by >
    by = .(ID)] %>%                             ## ... and grouped by ID
  .[, gsub(" (x1)", "", V1, fixed = TRUE)]      ## Remove the (x1) values
# [1] "Organic Search > Direct (x2)"
# [2] "Direct (x3)"
# [3] "Referral"
# [4] "Organic Search > Direct (x2) > Referral > Direct (x3)"
# [5] "Organic Search (x2) > Direct > Organic Search (x2)"
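The key step in the pipeline above is rle(), which collapses consecutive repeats into one value plus a count. As a small illustration (the `steps` and `runs` names are just for this sketch), here is what it returns for a single path that has already been split:

```r
# One channel path, already split into its individual steps
steps <- c("Organic Search", "Direct", "Direct",
           "Referral", "Direct", "Direct", "Direct")

# rle() groups *consecutive* equal values only, so the two separate
# runs of "Direct" are counted independently
runs <- rle(steps)

runs$values
# [1] "Organic Search" "Direct"         "Referral"       "Direct"
runs$lengths
# [1] 1 2 1 3
```

Everything after that is formatting: pasting each value together with its length and dropping the counts that equal 1.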
Here's the same concept, but done in base R:
arrSplit <- strsplit(arr, " > ", TRUE)
sapply(arrSplit, function(x) {
  A <- rle(x)
  A$lengths <- sprintf("(x%s)", A$lengths)
  temp <- paste(A$values, A$lengths, collapse = " > ", sep = " ")
  gsub(" (x1)", "", temp, fixed = TRUE)
})
# [1] "Organic Search > Direct (x2)"
# [2] "Direct (x3)"
# [3] "Referral"
# [4] "Organic Search > Direct (x2) > Referral > Direct (x3)"
# [5] "Organic Search (x2) > Direct > Organic Search (x2)"
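If this needs to be applied to a whole column of paths, the base R version could be wrapped in a small function. The following is only a sketch along the same lines, not part of the answer as written: `simplify_path` is a made-up name, and it builds the " (xN)" suffix conditionally with ifelse() rather than adding it everywhere and then removing " (x1)" with gsub().

```r
# simplify_path() is a hypothetical helper name, not a function from any package
simplify_path <- function(paths, sep = " > ") {
  vapply(strsplit(paths, sep, fixed = TRUE), function(steps) {
    runs <- rle(steps)
    # Add " (xN)" only for steps that occur more than once in a row
    suffix <- ifelse(runs$lengths > 1, sprintf(" (x%d)", runs$lengths), "")
    paste0(runs$values, suffix, collapse = sep)
  }, character(1))
}

simplify_path("Organic Search > Direct > Direct > Referral > Direct > Direct > Direct")
# [1] "Organic Search > Direct (x2) > Referral > Direct (x3)"

# A long run of Direct sessions, as in the daily-reader case from the question
simplify_path(paste(rep("Direct", 30), collapse = " > "))
# [1] "Direct (x30)"
```

vapply() is used instead of sapply() so the result is always a character vector with one element per input path.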
Sample data (used to produce the outputs above):
arr <- c("Organic Search > Direct > Direct",
         "Direct > Direct > Direct",
         "Referral",
         "Organic Search > Direct > Direct > Referral > Direct > Direct > Direct",
         "Organic Search > Organic Search > Direct > Organic Search > Organic Search")
arr
# [1] "Organic Search > Direct > Direct"
# [2] "Direct > Direct > Direct"
# [3] "Referral"
# [4] "Organic Search > Direct > Direct > Referral > Direct > Direct > Direct"
# [5] "Organic Search > Organic Search > Direct > Organic Search > Organic Search"