strsplit doesn't always split on '?'
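I want (for LSAfun::genericSummary) to split some strings on c(".", "!", "?"). I use the option fixed = TRUE, but it still returns the wrong result.
I want to understand why it doesn't work, because I can't modify the call. It is not called directly but through LSAfun::genericSummary, and the summary is not what I expect because of the unexpected strsplit result.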
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = c(".", "!", "?"), fixed = TRUE)[[1]]
returns:
[1] "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"
Expected:
[1] "Faut-il reconnaitre le vote blanc " " Faut-il rendre le vote obligatoire " ""
I'm lost... does anyone have an explanation?
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.0 yaml_2.1.18
The function:
function (text, k, split = c(".", "!", "?"), min = 5, breakdown = FALSE,
    ...)
{
    sentences <- unlist(strsplit(text, split = split, fixed = T))
    if (breakdown == TRUE) {
        sentences <- breakdown(sentences)
    }
    sentences <- sentences[nchar(sentences) > min]
    td = tempfile()
    dir.create(td)
    for (i in 1:length(sentences)) {
        docname <- paste("sentence", i, ".txt", sep = "")
        write(sentences[i], file = paste(td, docname, sep = "/"))
    }
    A <- textmatrix(td, ...)
    rownames <- rownames(A)
    colnames <- colnames(A)
    A <- matrix(A, nrow = nrow(A), ncol = ncol(A))
    rownames(A) <- rownames
    colnames(A) <- colnames
    unlink(td, T, T)
    Vt <- lsa(A, dims = length(sentences))$dk
    snum <- vector(length = k)
    for (i in 1:k) {
        snum[i] <- names(Vt[, i][abs(Vt[, i]) == max(abs(Vt[,
            i]))])
    }
    snum <- gsub(snum, pattern = "[[:alpha:]]", replacement = "")
    snum <- gsub(snum, pattern = "[[:punct:]]", replacement = "")
    snum <- as.integer(snum)
    summary.sentences <- sentences[snum]
    return(summary.sentences)
}
<environment: namespace:LSAfun>
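For multiple split elements, place them inside [] and remove fixed = TRUE, or paste them into a single pattern with | so that the string is split on any one of them: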
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
split = "[.!?]")[[1]]
According to ?strsplit:
split - If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
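The recycling rule quoted above is what explains the result in the question: with a single input string, only the first element of split (".") is ever applied, and the text contains no ".". A minimal sketch of that behaviour, using short made-up strings rather than the original sentence:

```r
# One input string: only the first element of `split` (".") is used,
# so a string containing only "?" is returned unchanged.
one <- strsplit("a?b", split = c(".", "!", "?"), fixed = TRUE)[[1]]

# Three input strings: `split` is recycled along x, so each string
# is paired with its own delimiter and every one of them is split.
many <- strsplit(c("a.b", "c!d", "e?f"),
                 split = c(".", "!", "?"), fixed = TRUE)

print(one)   # "a?b"
print(many)  # list of c("a", "b"), c("c", "d"), c("e", "f")
```

In other words, a vector passed as split is a set of per-element delimiters, not a set of alternatives.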
You can also omit the fixed = TRUE part and escape the characters, i.e.
strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?", "\\.|!|\\?")
Of course, it won't be as efficient, since we are going through the regex engine.
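Since the question says the strsplit call inside genericSummary cannot be changed (it hard-codes fixed = T), one possible workaround is to normalise the text before passing it in, so that a single fixed delimiter is enough. This is only a sketch under that assumption; the variable names are illustrative and it has not been tested against LSAfun itself:

```r
text <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# Replace every "!" and "?" with "." so that one fixed delimiter covers all cases.
normalized <- gsub("[!?]", ".", text)

# This mirrors the call made inside the function, but with a length-one `split`.
sentences <- strsplit(normalized, split = ".", fixed = TRUE)[[1]]
print(sentences)
```

The normalised text could then be passed to genericSummary together with split = ".". Note that strsplit does not return a trailing empty string when the input ends with the delimiter, so this yields two sentences rather than the three elements shown as expected in the question.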