Split string into multiple rows by capital letters with cSplit

Question

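I have survey data. Some questions allow multiple answers, and in my data the different answers are separated by commas. I want to add a new row to the data frame for each choice. So I have something like this: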

survey$q1 <- c("I like this", "I like that", "I like this, but not much",
 "I like that, but not much", "I like this,I like that", 
"I like this, but not much,I like that")

If the options were separated only by commas, I would use:

survey <- cSplit(survey, "q1", ",", direction = "long")

and get the desired result. Since some commas are part of an answer, I tried using a comma followed by a capital letter as the delimiter:

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

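But for some reason it does not work. It gives no error, but it does not split the strings, and it also removes some rows from the data frame. Then I tried strsplit:

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

This splits the strings correctly, but I cannot work out how to turn each sentence into a separate row of the same column, as cSplit does. The required output is: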

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is there a way to get this using one of these two approaches? Thanks

Answer 1

One option is separate_rows from tidyr:

library(dplyr)
library(tidyr)
survey %>% 
   separate_rows(q1, sep=",(?=[A-Z])")
#                       q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

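cSplit has a fixed argument, which is TRUE by default, so sep is treated as a literal string rather than a pattern. Setting fixed = FALSE makes sep a regular expression, but the call still fails here. As the error message shows, cSplit calls strsplit without perl = TRUE, so the pattern is not evaluated as a PCRE regex and the lookahead (?=[A-Z]) is rejected: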

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
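The same failure can be reproduced with base strsplit alone. The following is a minimal sketch, assuming the error comes from the default regex engine not supporting lookaheads rather than from anything specific to cSplit; the variable names are made up for illustration.

```r
x <- "I like this, but not much,I like that"

# Default regex engine: the lookahead is rejected, so catch the error
# and return a marker string instead.
res_default <- tryCatch(
  strsplit(x, ",(?=[A-Z])"),
  error = function(e) "error"
)

# PCRE engine (perl = TRUE): the lookahead is accepted, and only the comma
# directly followed by a capital letter is used as a separator.
res_perl <- strsplit(x, ",(?=[A-Z])", perl = TRUE)[[1]]
```

Here res_default holds the marker string because the call failed, while res_perl holds the two separate answers.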

One way around this is to modify the column with a function that accepts a PCRE regex (sub/gsub) to change the separator, and then use that new separator in cSplit:

cSplit(transform(survey, q1 = sub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
         "q1", sep=":", direction = "long")
#                        q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that

Data

survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))
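The strsplit route mentioned in the question can also produce long-format rows without extra packages. This is a rough sketch rather than part of the original answer: the survey_demo data frame and its id column are hypothetical, added only to show that other columns are carried along when rows are repeated.

```r
# Hypothetical example data; `id` stands in for any other survey columns.
survey_demo <- data.frame(
  id = 1:3,
  q1 = c("I like this",
         "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split each answer on commas that are directly followed by a capital letter.
parts <- strsplit(survey_demo$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each original row once per piece, then fill in the pieces.
long <- survey_demo[rep(seq_len(nrow(survey_demo)), lengths(parts)), , drop = FALSE]
long$q1 <- unlist(parts)
rownames(long) <- NULL
```

Each original row is repeated as many times as it has answers, so the other columns stay aligned with the split values.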

Answer 2

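@akrun's answer is correct. I just wanted to add that if you need to split some strings into more than two parts, his code works if you run the same line multiple times. I'm not entirely sure why that is, but it works.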
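A likely explanation, assuming this refers to the sub-based workaround: sub replaces only the first match in each string, so each pass splits off just one piece, whereas gsub replaces every match at once. The sketch below uses a made-up three-answer string and assumes ":" never occurs inside an answer; the cSplit call is guarded so the snippet still runs when splitstackshape is not installed.

```r
x <- "I like this,I like that,I like this, but not much"

# sub() rewrites only the first ",<capital letter>" separator ...
one_pass <- sub(",(?=[A-Z])", ":", x, perl = TRUE)

# ... while gsub() rewrites all of them in a single call.
all_at_once <- gsub(",(?=[A-Z])", ":", x, perl = TRUE)

# With every separator rewritten, one cSplit call should be enough.
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  demo <- data.frame(q1 = all_at_once, stringsAsFactors = FALSE)
  res <- splitstackshape::cSplit(demo, "q1", sep = ":", direction = "long")
}
```

Swapping sub for gsub in the workaround should therefore remove the need to run the line repeatedly.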
