Remove text between comma and dash in R with regular expressions

Question

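I want to remove the text between a comma and a dash in a long string of variable labels that is stored comma-separated. Here is a minimal example of my string: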

myvarlabels <- ("participant number, How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")

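Importantly, the variable labels come in two different forms and should be shortened as follows:

How much do you like the following products-green tea
should be reduced to: green tea
\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\"
should be reduced to: \"Japanese, Chinese, and Indian green tea\"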

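I tried using gsub and regular expressions to identify and then remove the text between a comma and a dash (i.e. replace it with "").

Does anyone have a suggestion for how I can use gsub to remove the text between a comma, which marks the start of a new column, and a dash, which is followed by the text I want to keep, while preserving the double quotes?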

Edit 1

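To be more precise, the data contains four types of comma-separated text blocks. Each of them specifies what information the corresponding variable contains:

1. A short description of one or more words (e.g. participant number)
2. A longer description where the relevant information appears only after the dash (e.g. How much do you like the following products-green tea)
3. Same as above, but with commas somewhere before the dash (e.g. How much, if anything at all, would you ...); this is why this type of text block is preceded and followed by \" (otherwise it could not be read correctly)
4. Same as above, but without a comma before the dash (e.g. How much experience do you have with the following products)

These four types of text sequences are each preceded and followed by a comma and can appear in any order.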

Here is a new minimal example that reflects the real data more accurately than my first one:

(myvarlabels3 <- ("participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"))

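Cath's code (Edit 2) works to some extent. When I add more "simple" type 1 text sequences at the start of the string, or when I add text sequences of the kind specified under 4. in the list above, the code no longer works correctly.

However, when Cath's code from Edit 2 is run in two steps, it works perfectly: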

myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE) # step 1: shorten the text sequences specified under 3. and 4. in the list above

[1] "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"

gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE) # step 2: shorten the text sequences specified as 2. in the above list

[1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"

I suspect this could be done in a single line of code, but I don't know how. In any case, this will greatly ease my workflow when I import messy csv files from Qualtrics.
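Not from the original thread: one way to sidestep the quoting problem is to split the string into fields first and shorten each field separately. The sketch below assumes that base R's `scan()` with `quote = "\""` treats each double-quoted block as one field, and that the text to keep never contains a dash itself; `fields` and `short_labels` are hypothetical names. Note that this drops the surrounding double quotes, so it suits the case where the labels are used as a vector rather than written back as one string.

```r
myvarlabels3 <- "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"

# Split on commas while treating double-quoted blocks as single fields.
fields <- scan(text = myvarlabels3, what = "", sep = ",",
               quote = "\"", quiet = TRUE)

# Drop everything up to and including the first dash in each field;
# fields without a dash are left unchanged.
short_labels <- sub("^[^-]*-", "", fields)
print(short_labels)
```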

Answer 1

I'm not sure I understand what your desired output is, but you could try using "How much" to locate the "start of a new column" and then continue until you "meet" the dash:

gsub("(^[^,]+, )|(How much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

Edit

Given your patterns, you could try the following:

gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

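Based on the two possible forms you describe, I used two alternative patterns, each with a lookbehind to specify what must be present before the match but should be kept.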
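To illustrate why the lookbehind matters, here is a minimal sketch on a made-up string (`toy` and the two result names are hypothetical, not from the question). With a plain comma in the pattern, the separator is part of the match and gets deleted; with `(?<=,)` it is only checked and stays in the result.

```r
# Toy input, invented for illustration only.
toy <- "id,some question-answer"

# The comma is part of the match, so it is removed along with the text.
without_lookbehind <- gsub(",[^-,]+-", "", toy, perl = TRUE)

# The comma is only asserted by the lookbehind, so it survives.
with_lookbehind <- gsub("(?<=,)[^-,]+-", "", toy, perl = TRUE)

print(without_lookbehind)  # "idanswer"
print(with_lookbehind)     # "id,answer"
```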

Edit 2

If there is no space between the comma and a question that does not start with a quote, you can do this:

myvarlabels_2 <- ("participant number,How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""
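For the `myvarlabels3` example from Edit 1 of the question, the two steps can probably be folded into a single call. The sketch below is an assumption-laden variant of the pattern above, not a tested general solution: the second alternative no longer requires a comma before the dash, so it also covers quoted labels of type 4. It assumes that every quoted label contains a dash and that the text to keep contains none; a quoted label without a dash would let the second alternative run on into the next field, and a label at the very start of the string is not preceded by a comma and is left untouched. `one_pass` and `shortened` are hypothetical names.

```r
myvarlabels3 <- "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"

# Alternative 1: unquoted label; excludes commas and quotes so the match
# cannot run past the current field.
# Alternative 2: quoted label; may or may not contain commas before the dash.
one_pass <- "(?<=,)[^-\",]+-|(?<=,\")[^-]+-"

shortened <- gsub(one_pass, "", myvarlabels3, perl = TRUE)
print(shortened)
```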
