Regex - Split String on Comma, Skip Anything Between Balanced Parentheses

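I need to write a regular expression in R (Perl-style, perl=TRUE) that splits a string on commas ',' but skips every comma that sits between parentheses. The challenge is keeping the parentheses balanced, i.e. each closing parenthesis must map back to its own opening parenthesis.

The regex below works fine except for one thing: the parentheses are not balanced. An inner closing parenthesis is being paired with the outer opening parenthesis.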

text <- "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT), GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE), DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR)), ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT), MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"

strsplit(text, '\\([^*)^)]*\\)(*SKIP)(*F)|\\,', perl=TRUE)

With the regex above, the DRIED CRANBERRIES entry is not split correctly. See the output screenshot here: Regex Code Output
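The same failure can be reproduced on a much shorter string. This is only a sketch: `small` is a made-up input, not part of the original data, and the pattern is the one from the question with its backslashes doubled for R.

```r
# Hypothetical short input with nested parentheses (not from the original data)
small <- "A (a, b (c), d (e)), F (f, g)"

# Pattern from the question: [^*)^)]* cannot cross a ")", so the skipped
# span ends at the FIRST closing parenthesis after an opening one
broken <- strsplit(small, '\\([^*)^)]*\\)(*SKIP)(*F)|\\,', perl = TRUE)[[1]]
print(broken)
# three pieces: "A (a, b (c)", " d (e))", " F (f, g)"
```

The comma after `(c)` is still inside the outer parentheses, yet it is treated as a split point because the skipped span already ended at `(c)`.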

Any help would be appreciated. Thanks!

An edit to the accepted answer of this question seems to do the job. I only added [[:alpha:][:space:]]* at the beginning.

pat <- '[[:alpha:][:space:]]*\\(((?>[^()]+)|(?R))*\\)'
regmatches(text, gregexpr(pat, text, perl = TRUE))
#[[1]]
#[1] "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
#[2] " GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE)"
#[3] " DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR))"
#[4] " ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
#[5] " MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"
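One thing to keep in mind with this approach: it extracts matches instead of splitting, and the pattern requires a parenthesized part, so an item without parentheses is not returned. A minimal sketch, assuming a hypothetical input `mixed` that is not part of the original data:

```r
# Hypothetical input: the first item has no parenthesized part
mixed <- "SALT, PEANUTS (A, B)"

# Same pattern as above, with backslashes doubled for the R string
pat <- '[[:alpha:][:space:]]*\\(((?>[^()]+)|(?R))*\\)'
extracted <- regmatches(mixed, gregexpr(pat, mixed, perl = TRUE))[[1]]
print(extracted)
# only " PEANUTS (A, B)" is returned; "SALT" is dropped
```

For the ingredient list in the question this does not matter, because every item there has a parenthesized part.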

You can use

strsplit(text, "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)|,", perl=TRUE)
# => [[1]]
# [1] "PEANUTS (PEANUTS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
# [2] " GOLDEN RAISINS (RAISINS, SULFUR DIOXIDE)"
# [3] " DRIED CRANBERRIES (CRANBERRIES, SUGAR, CITRIC ACID, SUNFLOWER OIL (PROCESSING AID), ELDERBERRY JUICE CONCENTRATE (COLOR))"
# [4] " ALMONDS (ALMONDS, PEANUT OIL AND/OR COTTONSEED OIL AND/OR CANOLA OIL AND/OR SOYBEAN OIL, SALT)"
# [5] " MACADAMIAS (MACADAMIAS, MALTODEXTRIN, SALT)"

See the regex demo and an online R demo.

Details

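  • (\((?:[^()]++|(?1))*\)) - Capturing group #1:
    • \( - a ( character
    • (?:[^()]++|(?1))* - zero or more repetitions of either one or more characters other than ( and ) (matched possessively by [^()]++) or (|) the whole Group 1 pattern, (?1) (this is the recursion that handles every nesting level)
    • \) - a ) character
  • (*SKIP)(*F) - these two verbs make the engine discard the text matched so far and resume the search right after it, so a balanced parenthesized span can never contain a split point
  • | - or
  • , - a comma.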
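The pieces described above can be tried one at a time on a short string. This is a sketch under assumptions: `small`, `balanced`, `spans` and `parts` are illustrative names, and the `trimws()` step is an optional extra that is not part of the answer.

```r
# Hypothetical short input with two levels of nesting
small <- "A (a, b (c), d (e)), F (f, g)"

# Group 1 on its own: one balanced (...) span, found by recursing into (?1)
balanced <- "(\\((?:[^()]++|(?1))*\\))"
spans <- regmatches(small, gregexpr(balanced, small, perl = TRUE))[[1]]
print(spans)
# two spans: "(a, b (c), d (e))" and "(f, g)"

# Full pattern: skip every balanced span, split on the commas that remain
parts <- strsplit(small, paste0(balanced, "(*SKIP)(*F)|,"), perl = TRUE)[[1]]

# trimws() drops the leading space left behind after each comma
print(trimws(parts))
# two items: "A (a, b (c), d (e))" and "F (f, g)"
```

Building the pattern with `paste0()` keeps the recursive group readable and makes it easy to test separately from the SKIP/FAIL part.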