Regex for replacing commas only within brackets

I have a dataset of ingredients in which each row is a comma-separated list of ingredients, for example:

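Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour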

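I want to parse the file and replace commas with semicolons only where the commas are inside brackets. Brackets can be nested to any depth and can contain any number of commas. The result should look like this: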

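Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour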

Could I get some help with a regular expression that solves this? Thanks in advance.

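1) gsubfn. With gsubfn this can be done without a complex regular expression. The regular expression consisting of a single dot matches one character at a time. For each string in the input character vector, the pre function initializes the counter k to 0, and then fun is run for each match, with the matched character passed as the x argument. Inside fun, the counter k is incremented by 1 each time a ( is encountered and decremented by 1 each time a ) is encountered. If the counter is non-zero and the character is a comma, a semicolon is returned to replace it; otherwise the input character is returned unchanged. This is vectorized: if the input s is a character vector, each component is processed separately.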

library(gsubfn)

p <- proto(k = 0, 
  pre = function(this) this$k <- 0,
  fun = function(this, x) {
    if (x == "(") this$k <- k + 1
    if (x == ")") this$k <- k - 1
    if (k && x == ",") ";" else x
  })
gsubfn(".", p, s)

giving:

[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"

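2) Base R. A base R solution is to split the input into single characters, giving a list of character vectors L. Then, for each component chars of L, create a counter vector k of the same length as chars that holds the number of ( minus the number of ) seen up to that point. Replace the commas that correspond to a non-zero k with semicolons, and collapse chars back into a single string. Like (1), this works on character vectors.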

L <- strsplit(s, "")
sapply(L, function(chars) {
  k <- cumsum((chars == "(") - (chars == ")"))
  chars[k & chars == ","] <- ";"
  paste(chars, collapse = "")
})
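To make the counter in (2) concrete, here is a small trace on a made-up string. The input x and the name depth are illustrative only and do not come from the question; depth plays the role of k above.

```r
# Hypothetical short input, used only to illustrate the depth counter.
x <- "a,b(c,d(e,f)),g"
chars <- strsplit(x, "")[[1]]

# Nesting depth after each character: +1 at "(", -1 at ")".
depth <- cumsum((chars == "(") - (chars == ")"))

# Show each character alongside its depth.
print(rbind(chars, depth))

# Commas at depth > 0 are inside brackets and get replaced.
chars[depth > 0 & chars == ","] <- ";"
result <- paste(chars, collapse = "")
print(result)
#[1] "a,b(c;d(e;f)),g"
```

A final depth other than 0 would indicate unbalanced brackets in the input, which may be worth checking before trusting the result.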

Note

The input string s is as follows:

s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"

You can use (?R), like this:

i <- gregexpr("\\(([^()]|(?R))*\\)", s, perl=TRUE)
regmatches(s, i)[[1]] <- gsub(",", ";", regmatches(s, i)[[1]])

s
#[1] "Oats (24%) (Rolled; Bran), Coconut (13%) (Coconut ; Preservative (220; 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame ; Sunflower), Margarine (Vegetable Oil; Water; Salt; Emulsifiers (471; Soy Lecithin); Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar; Vegetable Oil; Milk Solids; Cocoa Powder; Emulsifiers (Soy Lecithin; 492); Natural Flavour), Natural Flavour"

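Here (?R) recurses into the whole pattern. For example, a(?R)?z is a recursion that matches one or more letters a followed by exactly the same number of letters z.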
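As a rough illustration of how the recursion behaves, the sketch below tries a(?R)?z on a few made-up strings and then applies the bracket pattern to a short made-up input. The vectors x and y are assumptions for demonstration, not data from the question.

```r
# Illustrative inputs (not from the question).
x <- c("az", "aaazzz", "aazzz", "aaz")

# Extract the first match of the recursive pattern in each string.
m <- regmatches(x, regexpr("a(?R)?z", x, perl = TRUE))
print(m)
#[1] "az"     "aaazzz" "aazz"   "az"

# The same idea with brackets: match one whole balanced group.
y <- "f(a, g(b, c)), d"
grp <- regmatches(y, regexpr("\\(([^()]|(?R))*\\)", y, perl = TRUE))
print(grp)
#[1] "(a, g(b, c))"
```

Because the match always covers an outermost balanced group, the nested commas are picked up along with it, which is why a plain gsub on the matched text is enough.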

Data

s <- "Oats (24%) (Rolled, Bran), Coconut (13%) (Coconut , Preservative (220, 223)), Brown Sugar, Milk Solids, Golden Syrup (10%), Seeds (9%) (Sesame , Sunflower), Margarine (Vegetable Oil, Water, Salt, Emulsifiers (471, Soy Lecithin), Antioxidant (307)), Glucose, Milk Choc Compound (5%) (Sugar, Vegetable Oil, Milk Solids, Cocoa Powder, Emulsifiers (Soy Lecithin, 492), Natural Flavour), Natural Flavour"
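The regmatches code above only touches the first element of s. If s held several rows, one possible extension is to apply gsub to every element's matches with lapply. This is a sketch under that assumption; the two-row vector v is hypothetical.

```r
# Hypothetical two-row input; both rows contain bracketed groups.
v <- c("Oats (Rolled, Bran), Sugar",
       "Margarine (Oil, Emulsifiers (471, Soy)), Salt")

# Locate every outermost balanced bracket group in each element.
i <- gregexpr("\\(([^()]|(?R))*\\)", v, perl = TRUE)

# Replace commas inside the matched groups, element by element.
regmatches(v, i) <- lapply(regmatches(v, i), gsub,
                           pattern = ",", replacement = ";")
print(v)
#[1] "Oats (Rolled; Bran), Sugar"
#[2] "Margarine (Oil; Emulsifiers (471; Soy)), Salt"
```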