R: Regex in strsplit (finding ", " followed by capital letter)

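Say I have a vector containing some character strings that I would like to split based on a regular expression.

More precisely, I want to split each string at a comma, followed by a space, followed by a capital letter (to my understanding, the regex looks like this: /(, [A-Z])/g, which works fine when I try it here).

When I try to do this in R, the regex does not seem to work, for example: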

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

No splits are found. What am I doing wrong here?

Any help is greatly appreciated!

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl=TRUE)


Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

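The regex ", (?=[A-Z])" contains a lookahead, (?=[A-Z]), that checks for an uppercase letter but does not consume it, so the letter stays at the start of the next piece. In R, you need perl=TRUE for any regex that contains lookarounds. Note also that R takes the pattern as a plain string: in "/(, [A-Z])/g" the slashes and the trailing g are matched literally, which is why the original call found nothing to split on. No global flag is needed, since strsplit already splits at every match.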
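To make the "checks but does not consume" point concrete, here is a small sketch comparing a consuming pattern with the lookahead version. The input string s is a hypothetical example, not taken from the question, and perl = TRUE is used in both calls only to keep the comparison like-for-like.

```r
# Hypothetical one-element input, chosen only for illustration.
s <- "Loans, Assets, other items"

# Consuming pattern: the capital letter is part of the delimiter,
# so it is removed from the result.
consumed <- strsplit(s, ", [A-Z]", perl = TRUE)[[1]]
print(consumed)   # "Loans" "ssets, other items"

# Lookahead: only ", " is the delimiter; the capital letter is kept.
kept <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]
print(kept)       # "Loans" "Assets, other items"
```

In both cases ", other" is left alone because the character after the space is lowercase.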

If the space is optional, or there can be two or more spaces between the comma and the capital letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl=TRUE)
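A quick check of the optional-whitespace variant, as a sketch on a made-up string y (not from the question). Note that inside an R string literal the backslash has to be doubled, so the regex \s is written "\\s".

```r
# Hypothetical input: no space, two spaces, and a comma before a lowercase word.
y <- "Loans,Assets,  Equity, other"

# ",\\s*" matches the comma plus any run of whitespace (including none);
# the lookahead still requires an uppercase letter right after it.
parts <- strsplit(y, ",\\s*(?=[A-Z])", perl = TRUE)[[1]]
print(parts)   # "Loans" "Assets" "Equity, other"
```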

There is also a variant that supports Unicode uppercase letters (\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl=TRUE)
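The difference from [A-Z] only shows up with non-ASCII capitals. The sketch below uses a hypothetical string z written with \u escapes so that it is UTF-8 regardless of the file encoding; it assumes R's PCRE library was built with Unicode property support, which is the usual case.

```r
# Hypothetical input: "Größe, Ärzte, Österreich, über" written with \u escapes.
z <- "Gr\u00f6\u00dfe, \u00c4rzte, \u00d6sterreich, \u00fcber"

# [A-Z] covers ASCII capitals only, so no split point is found here.
ascii_only <- strsplit(z, ", (?=[A-Z])", perl = TRUE)[[1]]

# \p{Lu} matches any Unicode uppercase letter, so the string is split
# before the A-umlaut and the O-umlaut, but not before the lowercase u-umlaut.
unicode <- strsplit(z, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_only))
print(length(unicode))
```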