Inserting hyphen or en dash in a string vector depending on location of specific elements

Given vecA:

vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254", 
          "Population 78 92")

Problem

I would like to derive vecB, corresponding to:

vecB <- c("Population 12 - 22",
          "Population 90 over",
          "population under 78",
          "population 99 - 101",
          "Population 12 - 54", 
          "Population 78 - 92")

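Key characteristics

As the desired output shows, vecB has the following characteristics: a hyphen surrounded by single spaces separates the two numbers, and a single space separates the word "over"/"under" from the number.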


Attempt

I was thinking of using capture groups in gsub, along these lines:

gsub("^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$", "\\2", vecA)

But this does not work in all cases:

> t(t(gsub("^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$", "\\2", vecA)))
     [,1]                
[1,] "12"                
[2,] "90"                
[3,] "population under78"
[4,] "99"                
[5,] "12"                
[6,] "78" 

t() is used for display purposes only; regex101 link.
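One way to see why the third element comes back unchanged is to check which elements the pattern matches at all, since gsub returns non-matching strings untouched. The following is a minimal sketch that reuses vecA and the pattern from the attempt; the names pat and matched are only illustrative.

```r
vecA <- c("Population 1222",
          "Population 90over",
          "population under78",
          "population 99101",
          "Population 1254",
          "Population 78 92")

# Pattern from the attempt: letters, one blank, then exactly two digits
pat <- "^([[:alpha:]]*[[:blank:]])(\\d{2})(.*)$"

# TRUE where the whole pattern matches, FALSE otherwise
matched <- grepl(pat, vecA)
print(matched)

# Elements that do not match are returned as-is by gsub()
print(vecA[!matched])
```

In "population under78" the blank is followed by letters rather than by two digits, so the pattern fails for that element and no replacement takes place.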

Here is my suggestion, done in two steps: 1) first add a hyphen between the numbers, then 2) add a space between the word "over"/"under" and the number:

vecA <- c("Population 1222",
           "Population 90over",
           "population under78",
           "population 99101",
           "Population 1254", 
           "Population 78 92")
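# Step 1: insert " - " between the two-digit number and the number that follows
v <- gsub("^([[:alpha:]]+[[:blank:]]+)([[:digit:]]{2})\\s*([[:digit:]])", "\\1\\2 - \\3", vecA)
# Step 2: insert a space between over/under and the number (branch reset needs PCRE)
gsub("^([[:alpha:]]+[[:blank:]]+)(?|(over|under)(\\d+)|(\\d+)(over|under))", "\\1\\2 \\3", v, perl=TRUE)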

Output of the code demo:

[1] "Population 12 - 22"  "Population 90 over"  "population under 78"
[4] "population 99 - 101" "Population 12 - 54"  "Population 78 - 92"

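The second regex contains a branch reset pattern (?|...|...) to keep the same group IDs within the alternative subpatterns, so both "under78" and "90over" can be rewritten with the same replacement string. Branch reset is a PCRE feature, which is why the call sets the perl argument to TRUE.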
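The two steps can also be wrapped into a small helper and checked against the target vecB from the question. This is only a sketch: the function name add_separators and the variable result are hypothetical, and the patterns are the ones used in the two gsub calls above.

```r
# Hypothetical helper combining both substitution steps
add_separators <- function(x) {
  # Step 1: hyphen between a two-digit number and the number that follows
  x <- gsub("^([[:alpha:]]+[[:blank:]]+)([[:digit:]]{2})\\s*([[:digit:]])",
            "\\1\\2 - \\3", x)
  # Step 2: space between over/under and the number;
  # the branch reset group (?|...) requires perl = TRUE
  gsub("^([[:alpha:]]+[[:blank:]]+)(?|(over|under)(\\d+)|(\\d+)(over|under))",
       "\\1\\2 \\3", x, perl = TRUE)
}

vecA <- c("Population 1222", "Population 90over", "population under78",
          "population 99101", "Population 1254", "Population 78 92")
vecB <- c("Population 12 - 22", "Population 90 over", "population under 78",
          "population 99 - 101", "Population 12 - 54", "Population 78 - 92")

result <- add_separators(vecA)
print(result)

# Compare with the desired output from the question
print(identical(result, vecB))
```

Keeping the steps in this order matters little for the sample data, because the first pattern only fires when a digit follows the two-digit number and the second only when over/under is adjacent to the digits.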