Gsub command to replace all spaces with a comma and space (", "), except after certain words, in R

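I have a data.frame with a column in which each cell contains California counties separated by spaces. I would like to add a comma and a space after each county, but I can't simply gsub every space into a comma and space (i.e. gsub("\\s", ", ", text)), because some California counties have two-word names (e.g. Los Angeles, San Francisco, etc.).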

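Fortunately, the two-word counties share common first words, so I would like to write a gsub that keeps the space inside those county names without adding a comma. I have attached sample data and what I would like the final result to look like. For example, for this data, I want to add a comma and space after every word except "El", "San", and "Del".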

Sample data:

c("Lassen Modoc Nevada Plumas Shasta Sierra Siskiyou Butte Placer", 
"Del Norte Humboldt Trinity Mendocino Sonoma", "Glenn Sutter Tehama Yuba Butte Colusa", 
"Lake Napa Yolo Colusa Sonoma Solano", "Madera Amador Tuolumne Calaveras Mariposa Mono Alpine El Dorado Placer", 
"El Dorado Placer Sacramento", "Sacramento Yolo", "Sacramento", 
"Sacramento San Joaquin", "Marin Sonoma")

Desired output:

c("Lassen, Modoc, Nevada, Plumas, Shasta, Sierra, Siskiyou, Butte, Placer", 
  "Del Norte, Humboldt, Trinity, Mendocino, Sonoma", "Glenn, Sutter, Tehama, Yuba, Butte, Colusa", 
  "Lake, Napa, Yolo, Colusa, Sonoma, Solano", "Madera, Amador, Tuolumne, Calaveras, Mariposa, Mono, Alpine, El Dorado, Placer", 
  "El Dorado, Placer, Sacramento", "Sacramento, Yolo", "Sacramento", 
  "Sacramento, San Joaquin", "Marin, Sonoma")

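Given that you know you are only looking for California counties, one "easy" approach is to replace only the spaces that follow a California county name. To get that regular expression, I just pasted the CA county names together with | and added a trailing space. gsub then replaces any county name followed by a space with the same county name (the backreference \1, written "\\1" in an R string), a comma, and a space.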

ca_regex <- "(Los Angeles|San Diego|Orange|Riverside|San Bernardino|Santa Clara|Alameda|Sacramento|Contra Costa|Fresno|Kern|San Francisco|Ventura|San Mateo|San Joaquin|Stanislaus|Sonoma|Tulare|Santa Barbara|Solano|Monterey|Placer|San Luis Obispo|Santa Cruz|Merced|Marin|Butte|Yolo|El Dorado|Imperial|Shasta|Madera|Kings|Napa|Humboldt|Nevada|Sutter|Mendocino|Yuba|Lake|Tehama|San Benito|Tuolumne|Calaveras|Siskiyou|Amador|Lassen|Glenn|Del Norte|Colusa|Plumas|Inyo|Mariposa|Mono|Trinity|Modoc|Sierra|Alpine) "

input <- c(
  "Lassen Modoc Nevada Plumas Shasta Sierra Siskiyou Butte Placer",
  "Del Norte Humboldt Trinity Mendocino Sonoma", "Glenn Sutter Tehama Yuba Butte Colusa",
  "Lake Napa Yolo Colusa Sonoma Solano", "Madera Amador Tuolumne Calaveras Mariposa Mono Alpine El Dorado Placer",
  "El Dorado Placer Sacramento", "Sacramento Yolo", "Sacramento",
  "Sacramento San Joaquin", "Marin Sonoma"
)

# "\\1" is the backreference to the captured county name
gsub(ca_regex, "\\1, ", input)
#>  [1] "Lassen, Modoc, Nevada, Plumas, Shasta, Sierra, Siskiyou, Butte, Placer"        
#>  [2] "Del Norte, Humboldt, Trinity, Mendocino, Sonoma"                               
#>  [3] "Glenn, Sutter, Tehama, Yuba, Butte, Colusa"                                    
#>  [4] "Lake, Napa, Yolo, Colusa, Sonoma, Solano"                                      
#>  [5] "Madera, Amador, Tuolumne, Calaveras, Mariposa, Mono, Alpine, El Dorado, Placer"
#>  [6] "El Dorado, Placer, Sacramento"                                                 
#>  [7] "Sacramento, Yolo"                                                              
#>  [8] "Sacramento"                                                                    
#>  [9] "Sacramento, San Joaquin"                                                       
#> [10] "Marin, Sonoma"
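
Rather than typing the alternation by hand, the pattern can be assembled from a character vector of names. The following is only a minimal sketch: county_names, county_regex, and sample_input are hypothetical names introduced here, and the vector holds just a few counties for illustration, where a real run would list every California county.

```r
# Hypothetical vector with a handful of county names (illustration only).
county_names <- c("Del Norte", "El Dorado", "San Joaquin", "Sacramento",
                  "Placer", "Sonoma", "Humboldt")

# Join the names with "|" inside a capture group, then add a trailing space.
county_regex <- paste0("(", paste(county_names, collapse = "|"), ") ")

# Hypothetical input that uses only the counties listed above.
sample_input <- c("El Dorado Placer Sacramento",
                  "Sacramento San Joaquin",
                  "Del Norte Humboldt Sonoma")

# Replace each "county + space" with "county, ".
result <- gsub(county_regex, "\\1, ", sample_input)
result
#> [1] "El Dorado, Placer, Sacramento" "Sacramento, San Joaquin"
#> [3] "Del Norte, Humboldt, Sonoma"
```

This assumes the names contain no regex metacharacters, which holds for plain county names like these.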

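If you prefer, you can use a negative lookbehind ((?<!...)) to get a shorter pattern; this requires perl = TRUE. This one takes only the prefix words from the list of CA counties and replaces only the spaces that do not come right after one of them. However, I think it is a bit harder to reason about, and also harder to come up with (for example, when I first posted I forgot about San Luis Obispo County, which contains two spaces).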

gsub("(?<!Los|San|Santa|Contra|El|Del|Luis) ", ", ", input, perl = TRUE)
#>  [1] "Lassen, Modoc, Nevada, Plumas, Shasta, Sierra, Siskiyou, Butte, Placer"        
#>  [2] "Del Norte, Humboldt, Trinity, Mendocino, Sonoma"                               
#>  [3] "Glenn, Sutter, Tehama, Yuba, Butte, Colusa"                                    
#>  [4] "Lake, Napa, Yolo, Colusa, Sonoma, Solano"                                      
#>  [5] "Madera, Amador, Tuolumne, Calaveras, Mariposa, Mono, Alpine, El Dorado, Placer"
#>  [6] "El Dorado, Placer, Sacramento"                                                 
#>  [7] "Sacramento, Yolo"                                                              
#>  [8] "Sacramento"                                                                    
#>  [9] "Sacramento, San Joaquin"                                                       
#> [10] "Marin, Sonoma"
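
To see why "Luis" has to be in the lookbehind list, here is a small sketch on a made-up input with a three-word county name. The variable names three_word, without_luis, and with_luis are hypothetical and not part of the original answer.

```r
# Hypothetical input containing a three-word county name.
three_word <- "Kern San Luis Obispo Monterey"

# Without "Luis" in the lookbehind, the space after "Luis" gets a comma.
without_luis <- gsub("(?<!Los|San|Santa|Contra|El|Del) ", ", ",
                     three_word, perl = TRUE)

# With "Luis" included, the three-word name stays together.
with_luis <- gsub("(?<!Los|San|Santa|Contra|El|Del|Luis) ", ", ",
                  three_word, perl = TRUE)

without_luis
#> [1] "Kern, San Luis, Obispo, Monterey"
with_luis
#> [1] "Kern, San Luis Obispo, Monterey"
```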
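
For this sample data, another option is to add the comma only after words of four or more characters:

# vec is the sample character vector from the question
gsub("(\\w{4,}) ", "\\1, ", vec)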

 [1] "Lassen, Modoc, Nevada, Plumas, Shasta, Sierra, Siskiyou, Butte, Placer"        
 [2] "Del Norte, Humboldt, Trinity, Mendocino, Sonoma"                               
 [3] "Glenn, Sutter, Tehama, Yuba, Butte, Colusa"                                    
 [4] "Lake, Napa, Yolo, Colusa, Sonoma, Solano"                                      
 [5] "Madera, Amador, Tuolumne, Calaveras, Mariposa, Mono, Alpine, El Dorado, Placer"
 [6] "El Dorado, Placer, Sacramento"                                                 
 [7] "Sacramento, Yolo"                                                              
 [8] "Sacramento"                                                                    
 [9] "Sacramento, San Joaquin"                                                       
[10] "Marin, Sonoma"
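
The length-based pattern works on the sample data because every first word there ("El", "San", "Del") is shorter than four characters. As a hedged illustration of where it would stop working, the sketch below uses a made-up input (longer_prefix is a hypothetical name) with first words such as "Santa" and "Contra":

```r
# Hypothetical input with two-word counties whose first word has 4+ letters.
longer_prefix <- "Alameda Santa Clara Contra Costa Kern"

# Same length-based pattern: a comma after every word of 4 or more characters.
split_result <- gsub("(\\w{4,}) ", "\\1, ", longer_prefix)
split_result
#> [1] "Alameda, Santa, Clara, Contra, Costa, Kern"
```

Here "Santa Clara" and "Contra Costa" are split apart, so for the full set of California counties one of the name-based patterns is the safer choice.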