I'm trying to use stringr, specifically regex, to cut up "MA: Bristol County (25005)"

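I'm trying to take a column of values and split it into several columns. The values follow a basic pattern, but the county names vary in length and format.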

State-county :
[1] "MA: Bristol County (25005)"
[2] "LA: St. Tammany Parish (22103)"
[3] "CA: Ventura County (06111)"    
[4] "CA: San Mateo County (06081)" 

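I need state, county name, and county code columns that I can add back to the data.frame. I've been trying to figure out how to use str_extract to get this done. Ideally, this is what I'd end up with, but I'll take any help I can get.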

  state:    county:            county code: 
[1] "MA"   Bristol County       25005
[2] "LA"   St. Tammany Parish   22103
[3] "CA"   Ventura County       06111    
[4] "CA"   San Mateo County     06081

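I was able to get the county code with some code I found, str_extract_all("(?<=\\().+?(?=\\))") (thanks, Nettle). The closest I've come for the state abbreviation is str_extract_all(h, "..:"), which is close but includes the ":". I also tried str_extract_all("(?<=\\:").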

Sorry if this isn't the best formatting; I've done my best to make it clear in the style I've seen used here.

Using str_match_all:

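library(tidyverse)  # stringr, dplyr, tidyr and purrr are all used below

# Backslashes must be doubled inside an R string
str_match_all(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")

as_tibble(df) %>%
  mutate(matches = str_match_all(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")) %>%
  unnest_wider(matches) %>%
  select(-2) %>%  # drop the column holding the full match
  set_names("State_county", "State", "County", "ZIP")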
# A tibble: 4 x 4
  State_county                   State County             ZIP  
  <fct>                          <chr> <chr>              <chr>
1 MA: Bristol County (25005)     MA    Bristol County     25005
2 LA: St. Tammany Parish (22103) LA    St. Tammany Parish 22103
3 CA: Ventura County (06111)     CA    Ventura County     06111
4 CA: San Mateo County (06081)   CA    San Mateo County   06081

### OR with str_match as we're only using a single pattern
## this saves us from the warning caused by unnest_wider
as_tibble(df) %>%
  mutate(matches = str_match(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)"),
         State   = matches[, 2],
         County  = matches[, 3],
         ZIP     = matches[, 4],
         matches = NULL)
# A tibble: 4 x 4
  State_county                   State County             ZIP  
  <fct>                          <chr> <chr>              <chr>
1 MA: Bristol County (25005)     MA    Bristol County     25005
2 LA: St. Tammany Parish (22103) LA    St. Tammany Parish 22103
3 CA: Ventura County (06111)     CA    Ventura County     06111
4 CA: San Mateo County (06081)   CA    San Mateo County   06081

### Another way
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)") %>%
  as.data.frame %>%
  set_names("State_county", "State", "County", "County_code")
                    State_county State             County County_code
1     MA: Bristol County (25005)    MA     Bristol County       25005
2 LA: St. Tammany Parish (22103)    LA St. Tammany Parish       22103
3     CA: Ventura County (06111)    CA     Ventura County       06111
4   CA: San Mateo County (06081)    CA   San Mateo County       06081

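Explanation:

str_match returns a matrix: the first column holds the full string matched by the whole pattern, and each remaining column holds one captured group (a subpattern written in unescaped parentheses, such as ([A-Z]+)).

  • [A-Z]+ : matches the state abbreviation.
  • [^()]+ : matches any run of characters that are not parentheses. This is the county.
  • \((\d+)\) : matches a literal opening parenthesis \( and a closing one \), with the group in between capturing the digits. This is the county code. Inside an R string each backslash is doubled, hence \\( and \\d.

str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")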
     [,1]                             [,2] [,3]                 [,4]   
[1,] "MA: Bristol County (25005)"     "MA" "Bristol County"     "25005"
[2,] "LA: St. Tammany Parish (22103)" "LA" "St. Tammany Parish" "22103"
[3,] "CA: Ventura County (06111)"     "CA" "Ventura County"     "06111"
[4,] "CA: San Mateo County (06081)"   "CA" "San Mateo County"   "06081"
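
Since the question asks about str_extract specifically, here is a minimal sketch of that route using lookarounds, which assert the surrounding text without including it in the match. The vector x and the variable names are only for illustration, and the sketch assumes every value follows the "ST: Name (digits)" layout shown in the question.

```r
library(stringr)

# Illustrative input, copied from the question
x <- c("MA: Bristol County (25005)",
       "LA: St. Tammany Parish (22103)",
       "CA: Ventura County (06111)",
       "CA: San Mateo County (06081)")

# (?=...) is a lookahead and (?<=...) a lookbehind; neither is part of the match
state  <- str_extract(x, "^[A-Z]+(?=:)")         # capitals just before the colon
county <- str_extract(x, "(?<=: ).+?(?= \\()")   # text between ": " and " ("
code   <- str_extract(x, "(?<=\\()\\d+(?=\\))")  # digits inside the parentheses

# The code stays character, so a leading zero such as "06111" is preserved
result <- data.frame(state, county, code, stringsAsFactors = FALSE)
result
```

This needs three passes over the strings instead of one, which is why the single-pattern str_match version above is probably the tidier choice.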

You can use tidyr::extract to put the data into separate columns, specifying the regex used to divide the data.

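# Backslashes are doubled inside the R string; the lazy (.*?) followed by \\s*
# keeps the space before "(" out of the county name
tidyr::extract(df, col, 
               c('state', 'county', 'county_code'), 
               '(\\w+):\\s*(.*?)\\s*\\((\\d+)\\)')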

#  state             county  county_code
#1    MA     Bristol County        25005
#2    LA St. Tammany Parish        22103
#3    CA     Ventura County        06111
#4    CA   San Mateo County        06081

We use 3 capture groups to extract the data from the col column.

Data

df <- structure(list(col = c("MA: Bristol County (25005)", 
                "LA: St. Tammany Parish (22103)", 
"CA: Ventura County (06111)", "CA: San Mateo County (06081)")), 
 class = "data.frame", row.names = c(NA, -4L))
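
If you want to keep the original column next to the new ones, extract has a remove argument for that. Below is a small sketch on a two-row stand-in for the data above (df2 and out are made-up names). convert is left at FALSE, its default, so the codes stay character and "06111" keeps its leading zero.

```r
library(tidyr)

# Two-row stand-in for the data above (df2 is a hypothetical name)
df2 <- data.frame(col = c("MA: Bristol County (25005)",
                          "CA: Ventura County (06111)"),
                  stringsAsFactors = FALSE)

out <- tidyr::extract(df2, col,
                      into    = c("state", "county", "county_code"),
                      regex   = "(\\w+):\\s*(.*?)\\s*\\((\\d+)\\)",
                      remove  = FALSE,  # keep the original column
                      convert = FALSE)  # no type conversion, codes stay character
out
```

Setting convert = TRUE would turn the codes into integers and drop the leading zero, so it is best avoided for codes like these.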

Here is an entirely base R approach that uses strsplit to separate the three components:

output <- apply(df, 1, function(x) { strsplit(x, "(?:: | \\(|\\))")})
output <- unlist(output, recursive=FALSE)
names(output) <- c(1:length(output))
df <- as.data.frame(do.call(rbind, output))
names(df) <- c("state", "county", "zip")
df

  state             county   zip
1    MA     Bristol County 25005
2    LA St. Tammany Parish 22103

Data:

df <- data.frame(state=c("MA: Bristol County (25005)",
                         "LA: St. Tammany Parish (22103)"),
                 stringsAsFactors=FALSE)
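
Another base R option, offered as a sketch of a different technique rather than a variation on strsplit, is regexec with regmatches. It works much like str_match: each element holds the full match followed by the captured groups. The names x, m, mat and out are illustrative, and the sketch assumes every string matches the pattern (a non-matching string yields an empty element, which rbind would silently drop).

```r
# Illustrative input, same two rows as the data above
x <- c("MA: Bristol County (25005)",
       "LA: St. Tammany Parish (22103)")

# regexec finds the full match and the capture groups;
# regmatches pulls out the matching substrings as a list of character vectors
m <- regmatches(x, regexec("^([A-Z]+): (.+) \\(([0-9]+)\\)$", x))

# Stack the list into a matrix; column 1 is the full match
mat <- do.call(rbind, m)

out <- data.frame(state       = mat[, 2],
                  county      = mat[, 3],
                  county_code = mat[, 4],
                  stringsAsFactors = FALSE)
out
```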