I'm trying to use stringr, specifically regex, to cut up "MA: Bristol County (25005)"
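I'm trying to take a variable column and split it into several columns. The values follow a basic pattern, with county names of varying lengths and formats.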
State-county :
[1] "MA: Bristol County (25005)"
[2] "LA: St. Tammany Parish (22103)"
[3] "CA: Ventura County (06111)"
[4] "CA: San Mateo County (06081)"
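I need state, county name, and county code columns that I can add back to the data.frame. I've been trying to figure out how to use str_extract to get this done.
Ideally this is what I'd end up with, but I'll take any help I can get.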
state: county: county code:
[1] "MA" Bristol County 25005
[2] "LA" St. Tammany Parish 22103
[3] "CA" Ventura County 06111
[4] "CA" San Mateo County 06081
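I was able to use some code I found, str_extract_all( "(?<=\\().+?(?=\\))"), for the county code (thanks Nettle). The closest I could get for the state abbreviation is
str_extract_all( h,"..:")
which is close but includes the ":".
I also tried: str_extract_all( "(?<=\:")
Sorry if this isn't the best formatting; I tried my best to be clear in the style I've seen.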
Using str_match_all:
str_match_all(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")
as_tibble(df) %>%
  mutate(matches = str_match_all(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")) %>%
  unnest_wider(matches) %>%
  select(-2) %>%
  set_names("State_county", "State", "County", "ZIP")
# A tibble: 4 x 4
State_county State County ZIP
<fct> <chr> <chr> <chr>
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
### OR with str_match as we're only using a single pattern
## this saves us from the warning caused by unnest_wider
as_tibble(df) %>%
  mutate(matches = str_match(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)"),
         State = matches[, 2], County = matches[, 3], ZIP = matches[, 4],
         matches = NULL)
# A tibble: 4 x 4
State_county State County ZIP
<fct> <chr> <chr> <chr>
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
### Another way
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)") %>%
  as.data.frame %>% set_names("State_county", "State", "County", "County_code")
State_county State County County_code
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
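Explanation:
str_match returns the captured groups (the subpatterns written in unescaped parentheses, such as ([A-Z]+)) together with the full string matched by the whole pattern.
[A-Z]+ : matches the state abbreviation.
[^()]+ : matches any run of characters that are not parentheses. This is the county.
\\((\\d+)\\) : matches a literal opening parenthesis \\( and a closing one \\), using a group to pull out the digits between them. This is the county code.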
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")
[,1] [,2] [,3] [,4]
[1,] "MA: Bristol County (25005)" "MA" "Bristol County" "25005"
[2,] "LA: St. Tammany Parish (22103)" "LA" "St. Tammany Parish" "22103"
[3,] "CA: Ventura County (06111)" "CA" "Ventura County" "06111"
[4,] "CA: San Mateo County (06081)" "CA" "San Mateo County" "06081"
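To show how the matrix above can be attached back to the original data, here is a minimal sketch. It assumes the column is called State_county, as in this answer, and the new column names are only illustrative. Backslashes are doubled because the pattern sits inside an R string literal.

```r
library(stringr)

# Example data; the column name State_county follows the answer above.
df <- data.frame(
  State_county = c("MA: Bristol County (25005)",
                   "LA: St. Tammany Parish (22103)",
                   "CA: Ventura County (06111)",
                   "CA: San Mateo County (06081)"),
  stringsAsFactors = FALSE
)

# Three capture groups: state, county name, county code.
pattern <- "([A-Z]+): ([^()]+) \\((\\d+)\\)"

# Column 1 of the result is the full match; columns 2 to 4 are the groups.
m <- str_match(df$State_county, pattern)

# Add the groups back as character columns (this keeps leading zeros).
df$State       <- m[, 2]
df$County      <- m[, 3]
df$County_code <- m[, 4]
```

Rows that do not match the pattern get NA in all three new columns, which makes malformed values easy to spot.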
You can use extract from tidyr to put the data into separate columns, specifying the regex used to split the data.
tidyr::extract(df, col,
               c('state', 'county', 'county_code'),
               '(\\w+):\\s*(.*) \\((\\d+)\\)')
# state county county_code
#1 MA Bristol County 25005
#2 LA St. Tammany Parish 22103
#3 CA Ventura County 06111
#4 CA San Mateo County 06081
We use 3 capture groups to extract the data from the col column.
Data
df <- structure(list(col = c("MA: Bristol County (25005)",
"LA: St. Tammany Parish (22103)",
"CA: Ventura County (06111)", "CA: San Mateo County (06081)")),
class = "data.frame", row.names = c(NA, -4L))
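As a variation on the extract() call in this answer, the sketch below keeps the original column next to the new ones by passing remove = FALSE. The data frame and column names are the ones used above; the variable name result is only illustrative. The default convert = FALSE is left alone on purpose, since converting the county code to a number would drop leading zeros such as the one in 06111.

```r
library(tidyr)

df <- data.frame(
  col = c("MA: Bristol County (25005)",
          "LA: St. Tammany Parish (22103)",
          "CA: Ventura County (06111)",
          "CA: San Mateo County (06081)"),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps `col`; the three groups become new character columns.
result <- tidyr::extract(
  df, col,
  into   = c("state", "county", "county_code"),
  regex  = "(\\w+):\\s*(.*) \\((\\d+)\\)",
  remove = FALSE
)
```

Calling it as tidyr::extract() avoids a clash with magrittr::extract() if both packages are attached.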
Here is a purely base R approach that uses strsplit to separate the three components:
output <- apply(df, 1, function(x) { strsplit(x, "(?:: | \\(|\\))", perl = TRUE) })
output <- unlist(output, recursive=FALSE)
names(output) <- c(1:length(output))
df <- as.data.frame(do.call(rbind, output))
names(df) <- c("state", "county", "zip")
df
state county zip
1 MA Bristol County 25005
2 LA St. Tammany Parish 22103
Data:
df <- data.frame(state=c("MA: Bristol County (25005)",
"LA: St. Tammany Parish (22103)"),
stringsAsFactors=FALSE)
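A different base R route, not the one used in this answer, is utils::strcapture(), which applies a regex with capture groups and returns a data frame directly. This is a minimal sketch that assumes R 3.4.0 or later (where strcapture() is available); the proto column names are illustrative.

```r
df <- data.frame(state = c("MA: Bristol County (25005)",
                           "LA: St. Tammany Parish (22103)"),
                 stringsAsFactors = FALSE)

# proto gives the name and type of each capture group. Keeping
# county_code as character preserves any leading zeros.
proto <- data.frame(state = character(),
                    county = character(),
                    county_code = character(),
                    stringsAsFactors = FALSE)

# One capture group per output column: state, county name, county code.
parts <- strcapture("^([A-Z]+): (.*) \\(([0-9]+)\\)$", df$state, proto)
```

Values that do not match the pattern come back as a row of NA, so no separate length check is needed before binding the result to other columns.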