R extract variables with regex
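I have a character column that I need to split with a regular expression. Here is a sample of the raw data:
library(tidyverse)  # tribble(), %>%, mutate(), plus the stringr and tidyr functions used below
data_raw <- tribble(
  ~census_geo,
  "Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
  "Portugal Cove South (T), Newfoundland and Labrador",
  "Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador")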
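I want to extract three columns. The first is everything before the parentheses. The second is the word inside the parentheses. The last is everything after the final comma (equivalently, everything after the word in parentheses). Here is a sample of the clean output:
data_clean <- tribble(
  ~csd_name, ~csd_type, ~province,
  "Division No. 1, Subd. V", "SNO", "Newfoundland and Labrador",
  "Portugal Cove South", "T", "Newfoundland and Labrador",
  "Division No. 1, Subd. U, Reserve", "SNO", "Newfoundland and Labrador")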
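I can extract the second column (the word in parentheses) with this code:
data_raw %>%
  mutate(csd_type = str_extract(census_geo, pattern = "(?<=\\().*(?=\\))"))
But I can't get the other two columns.
Any help would be appreciated.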
You can use extract from tidyr and pass a regex with one capture group per output column:
tidyr::extract(data_raw, census_geo, c('csd_name', 'csd_type', 'province'),
               '(.*) \\((.*)\\),\\s*(.*)')
#   csd_name                         csd_type province
#   <chr>                            <chr>    <chr>
# 1 Division No. 1, Subd. V          SNO      Newfoundland and Labrador
# 2 Portugal Cove South              T        Newfoundland and Labrador
# 3 Division No. 1, Subd. U, Reserve SNO      Newfoundland and Labrador
You can get the same result in base R with strcapture:
strcapture('(.*) \\((.*)\\),\\s*(.*)', data_raw$census_geo,
           proto = list(csd_name = character(), csd_type = character(),
                        province = character()))
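For anyone curious what the capture groups are doing, here is a minimal base R sketch that pulls the groups out by hand with regexec and regmatches. The census_geo vector below is just the sample data from the question retyped as a plain character vector, and the names pattern, matches, parts and result are my own, not part of any package.

```r
# Sample strings copied from the question (plain vector, no tibble needed).
census_geo <- c(
  "Division No. 1, Subd. V (SNO), Newfoundland and Labrador",
  "Portugal Cove South (T), Newfoundland and Labrador",
  "Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador"
)

# Three capture groups: name, type inside the parentheses, province.
pattern <- "(.*) \\((.*)\\),\\s*(.*)"

# regexec() locates the full match and each group; regmatches() extracts them.
matches <- regmatches(census_geo, regexec(pattern, census_geo))

# Drop the full match (first element) and stack the groups row by row.
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))

result <- data.frame(
  csd_name = parts[, 1],
  csd_type = parts[, 2],
  province = parts[, 3],
  stringsAsFactors = FALSE
)
print(result)
```

Note that this sketch assumes every string matches the pattern. A string with no match would give an empty element in matches, so real data would need a check for that first.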
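I know you have already accepted Ronak Shah's answer (which is very good, by the way), but I just want to show an approach using separate from tidyr:
library(dplyr)  # %>%
library(tidyr)  # separate()
data_raw %>%
  separate(
    col = census_geo,
    into = c('csd_name', 'csd_type', 'province'),
    sep = '(\\s\\(|\\),\\s)'
  )
In the pattern, \s matches a whitespace character, \( and \) match literal parentheses, and | means "or", so the column is split wherever either of the two delimiters occurs: a space followed by an opening parenthesis, or a closing parenthesis followed by a comma and a space. Inside an R string each backslash has to be doubled.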
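To see the split on a single string without any packages, here is a small sketch using base R's strsplit with the same two alternatives. The variable names x and pieces are just for illustration.

```r
# One of the sample strings from the question.
x <- "Division No. 1, Subd. U, Reserve (SNO), Newfoundland and Labrador"

# Split on either " (" or "), " ; the delimiters themselves are dropped.
pieces <- strsplit(x, "\\s\\(|\\),\\s")[[1]]
print(pieces)
```

The commas inside the name are left alone because neither alternative matches a bare ", ".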
In case the OP is interested in how the original str_extract approach can be made to work for all three columns, here it is with the negated character classes [^)(] and [^,]:
data_raw %>%
  mutate(
    csd_name = str_extract(census_geo, "^[^)(]+(?=\\s)"),
    csd_type = str_extract(census_geo, "(?<=\\()[^)(]+(?=\\))"),
    csd_province = str_extract(census_geo, "(?<=,\\s)[^,]+$")) %>%
  select(-census_geo)
# A tibble: 3 x 3
#   csd_name                         csd_type csd_province
#   <chr>                            <chr>    <chr>
# 1 Division No. 1, Subd. V          SNO      Newfoundland and Labrador
# 2 Portugal Cove South              T        Newfoundland and Labrador
# 3 Division No. 1, Subd. U, Reserve SNO      Newfoundland and Labrador
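The same three lookaround patterns can also be tried in base R, which may help when testing them one at a time. This is only a sketch: extract_first is a hypothetical helper I made up for the example, and it relies on regexpr with perl = TRUE because lookarounds need the PCRE engine.

```r
# One of the sample strings from the question.
x <- "Portugal Cove South (T), Newfoundland and Labrador"

# Hypothetical helper: return the first match of a PCRE pattern.
# Unlike str_extract(), it returns character(0) rather than NA when nothing matches.
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  regmatches(text, m)
}

# Everything from the start up to the whitespace before the parenthesis.
csd_name <- extract_first(x, "^[^)(]+(?=\\s)")
# Whatever sits between the parentheses.
csd_type <- extract_first(x, "(?<=\\()[^)(]+(?=\\))")
# The comma-free run after the last ", " up to the end of the string.
province <- extract_first(x, "(?<=,\\s)[^,]+$")

print(c(csd_name, csd_type, province))
```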