R extract variables with regex

I have a character column that I need to split using regular expressions. Here is a sample of the raw data:

library(tidyverse)

data_raw <- tribble(
  ~census_geo,
  "Division No.  1, Subd. V (SNO), Newfoundland and Labrador",
  "Portugal Cove South (T), Newfoundland and Labrador",
  "Division No.  1, Subd. U, Reserve (SNO), Newfoundland and Labrador")

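I want to extract three columns. The first is everything before the parentheses. The second is the word inside the parentheses. The last is everything after the last comma (equivalently, everything after the word in parentheses). Here is a sample of the clean output: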

data_clean <- tribble(
  ~csd_name, ~csd_type, ~province,
  "Division No.  1, Subd. V", "SNO", "Newfoundland and Labrador", 
  "Portugal Cove South", "T", "Newfoundland and Labrador", 
  "Division No.  1, Subd. U, Reserve", "SNO", "Newfoundland and Labrador")

I can extract the word inside the parentheses with this code:

data_raw %>% 
  mutate(csd_type = str_extract(census_geo, pattern = "(?<=\\().*(?=\\))"))

But I can't get the other two columns.

Any help would be appreciated.

You can use extract from tidyr and pass a regex with capture groups to pull the relevant text into separate columns.

tidyr::extract(data_raw, census_geo, c('csd_name', 'csd_type', 'province'), 
              '(.*) \\((.*)\\),\\s*(.*)')

#  csd_name                          csd_type province                 
#  <chr>                             <chr>    <chr>                    
#1 Division No.  1, Subd. V          SNO      Newfoundland and Labrador
#2 Portugal Cove South               T        Newfoundland and Labrador
#3 Division No.  1, Subd. U, Reserve SNO      Newfoundland and Labrador

You can get the same result in base R using strcapture:

strcapture('(.*) \\((.*)\\),\\s*(.*)', data_raw$census_geo, 
   proto = list(csd_name = character(), csd_type = character(), 
                province = character()))
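
To see what each capture group in that pattern picks up, here is a minimal base R sketch on a single string from the question. The names x and m are illustrative only, and regexec is used just to expose the groups; it is not part of the answer above.

```r
# One row from the sample data, chosen because the name itself contains commas
x <- "Division No.  1, Subd. U, Reserve (SNO), Newfoundland and Labrador"

# regexec returns the full match followed by one entry per capture group
m <- regmatches(x, regexec("(.*) \\((.*)\\),\\s*(.*)", x))[[1]]

# Drop the full match and keep the three groups
groups <- m[-1]
print(groups)
```

The first group is greedy, but it still has to stop before the only " (" in the string, so commas inside the name do not cause a problem.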

I know you've already accepted Ronak Shah's answer (which is very nice, by the way), but I just wanted to show an approach using separate from tidyr:

library(tidyr)

data_raw %>% 
  separate(
    col = census_geo, 
    into = c('csd_name', 'csd_type', 'province'),
    sep = '(\\s\\(|\\),\\s)'
  )

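Here \\s matches a whitespace character, \\( and \\) match literal parentheses, and | is alternation, so the separator matches either of the two patterns (backslashes are doubled because the pattern is written as an R string).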
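
As a rough illustration of how that separator behaves, the sketch below applies the same pattern to one sample string with base R's strsplit. The variable names x and parts are made up for this example.

```r
# One sample string from the question
x <- "Portugal Cove South (T), Newfoundland and Labrador"

# The separator matches either " (" or "), ", so the string is cut in two places
parts <- strsplit(x, "(\\s\\(|\\),\\s)", perl = TRUE)[[1]]
print(parts)
```

Each string contains exactly one " (" and one "), ", which is why separate can fill exactly three columns.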

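In case the OP is interested in how the original str_extract approach can work for all three columns, here it is using the negated character classes [^)(] and [^,]: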

data_raw %>% 
  mutate(
    csd_name = str_extract(census_geo, "^[^)(]+(?=\\s)"),
    csd_type = str_extract(census_geo, "(?<=\\()[^)(]+(?=\\))"),
    csd_province = str_extract(census_geo, "(?<=,\\s)[^,]+$")) %>%
  select(-census_geo)
# A tibble: 3 x 3
#   csd_name                          csd_type csd_province             
#   <chr>                             <chr>    <chr>                    
# 1 Division No.  1, Subd. V          SNO      Newfoundland and Labrador
# 2 Portugal Cove South               T        Newfoundland and Labrador
# 3 Division No.  1, Subd. U, Reserve SNO      Newfoundland and Labrador
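
For anyone without stringr at hand, the same three lookaround patterns can be tried in base R. This is only a sketch: first_match is a hypothetical helper defined here, not a function from any package, and unlike str_extract it returns character(0) rather than NA when nothing matches.

```r
x <- "Division No.  1, Subd. V (SNO), Newfoundland and Labrador"

# Hypothetical helper: first match of a PCRE pattern (lookarounds need perl = TRUE)
first_match <- function(pattern, text) {
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

# Everything from the start up to the whitespace before "("
csd_name <- first_match("^[^)(]+(?=\\s)", x)
# Non-parenthesis characters enclosed by "(" and ")"
csd_type <- first_match("(?<=\\()[^)(]+(?=\\))", x)
# Comma-free run at the end of the string, preceded by ", "
province <- first_match("(?<=,\\s)[^,]+$", x)

print(c(csd_name, csd_type, province))
```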