R: extract value and insert in 3 existing columns

I have a large dataset like the one below, and I am trying to fill 3 columns with values taken from the Country column.

Country<-c("Asia","Africa - Benin (Cotonou)",
           "Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
           , "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")

ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)


> mydata
  ID                          Country         col1     col2     col3
1  1                             Asia
2  2         Africa - Benin (Cotonou)
3  3          Europe - France (Paris)
4  4           Asia - China(Shanghai)
5  5 Europe - United Kingdom (London)
6  6        Europe - France (Orléans)
7  7            Afrique - Togo (Lomé)
8  8        Afrique - Sénégal (Dakar)
9  9 Asia - Pakistan (Rahim Yar Khan)

I tried the following, but there is a problem with the regular expression:

library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
     

The result I get is:

ID  Country                           Col1     Col2      Col3
 1  Asia                              Asia     <NA>      <NA>
 2  Africa - Benin (Cotonou)          Africa   Benin     Cotonou
 3  Europe - France (Paris)           Europe   France    Paris
 4  Asia - China(Shanghai)            Asia     China     Shanghai
 5  Europe - United Kingdom (London)  Europe   United    Kingdom
 6  Europe - France (Orléans)         Europe   France    Orl
 7  Afrique - Togo (Lomé)             Afrique  Togo      L
 8  Afrique - Sénégal (Dakar)         Afrique  S         n
 9  Asia - Pakistan (Rahim Yar Khan)  Asia     Pakistan  Rahim

Parts of the values are missing in column 3 for rows 5, 6, 7, 8 and 9.

The result I want is:

ID  Country                           Col1     Col2            Col3
 1  Asia                              Asia     <NA>            <NA>
 2  Africa - Benin (Cotonou)          Africa   Benin           Cotonou
 3  Europe - France (Paris)           Europe   France          Paris
 4  Asia - China(Shanghai)            Asia     China           Shanghai
 5  Europe - United Kingdom (London)  Europe   United Kingdom  London
 6  Europe - France (Orléans)         Europe   France          Orléans
 7  Afrique - Togo (Lomé)             Afrique  Togo            Lomé
 8  Afrique - Sénégal (Dakar)         Afrique  Sénégal         Dakar
 9  Asia - Pakistan (Rahim Yar Khan)  Asia     Pakistan        Rahim Yar Khan

Any suggestions on how to do this?

Update: to remove the extra whitespace, we can add this line at the end of the code: mutate(across(starts_with("col"), str_squish))

We can replace the first separator - with ( so that only one separator remains. Afterwards, apply separate and finally remove the remaining ).

library(dplyr)
library(stringr)
library(tidyr)

  ID    col1           col2           col3
1  1    Asia           <NA>           <NA>
2  2  Africa          Benin        Cotonou
3  3  Europe         France          Paris
4  4    Asia          China       Shanghai
5  5  Europe United Kingdom         London
6  6  Europe         France        Orléans
7  7 Afrique           Togo           Lomé
8  8 Afrique        Sénégal          Dakar
9  9    Asia       Pakistan Rahim Yar Khan
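
A minimal sketch of the steps described in this answer, assuming the mydata data frame from the question. The exact patterns are an assumption and not necessarily the code originally posted. str_replace changes only the first match, so only the first - becomes (; separate then splits on (, and str_remove drops the closing ). The last line is the str_squish step from the update.

```r
library(dplyr)
library(stringr)
library(tidyr)

Country <- c("Asia", "Africa - Benin (Cotonou)", "Europe - France (Paris)",
             "Asia - China(Shanghai)", "Europe - United Kingdom (London)",
             "Europe - France (Orléans)", "Afrique - Togo (Lomé)",
             "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
mydata <- data.frame(ID = 1:9, Country)

result <- mydata %>%
  # Replace only the first "-" with "(" so that "(" is the single separator
  mutate(Country = str_replace(Country, "-", "(")) %>%
  # Split on "("; rows with fewer than 3 pieces are padded with NA on the right
  separate(Country, into = c("col1", "col2", "col3"),
           sep = "\\(", fill = "right") %>%
  # Drop the closing ")" left at the end of col3
  mutate(col3 = str_remove(col3, "\\)")) %>%
  # Trim leading/trailing whitespace in the three new columns
  mutate(across(starts_with("col"), str_squish))

result
```

Setting fill = "right" is a choice made here to avoid the warning for the "Asia" row, which has only one piece.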

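tidyr::separate splits text into columns at a separator, which by default is any run of non-alphanumeric characters, so by default it also splits on spaces. You can use the extra argument to merge all remaining text into the third column, like this: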

mydata %>% 
    separate(Country, 
            into = c("Col1", "Col2", "Col3"),
            extra = "merge")
  ID    Col1     Col2             Col3
1  1    Asia     <NA>             <NA>
2  2  Africa    Benin         Cotonou)
3  3  Europe   France           Paris)
4  4    Asia    China        Shanghai)
5  5  Europe   United Kingdom (London)
6  6  Europe   France         Orléans)
7  7 Afrique     Togo            Lomé)
8  8 Afrique  Sénégal           Dakar)
9  9    Asia Pakistan  Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1]. 

However, this leaves an unwanted ) at the end. You can remove it with mutate, or, instead of separate, use tidyr::extract, which extracts the pieces with a regular expression:

mydata %>% 
    extract(Country, 
            into = c("Col1", "Col2", "Col3"),
            regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
  ID    Col1     Col2           Col3
1  1    <NA>     <NA>           <NA>
2  2  Africa    Benin        Cotonou
3  3  Europe   France          Paris
4  4    Asia    China       Shanghai
5  5    <NA>     <NA>           <NA>
6  6  Europe   France        Orléans
7  7 Afrique     Togo           Lomé
8  8 Afrique  Sénégal          Dakar
9  9    Asia Pakistan Rahim Yar Khan
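
Rows 1 and 5 come back as NA in the output above because the pattern requires all three parts, and [[:alnum:]]+ does not match the space in "United Kingdom". One possible variant, offered as a sketch and not as part of the original answer, makes the second and third parts optional and allows spaces in the second part. It uses stringr::str_match rather than tidyr::extract; the pattern and the name parts are illustrative assumptions.

```r
library(stringr)

Country <- c("Asia", "Africa - Benin (Cotonou)", "Europe - France (Paris)",
             "Asia - China(Shanghai)", "Europe - United Kingdom (London)",
             "Europe - France (Orléans)", "Afrique - Togo (Lomé)",
             "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
mydata <- data.frame(ID = 1:9, Country)

# Group 1: text before the "-" separator.
# Groups 2 and 3 (both optional): text before "(" and text inside "(...)".
pattern <- "^([^\\-(]+?)(?:\\s*-\\s*([^(]+?)\\s*\\((.*)\\))?$"

# str_match() returns a character matrix: column 1 holds the full match,
# columns 2 to 4 hold the capture groups (NA where a group did not take part)
parts <- str_match(mydata$Country, pattern)
mydata$Col1 <- parts[, 2]
mydata$Col2 <- parts[, 3]
mydata$Col3 <- parts[, 4]

mydata
```

This sketch assumes the first part never contains a hyphen or an opening parenthesis; names that do would need a different pattern.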

This is my first contribution, so apologies for any mistakes. Here is how I did it; it may not be the simplest way, but I think it works:

library(dplyr)
library(tidyr)
library(stringr)

mydata %>% 
  separate(col = "Country",
           sep = "[\\(-]",   # split at "(" or "-"
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>% 
  mutate(Col3 = str_remove(Col3, "\\)"))   # drop the closing ")"

library(dplyr)
library(tidyr)

mydata %>%
  separate(Country, into = c("col1", "col2", "col3"), sep = '( - | ?\\()', remove = FALSE) %>%
  mutate(col3 = gsub(')', '', col3, fixed = TRUE))

#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#>   ID                          Country    col1           col2           col3
#> 1  1                             Asia    Asia           <NA>           <NA>
#> 2  2         Africa - Benin (Cotonou)  Africa          Benin        Cotonou
#> 3  3          Europe - France (Paris)  Europe         France          Paris
#> 4  4           Asia - China(Shanghai)    Asia          China       Shanghai
#> 5  5 Europe - United Kingdom (London)  Europe United Kingdom         London
#> 6  6        Europe - France (Orléans)  Europe         France        Orléans
#> 7  7            Afrique - Togo (Lomé) Afrique           Togo           Lomé
#> 8  8        Afrique - Sénégal (Dakar) Afrique        Sénégal          Dakar
#> 9  9 Asia - Pakistan (Rahim Yar Khan)    Asia       Pakistan Rahim Yar Khan

A data.table solution:

require(data.table)
setDT(mydata)

splitCountry <- function( c_str ) {
  
  vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
  col1 <- vec[1]
  col2 <- vec[2]
  col3 <- vec[3]
  
  return(list(col1,
              col2,
              col3))
  
}

mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]
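
The helper above is called once per distinct Country value. As a possible alternative, data.table::tstrsplit can do the same split in one vectorised call. This sketch is not part of the original answer, and, like the helper, it assumes that no name contains punctuation other than the -, ( and ) separators.

```r
library(data.table)

Country <- c("Asia", "Africa - Benin (Cotonou)", "Europe - France (Paris)",
             "Asia - China(Shanghai)", "Europe - United Kingdom (London)",
             "Europe - France (Orléans)", "Afrique - Togo (Lomé)",
             "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
mydata <- data.table(ID = 1:9, Country)

# tstrsplit() splits every string at punctuation and transposes the pieces
# into one vector per position, padding short rows (such as "Asia") with NA;
# trimws() then strips the spaces left around each piece
mydata[, c("col1", "col2", "col3") :=
         lapply(tstrsplit(Country, "[[:punct:]]"), trimws)]

mydata
```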