R: extract value and insert in 3 existing columns
I have a large dataset like the one below, and I am trying to add values to 3 columns based on the Country column.
Country<-c("Asia","Africa - Benin (Cotonou)",
"Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
, "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)
> mydata
> ID Country col1 col2 col3
> 1 1 Asia
> 2 2 Africa - Benin (Cotonou)
> 3 3 Europe - France (Paris)
> 4 4 Asia - China(Shanghai)
> 5 5 Europe - United Kingdom (London)
> 6 6 Europe - France (Orléans)
> 7 7 Afrique - Togo (Lomé)
> 8 8 Afrique - Sénégal (Dakar)
> 9 9 Asia - Pakistan (Rahim Yar Khan)
I tried the following, but there is a problem with the regular expression:
library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
The result I get is:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom
6 Europe - France (Orléans) Europe France Orl
7 Afrique - Togo (Lomé) Afrique Togo L
8 Afrique - Sénégal (Dakar) Afrique S n
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim
In rows 5, 6, 7, 8 and 9, part of the value is missing from the third column.
The result I want is:
ID Country Col1 Col2 Col3
1 Asia Asia <NA> <NA>
2 Africa - Benin (Cotonou) Africa Benin Cotonou
3 Europe - France (Paris) Europe France Paris
4 Asia - China(Shanghai) Asia China Shanghai
5 Europe - United Kingdom (London) Europe United Kingdom London
6 Europe - France (Orléans) Europe France Orléans
7 Afrique - Togo (Lomé) Afrique Togo Lomé
8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
Any suggestions on how to do this?
Update: to remove the extra whitespace, we can add this line at the end of the code:
mutate(across(starts_with("col"), str_squish))
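We can replace the first separator - with ( so that only one kind of separator remains.
Afterwards we call separate, and finally we remove the leftover ).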
library(dplyr)
library(stringr)
library(tidyr)
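The steps described above could be sketched roughly as follows. This is only an illustration built on the mydata from the question; the choice of str_replace for the first step, sep = "\\(" for separate, and str_remove for the last step is an assumption about how each step might be carried out.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Sample data from the question
Country <- c("Asia", "Africa - Benin (Cotonou)", "Europe - France (Paris)",
             "Asia - China(Shanghai)", "Europe - United Kingdom (London)",
             "Europe - France (Orléans)", "Afrique - Togo (Lomé)",
             "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
mydata <- data.frame(ID = 1:9, Country, stringsAsFactors = FALSE)

result <- mydata %>%
  # Step 1: turn the first " - " into "(" so "(" is the only separator
  mutate(Country = str_replace(Country, " - ", "(")) %>%
  # Step 2: split on "("; rows with fewer pieces are padded with NA on the right
  separate(Country, into = c("col1", "col2", "col3"),
           sep = "\\(", fill = "right") %>%
  # Step 3: drop the leftover closing parenthesis
  mutate(col3 = str_remove(col3, "\\)")) %>%
  # Update from the answer: squish stray whitespace in the new columns
  mutate(across(starts_with("col"), str_squish))

result
```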
ID col1 col2 col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 Europe United Kingdom London
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
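tidyr::separate splits text into columns based on a separator (by default, any run of non-alphanumeric characters), so by default it also splits on spaces. You can use the extra argument to merge all remaining text into the third column, like this: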
mydata %>%
separate(Country,
into = c("Col1", "Col2", "Col3"),
extra = "merge")
ID Col1 Col2 Col3
1 1 Asia <NA> <NA>
2 2 Africa Benin Cotonou)
3 3 Europe France Paris)
4 4 Asia China Shanghai)
5 5 Europe United Kingdom (London)
6 6 Europe France Orléans)
7 7 Afrique Togo Lomé)
8 8 Afrique Sénégal Dakar)
9 9 Asia Pakistan Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
However, this leaves an unwanted ) at the end. You can remove it with mutate, or you can use tidyr::extract instead of separate, which extracts based on a regular expression:
mydata %>%
extract(Country,
into = c("Col1", "Col2", "Col3"),
regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
ID Col1 Col2 Col3
1 1 <NA> <NA> <NA>
2 2 Africa Benin Cotonou
3 3 Europe France Paris
4 4 Asia China Shanghai
5 5 <NA> <NA> <NA>
6 6 Europe France Orléans
7 7 Afrique Togo Lomé
8 8 Afrique Sénégal Dakar
9 9 Asia Pakistan Rahim Yar Khan
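In the output above, rows 1 and 5 come back as NA because the pattern does not allow a space inside the second group and requires all three parts to be present. One possible variation, offered as a sketch rather than a definitive fix, uses lazy groups so that "United Kingdom" is captured, and then falls back to the original string for rows that did not match at all. The fallback through dplyr::coalesce and the exact regular expression are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
Country <- c("Asia", "Africa - Benin (Cotonou)", "Europe - France (Paris)",
             "Asia - China(Shanghai)", "Europe - United Kingdom (London)",
             "Europe - France (Orléans)", "Afrique - Togo (Lomé)",
             "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")
mydata <- data.frame(ID = 1:9, Country, stringsAsFactors = FALSE)

result <- mydata %>%
  # Lazy groups: text before " - ", text before the "(", text inside "( )"
  extract(Country,
          into = c("Col1", "Col2", "Col3"),
          regex = "^(.*?) - (.*?) ?\\((.*)\\)$",
          remove = FALSE) %>%
  # Rows without " - ... (...)" did not match; keep the whole string as Col1
  mutate(Col1 = coalesce(Col1, Country))

result
```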
This is my first contribution, so apologies for any mistakes.
Here is how I did it. It may not be the simplest way, but I think it works:
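library(dplyr)
library(stringr)
library(tidyr)
mydata %>%
separate(col = "Country",
sep = "[\\(-]", # split on "(" or "-"; backslashes must be doubled in R strings
into = c("Col1", "Col2", "Col3"),
remove = FALSE,
fill = "right") %>%
mutate(Col3 = str_remove(Col3, "\\)")) # drop the trailing ")"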
library(dplyr)
library(tidyr)
mydata %>%
separate(Country, into = c("col1", "col2", "col3"), sep = '( - | ?\\()', remove = FALSE) %>%
mutate(col3 = gsub(')', '', col3, fixed = TRUE))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> ID Country col1 col2 col3
#> 1 1 Asia Asia <NA> <NA>
#> 2 2 Africa - Benin (Cotonou) Africa Benin Cotonou
#> 3 3 Europe - France (Paris) Europe France Paris
#> 4 4 Asia - China(Shanghai) Asia China Shanghai
#> 5 5 Europe - United Kingdom (London) Europe United Kingdom London
#> 6 6 Europe - France (Orléans) Europe France Orléans
#> 7 7 Afrique - Togo (Lomé) Afrique Togo Lomé
#> 8 8 Afrique - Sénégal (Dakar) Afrique Sénégal Dakar
#> 9 9 Asia - Pakistan (Rahim Yar Khan) Asia Pakistan Rahim Yar Khan
A data.table solution:
require(data.table)
setDT(mydata)
splitCountry <- function( c_str ) {
vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
col1 <- vec[1]
col2 <- vec[2]
col3 <- vec[3]
return(list(col1,
col2,
col3))
}
mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]
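The same idea can also be written without a per-group helper function. The sketch below is an alternative, not the answer's own code: it assumes data.table's tstrsplit is an acceptable substitute for the strsplit call above, and it uses a shortened copy of the sample data. Like the version above, it splits on every punctuation character, so a place name containing a hyphen would be broken apart.

```r
library(data.table)

# Shortened copy of the sample data from the question
dt <- data.table(
  ID = 1:4,
  Country = c("Asia", "Africa - Benin (Cotonou)",
              "Asia - China(Shanghai)", "Europe - United Kingdom (London)")
)

# tstrsplit splits every string and transposes the pieces into columns,
# padding short rows with NA; trimws then removes the surrounding spaces
dt[, c("col1", "col2", "col3") :=
     lapply(tstrsplit(Country, split = "[[:punct:]]"), trimws)]

dt
```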