Splitting complicated strings with numbers, characters and special signs
I have a data frame with a single variable, like this:
rownr country
22 Bolivia 0.16 0.16 4.63* 22.10* 450
23 Mozambique 1.11 19.22* 0.19 12.38* 486
24 Germany 0.77 6.06* 0.53 15.35* 630
25 Bosnia & Herzegovina 0.72 6.84* 1.03 21.60* 889
I would like to split it into six separate variables, like this:
rownr country number 2 3 4 5
22 Bolivia 0.16 0.16 4.63* 22.10* 450
23 Mozambique 1.11 19.22* 0.19 12.38* 486
24 Germany 0.77 6.06* 0.53 15.35* 630
25 Bosnia & Herzegovina 0.72 6.84* 1.03 21.60* 889
This is what I have tried:
names(df)[1] <- "Strng"
df <- setDT(df)[, paste0("RA", 1:8) := tstrsplit(Strng, " ", type.convert = TRUE, fixed = TRUE)]
df$country <- gsub("[[:digit:]]","",df$Strng)
df$country <- gsub("[[:punct:]]","",df$country)
df$numbers <- gsub("[[:alpha:]]"," ",df$Strng)
df <- select(df, RA1:RA5)
names(df)[1] <- "country"
names(df)[2] <- "number"
df$numberss <- strsplit(df$numbers, split=" ", fixed = FALSE, perl = FALSE, useBytes = FALSE)
df <- setDT(df)[, paste0("RA", 1:5) := tstrsplit(numbers, " ", type.convert = TRUE, fixed = TRUE)]
This results in:
rownr country number 3 4 5
22 Bolivia 0.16 0.16 4.63* 22.10*
23 Mozambique 1.11 19.22* 0.19 12.38*
24 Germany 0.77 6.06* 0.53 15.35*
25 Bosnia & Herzegovina 0.72 6.84*
I don't know how to proceed. Any suggestions?
Using a positive lookahead, we can split only on a space (\s) that is followed by a digit ((?=\d)):
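library(dplyr)   # provides %>% and mutate_if()
library(tidyr)   # provides separate()
# Backslashes must be doubled inside an R string literal.
df %>%
  mutate_if(is.factor, as.character) %>%
  separate(country, sep = "\\s(?=\\d)", into = c("country", "number", "2", "3", "4", "5"))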
country number 2 3 4 5
1 Bolivia 0.16 0.16 4.63* 22.10* 450
2 Germany 0.77 6.06* 0.53 15.35* 630
3 Bosnia & Herzegovina 0.72 6.84* 1.03 21.60* 889
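
The same lookahead split can also be sketched in base R, which may help when tidyr is not available. This is only an illustration under assumptions: the character vector `x` and the result name `out` are hypothetical, and the sample strings are copied from the question. `strsplit()` needs `perl = TRUE` for the lookahead to be recognised.

```r
# Sample strings copied from the question (hypothetical vector name `x`).
x <- c(
  "Bolivia 0.16 0.16 4.63* 22.10* 450",
  "Mozambique 1.11 19.22* 0.19 12.38* 486",
  "Germany 0.77 6.06* 0.53 15.35* 630",
  "Bosnia & Herzegovina 0.72 6.84* 1.03 21.60* 889"
)

# Split on a whitespace character only where a digit follows it.
# The lookahead (?=\\d) does not consume the digit, so the numbers stay intact,
# and the spaces inside "Bosnia & Herzegovina" are left alone.
parts <- strsplit(x, "\\s(?=\\d)", perl = TRUE)

# Bind the pieces row-wise into a character matrix, then into a data frame.
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("country", "number", "2", "3", "4", "5")

print(out)
```

All columns come back as character, because values such as 4.63* still carry the asterisk. The `do.call(rbind, ...)` step assumes every string splits into the same number of pieces, which holds for the four rows above.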