How to extract multiple new columns from a complex string in R (str_sub does not seem to do the trick)
I am struggling to extract multiple variables from a string in R.
The column is structured like this:
7 digit identifier
_NAME
:4 digit value
-4 digit value
-4 digit value
-location1
-location2
:7-digit identifier
_junk
_junk
_3 digit value with junk attached
For example:
1234567_NAME:0011-1234-0176-town-car:1234567_000001_original_010qyz
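I need new columns for:
- the 7-digit identifier ("1234567")
- the name
- each of the 4-digit values
- each location
- the 3-digit value
str_sub() does not work because some parts of the string vary in length.
I tried gsub, but because some special characters are repeated several times (namely ":" and "-"), I could not use them to extract well-defined parts of the string.
To avoid a lengthy regular expression, one option is to use str_split_fixed to split the column into a matrix, with [_:-] as the delimiter, then drop the unwanted columns and extract the numeric value from the last column: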
s <- "1234567_NAME:0011-1234-0176-town-car:1234567_000001_original_010qyz"
ss <- c(s,s,s)
library(stringr)
mat <- str_split_fixed(ss, "[_:-]", 11)[,-c(9, 10)]
mat
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010qyz"
#[2,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010qyz"
#[3,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010qyz"
# Backslashes must be doubled inside R string literals.
mat[,9] <- sub("(\\d{3}).*", "\\1", mat[,9])
mat
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010"
#[2,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010"
#[3,] "1234567" "NAME" "0011" "1234" "0176" "town" "car" "1234567" "010"
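The split-and-trim steps above can also be sketched in base R, ending with a data frame that has named columns. This is only an illustration: the column names (id, name, val1, ..., id2, val4) are hypothetical labels that do not come from the original post, and the approach assumes that no name or location itself contains "_", ":" or "-".

```r
# Same sample input as above, repeated three times.
s <- "1234567_NAME:0011-1234-0176-town-car:1234567_000001_original_010qyz"
ss <- c(s, s, s)

# Split every string on "_", ":" or "-"; each input yields 11 pieces.
parts <- strsplit(ss, "[_:-]")

# Stack the pieces into a matrix and drop the two junk fields (columns 9 and 10).
mat <- do.call(rbind, parts)[, -c(9, 10), drop = FALSE]

# Keep only the leading three digits of the last field ("010qyz" becomes "010").
mat[, 9] <- sub("^([0-9]{3}).*", "\\1", mat[, 9])

# Hypothetical column labels, chosen here for illustration only.
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("id", "name", "val1", "val2", "val3",
               "loc1", "loc2", "id2", "val4")
df
```

Keeping everything as character preserves leading zeros such as "0011"; convert individual columns with as.integer() only if the zeros do not matter.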
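If you stick with the "lengthy" regular expression, you can do the following (and add record validation later, since the field lengths are already embedded in the pattern):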
library(stringi)
library(purrr)
pat <- "(.{7})_([[:alnum:][:space:]]+):([[:digit:]]{4})-([[:digit:]]{4})-([[:digit:]]{4})-([[:alnum:][:space:]]+)-([[:alnum:][:space:]]+):([[:digit:]]{7})_[[:alnum:][:space:]]+_[[:alnum:][:space:]]+_([[:digit:]]{3})"
dat <- "1234567_NAME:0011-1234-0176-town-car:1234567_000001_original_010qyz"
dat <- rep(dat, 10)
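# Eight distinct names: the last capture group (the 3-digit value) gets its own
# column instead of reusing "val3" and overwriting the third 4-digit value.
cols <- c("id", "name", "val1", "val2", "val3", "loc1", "loc2", "val4")
# Column 1 of each match matrix is the full match; columns 2:8 are groups 1-7,
# and column 10 is group 9 (column 9, the repeated identifier, is skipped).
stri_match_all_regex(dat, pat) %>%
map_df(~setNames(as.list(.[,c(2:8,10)]), cols))
## # A tibble: 10 x 8
## id name val1 val2 val3 loc1 loc2 val4
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1234567 NAME 0011 1234 0176 town car 010
## 2 1234567 NAME 0011 1234 0176 town car 010
## 3 1234567 NAME 0011 1234 0176 town car 010
## 4 1234567 NAME 0011 1234 0176 town car 010
## 5 1234567 NAME 0011 1234 0176 town car 010
## 6 1234567 NAME 0011 1234 0176 town car 010
## 7 1234567 NAME 0011 1234 0176 town car 010
## 8 1234567 NAME 0011 1234 0176 town car 010
## 9 1234567 NAME 0011 1234 0176 town car 010
## 10 1234567 NAME 0011 1234 0176 town car 010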
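As a dependency-free variant of the single-regex idea, base R's utils::strcapture() can map capture groups straight onto the columns of a prototype data frame. The sketch below is an assumption-laden illustration rather than part of the original answer: the pattern is a simplified rewrite that assumes names contain no ":" and locations contain no "-" or ":", the column names are hypothetical, and the "bad_record" entry is made up to show what happens to a non-matching row.

```r
# Sample record from the question plus one deliberately malformed record.
good <- "1234567_NAME:0011-1234-0176-town-car:1234567_000001_original_010qyz"
dat <- c(good, good, good, "bad_record")

# One capture group per wanted field; the repeated identifier and the two
# junk fields are matched but not captured.
pat <- paste0(
  "^([0-9]{7})_([^:]+):",                   # id, name
  "([0-9]{4})-([0-9]{4})-([0-9]{4})-",      # three 4-digit values
  "([^-:]+)-([^-:]+):",                     # two locations
  "[0-9]{7}_[^_]+_[^_]+_([0-9]{3})"         # skipped fields, 3-digit value
)

# Prototype: column names (hypothetical) and types for the result.
proto <- data.frame(
  id = character(), name = character(),
  val1 = character(), val2 = character(), val3 = character(),
  loc1 = character(), loc2 = character(), val4 = character(),
  stringsAsFactors = FALSE
)

res <- strcapture(pat, dat, proto)

# Records that do not fit the pattern come back as a row of NA values,
# which gives a simple validation check.
invalid <- is.na(res$id)
res
```

Because the field lengths are part of the pattern, filtering on the invalid vector is one way to flag records that do not match the expected layout.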