Separate a string by a number
I am trying to split my column VEHICLE_TYPE into model and engine. The code can be plain SQL or R.
My data looks like this:
   MODEL        VEHICLE_TYPE
77 Bora         Bora 1.6
79 Ducato       Ducato 15 120 Multijet
80 Ducato       Ducato 15 120 Multijet
87 Astra        Astra 1.7 CDTI
88 406          406 2.0 HDi
89 406          406 2.0 HDi
90 Focus C-MAX  Focus C-MAX 1.6 TDCi
91 Focus C-MAX  Focus C-MAX 1.6 TDCi
92 Focus C-MAX  Focus C-MAX 1.6 TDCi
93 Focus C-MAX  Focus C-MAX 1.6 TDCi
94 Focus C-MAX  Focus C-MAX 1.6 TDCi
97 S-Klasse     S 320 CDI
98 S-Klasse     S 320 CDI
99 S-Klasse     S 320 CDI
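I would like to get something like this:
MODEL        VEHICLE TYPE
Bora         1.6
Ducato       15 120 Multijet
...          ...
Focus C-Max  1.6 TDCi
The problem is that VEHICLE_TYPE can have different lengths and a varying number of spaces, which I could otherwise use as separators.
I tried gsub with a regex and it did not work, but strsplit did. The result is still far from what I actually want. I have run out of ideas and could use some help.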
> strsplit(as.character(test$VEHICLE_TYPE)," ")
[[1]]
[1] "Bora" "1.6"
[[2]]
[1] "Ducato" "15" "120" "Multijet"
[[3]]
[1] "Ducato" "15" "120" "Multijet"
[[4]]
[1] "Astra" "1.7" "CDTI"
[[5]]
[1] "406" "2.0" "HDi"
[[6]]
[1] "406" "2.0" "HDi"
[[7]]
[1] "Focus" "C-MAX" "1.6" "TDCi"
[[8]]
[1] "Focus" "C-MAX" "1.6" "TDCi"
[[9]]
[1] "Focus" "C-MAX" "1.6" "TDCi"
[[10]]
[1] "Focus" "C-MAX" "1.6" "TDCi"
[[11]]
[1] "Focus" "C-MAX" "1.6" "TDCi"
[[12]]
[1] "S" "320" "CDI"
I suspect someone knows a simpler regex for this, but since I am no regex expert, here is my attempt: split on spaces, then collapse everything from the first "numeric" value onward.
library( magrittr )
df[['VEHICLE_TYPE']] %<>%
strsplit( " " ) %>%
sapply( function(x) paste(
x[ grep( "[[:digit:]]", x )[1] : length(x) ],
collapse = " " )
)
Result:
> df
# # A tibble: 14 × 2
# MODEL VEHICLE_TYPE
# <chr> <chr>
# 1 Bora 1.6
# 2 Ducato 15 120 Multijet
# 3 Ducato 15 120 Multijet
# 4 Astra 1.7 CDTI
# 5 406 406 2.0 HDi
# 6 406 406 2.0 HDi
# 7 Focus C-MAX 1.6 TDCi
# 8 Focus C-MAX 1.6 TDCi
# 9 Focus C-MAX 1.6 TDCi
# 10 Focus C-MAX 1.6 TDCi
# 11 Focus C-MAX 1.6 TDCi
# 12 S-Klasse 320 CDI
# 13 S-Klasse 320 CDI
# 14 S-Klasse 320 CDI
Or, if you would rather split at the last numeric value instead of the first:
df[['VEHICLE_TYPE']] %<>%
strsplit( " " ) %>%
sapply( function(x) paste(
x[ tail( grep( "[[:digit:]]", x ), 1 ) : length(x) ],
collapse = " " )
)
> df
# # A tibble: 14 × 2
# MODEL VEHICLE_TYPE
# <chr> <chr>
# 1 Bora 1.6
# 2 Ducato 120 Multijet
# 3 Ducato 120 Multijet
# 4 Astra 1.7 CDTI
# 5 406 2.0 HDi
# 6 406 2.0 HDi
# 7 Focus C-MAX 1.6 TDCi
# 8 Focus C-MAX 1.6 TDCi
# 9 Focus C-MAX 1.6 TDCi
# 10 Focus C-MAX 1.6 TDCi
# 11 Focus C-MAX 1.6 TDCi
# 12 S-Klasse 320 CDI
# 13 S-Klasse 320 CDI
# 14 S-Klasse 320 CDI
Edit: if some rows contain no numeric value at all, you may need a little extra patching:
df[['VEHICLE_TYPE']] %<>%
strsplit( " " ) %>%
sapply( function(x) paste(
if( length( grep( "[[:digit:]]", x ) ) > 0L ) {
x[ tail( grep( "[[:digit:]]", x ), 1 ) : length(x) ]
} else { x },
collapse = " " )
)
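As a rough sketch of the same idea wrapped in a reusable function, the following keeps everything from the first or last digit-containing token and leaves digit-free rows unchanged. The function name engine_part, its from argument and the sample_types vector are hypothetical names chosen for illustration, not part of the answer above. It uses base R only.

```r
# Hypothetical helper (illustration only): keep everything from the first or
# last token that contains a digit; return the input unchanged when no token
# contains a digit.
engine_part <- function(x, from = c("first", "last")) {
  from <- match.arg(from)
  tokens <- strsplit(as.character(x), " ", fixed = TRUE)
  vapply(tokens, function(tok) {
    hits <- grep("[[:digit:]]", tok)
    # No numeric token: give back the original string
    if (length(hits) == 0L) return(paste(tok, collapse = " "))
    start <- if (from == "first") hits[1] else hits[length(hits)]
    paste(tok[start:length(tok)], collapse = " ")
  }, character(1))
}

# Made-up sample input, including one row without any digit
sample_types <- c("Bora 1.6", "Ducato 15 120 Multijet", "Unknown model")
first_split <- engine_part(sample_types, "first")
last_split  <- engine_part(sample_types, "last")
print(first_split)
print(last_split)
```

Applied to the data frame, this would be df$VEHICLE_TYPE <- engine_part(df$VEHICLE_TYPE, "last"), assuming the column is character or factor.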
Regex example (Oracle):
with s(id,model,type) as (
select 77,'Bora','Bora 1.6' from dual union all
select 79,'Ducato','Ducato 15 120 Multijet' from dual union all
select 80 ,'Ducato','Ducato 15 120 Multijet' from dual union all
select 87 ,'Astra','Astra 1.7 CDTI' from dual union all
select 88 ,'406','406 2.0 HDi' from dual union all
select 89 ,'406','406 2.0 HDi' from dual union all
select 90 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 91 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 92 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 93 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 94 ,'Focus C-MAX','Focus C-MAX 1.6 TDCi' from dual union all
select 97 ,'S-Klasse','S 320 CDI' from dual union all
select 98 ,'S-Klasse','S 320 CDI' from dual union all
select 99 ,'S-Klasse','S 320 CDI' from dual
)
select regexp_substr(type,'\d+(\.\d+)?\s*\w*$') /*cut part with model*/
from s
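For R users, an approximate equivalent of that regexp_substr call might look like the sketch below. The vector name types and the extra "Unknown model" row are assumptions made for illustration. perl = TRUE is passed so that \d and \w are interpreted by PCRE, and rows without a match become NA instead of being dropped.

```r
# Made-up sample vector standing in for the VEHICLE_TYPE column
types <- c("Bora 1.6", "Ducato 15 120 Multijet", "406 2.0 HDi",
           "Focus C-MAX 1.6 TDCi", "S 320 CDI", "Unknown model")

# Same pattern as in the SQL query: a number, an optional decimal part,
# optional whitespace and an optional trailing word at the end of the string
pattern <- "\\d+(\\.\\d+)?\\s*\\w*$"

m <- regexpr(pattern, types, perl = TRUE)

# regmatches() returns only the matched elements, so fill a vector of NAs
engine <- rep(NA_character_, length(types))
engine[m > 0] <- regmatches(types, m)
print(engine)
```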
Here is an alternative solution using gsub:
df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+(?: [^ ]+)?)$", "\\1", df$VEHICLE_TYPE)
> df
# MODEL VEHICLE_TYPE
# 1 Bora 1.6
# 2 Ducato 120 Multijet
# 3 Ducato 120 Multijet
# 4 Astra 1.7 CDTI
# 5 406 2.0 HDi
# 6 406 2.0 HDi
# 7 Focus C-MAX 1.6 TDCi
# 8 Focus C-MAX 1.6 TDCi
# 9 Focus C-MAX 1.6 TDCi
# 10 Focus C-MAX 1.6 TDCi
# 11 Focus C-MAX 1.6 TDCi
# 12 S-Klasse 320 CDI
# 13 S-Klasse 320 CDI
# 14 S-Klasse 320 CDI
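I assume the vehicle type always comes at the end and follows one of these patterns: (1) a group of numeric characters (0 to 9 and the dot), e.g. 1.6, or (2) a group of numeric characters followed by a group of any other characters, separated by a space (e.g. 120 Multijet, 2.0 HDi).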
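To make that assumption concrete, here is a small sketch that applies the pattern to a few of the sample strings. The names samples, type_pattern and extracted are made up for illustration, and perl = TRUE is passed so that the non-capturing group (?: ... ) is handled by PCRE. A string that does not end in such a pattern is returned unchanged.

```r
# Made-up sample vector; the last element has no numeric group at the end
samples <- c("Bora 1.6", "Ducato 15 120 Multijet",
             "Focus C-MAX 1.6 TDCi", "S 320 CDI", "Unknown model")

# Group 1: a run of digits/dots, optionally followed by one more
# space-separated token, anchored at the end of the string
type_pattern <- ".+ ([0-9.]+(?: [^ ]+)?)$"

extracted <- sub(type_pattern, "\\1", samples, perl = TRUE)
print(extracted)
```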
Update: to handle 308 1.6i Flex and Cherokee 2.8 CRD 4x4
df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+[a-z]?(?: [^ ]+)?(?: [^ ]+)?)$", "\\1", df$VEHICLE_TYPE)
# OR, simply grep "number" and everything after
# df$VEHICLE_TYPE <- gsub(".+ ([0-9.]+[a-z]? .+)$", "\\1", df$VEHICLE_TYPE)
> df
# MODEL VEHICLE_TYPE
# 1 Bora 1.6
# 2 Ducato 120 Multijet
# 3 Ducato 120 Multijet
# 4 Astra 1.7 CDTI
# 5 406 2.0 HDi
# 6 406 2.0 HDi
# 7 Focus C-MAX 1.6 TDCi
# 8 Focus C-MAX 1.6 TDCi
# 9 Focus C-MAX 1.6 TDCi
# 10 Focus C-MAX 1.6 TDCi
# 11 Focus C-MAX 1.6 TDCi
# 12 S-Klasse 320 CDI
# 13 S-Klasse 320 CDI
# 14 S-Klasse 320 CDI
# 15 308 1.6i Flex
# 16 Cherokee 2.8 CRD 4x4
In Oracle, you can use the first and second capture groups of the regular expression ^(.*?)\s+(\d.*)$:
SELECT REGEXP_SUBSTR( vehicle_type, '^(.*?)\s+(\d.*)$', 1, 1, NULL, 1 )
AS model,
REGEXP_SUBSTR( vehicle_type, '^(.*?)\s+(\d.*)$', 1, 1, NULL, 2 )
AS vehicle_type
FROM your_table;
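If the data is already in R, an approximate counterpart of this two-group split might look like the following sketch. The names vehicle_type, split_pattern and result are assumptions for illustration. perl = TRUE is needed for the lazy quantifier .*? to work. Note that a string that does not match the pattern is returned whole in both columns.

```r
# Made-up sample vector standing in for the VEHICLE_TYPE column
vehicle_type <- c("Bora 1.6", "Ducato 15 120 Multijet", "406 2.0 HDi",
                  "Focus C-MAX 1.6 TDCi", "S 320 CDI")

# Group 1: shortest prefix before whitespace followed by a digit
# Group 2: everything from that digit to the end of the string
split_pattern <- "^(.*?)\\s+(\\d.*)$"

result <- data.frame(
  MODEL        = sub(split_pattern, "\\1", vehicle_type, perl = TRUE),
  VEHICLE_TYPE = sub(split_pattern, "\\2", vehicle_type, perl = TRUE),
  stringsAsFactors = FALSE
)
print(result)
```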