Match and extract substrings in R
I have text data stored line by line as a character vector; every element is a string:
[1]"1128=9,9=282,35=X,34=4846318,52=20140107224500037,34=20140107,268=3,279=0,22=8,48=637548,83=585590,107=ZCH4,269=4,270=425,273=224500000,286=5,279=0,22=8,48=637548,83=585591,107=ZCH4,269=E,273=425.5,273=224500000,279=0,273=8,48=637548,34=585592,107=ZCH4,269=F,270=425,271=100,273=224500000,10=144"
[2]"1128=9,9=467,35=X,34=4846344,52=20140107224500107,75=20140108,268=5,279=0,22=8,48=772825,279=0,22=8,48=692825,83=434250,107=ZCZ4,269=E,270=453,271=41,273=224500000,279=0,22=8,48=692007,83=434251,107=ZCZ4,269=F,270=452.75,273=224500000,279=0,22=8,48=35213,83=434252274=2,336=0,451=0.25,279=1,22=8,48=692825,83=434253,107=ZCZ4,269=1,270=453,271=51,273=224500000,336=0,346=17,1023=1,10=239"
I want to trim the data and extract only the substrings that start with "48=" and "34=".
My current code is:
ex_between(data, c('48=', '34='), c(',', ','), extract=TRUE)
It works, but it also strips the "48=" and "34=" prefixes, which I want to keep.
Desired result:
[1]"34=4846318,34=20140107,48=637548,48=637548,48=637548,34=585592"
[2]"34=4846344,48=772825,48=692825,48=692007,48=35213,48=692825"
The "34=..." and "48=..." elements in the trimmed data must appear in the same order as in the original data.
How about:
# Sample strings
x <- c("1128=9,9=282,35=X,34=4846318,52=20140107224500037,34=20140107,268=3,279=0,22=8,48=637548,83=585590,107=ZCH4,269=4,270=425,273=224500000,286=5,279=0,22=8,48=637548,83=585591,107=ZCH4,269=E,273=425.5,273=224500000,279=0,273=8,48=637548,34=585592,107=ZCH4,269=F,270=425,271=100,273=224500000,10=144",
"1128=9,9=467,35=X,34=4846344,52=20140107224500107,75=20140108,268=5,279=0,22=8,48=772825,279=0,22=8,48=692825,83=434250,107=ZCZ4,269=E,270=453,271=41,273=224500000,279=0,22=8,48=692007,83=434251,107=ZCZ4,269=F,270=452.75,273=224500000,279=0,22=8,48=35213,83=434252274=2,336=0,451=0.25,279=1,22=8,48=692825,83=434253,107=ZCZ4,269=1,270=453,271=51,273=224500000,336=0,346=17,1023=1,10=239")
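# Split each string on commas, keep the fields whose tag is exactly 48 or 34
# (^ anchors the match at the start of the field), then rejoin them in order.
unlist(lapply(strsplit(x, ","), function(fields)
  paste(fields[grep("^(48|34)=\\d+", fields)], collapse = ",")))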
#[1] "34=4846318,34=20140107,48=637548,48=637548,48=637548,34=585592"
#[2] "34=4846344,48=772825,48=692825,48=692007,48=35213,48=692825"
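You can also extract the required values with a PCRE regex such as (?<=,|^)(?:48|34)=[^,]* and then use sapply to collapse the matches found in each string with , to build the final result: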
x <- c("1128=9,9=282,35=X,34=4846318,52=20140107224500037,34=20140107,268=3,279=0,22=8,48=637548,83=585590,107=ZCH4,269=4,270=425,273=224500000,286=5,279=0,22=8,48=637548,83=585591,107=ZCH4,269=E,273=425.5,273=224500000,279=0,273=8,48=637548,34=585592,107=ZCH4,269=F,270=425,271=100,273=224500000,10=144", "1128=9,9=467,35=X,34=4846344,52=20140107224500107,75=20140108,268=5,279=0,22=8,48=772825,279=0,22=8,48=692825,83=434250,107=ZCZ4,269=E,270=453,271=41,273=224500000,279=0,22=8,48=692007,83=434251,107=ZCZ4,269=F,270=452.75,273=224500000,279=0,22=8,48=35213,83=434252274=2,336=0,451=0.25,279=1,22=8,48=692825,83=434253,107=ZCZ4,269=1,270=453,271=51,273=224500000,336=0,346=17,1023=1,10=239")
m <- regmatches(x, gregexpr("(?<=,|^)(?:48|34)=[^,]*", x, perl=TRUE))
sapply(m, function(x) paste(x, collapse=","))
# => [1] "34=4846318,34=20140107,48=637548,48=637548,48=637548,34=585592"
# => [2] "34=4846344,48=772825,48=692825,48=692007,48=35213,48=692825"
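If the set of tags changes often, the pattern can be assembled from a vector instead of being typed by hand. The following is only a sketch under stated assumptions, not part of the answer above: the helper name extract_tags and its arguments are hypothetical, the tags are assumed to be plain digits (so nothing needs regex escaping), and sep is assumed to be a single ordinary character such as a comma.

```r
# Hypothetical helper; the name and arguments are illustrative only.
# Assumes `tags` holds plain digits and `sep` is one non-special character.
extract_tags <- function(x, tags, sep = ",") {
  # For tags 48 and 34 this builds: (?<=,|^)(?:48|34)=[^,]*
  pattern <- sprintf("(?<=%s|^)(?:%s)=[^%s]*",
                     sep, paste(tags, collapse = "|"), sep)
  # gregexpr locates every match; regmatches pulls the matched text out.
  m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # One collapsed string per input element; no match gives "".
  vapply(m, paste, character(1), collapse = sep)
}

# Small made-up input; the second element has no 48 or 34 field.
res <- extract_tags(c("9=282,35=X,34=4846318,48=637548,83=585590",
                      "35=X,52=1"),
                    tags = c(48, 34))
print(res)
```

An element with no matching field comes back as an empty string rather than being dropped, so the result always has the same length as the input.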
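Pattern details
(?<=,|^) - a positive lookbehind: immediately to the left of the current position there must be a , or the start of the string. Lookbehinds need the PCRE engine, which is why perl=TRUE is required.
(?:48|34) - 48 or 34
= - an equals sign
[^,]* - zero or more characters other than ,
gregexpr locates every match in the input, and regmatches extracts them.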
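To see what the lookbehind contributes, here is a minimal sketch on a toy string. The input is invented for illustration (tags 148 and 134 do not come from the question's data); it compares the pattern with and without the (?<=,|^) part.

```r
# Toy input, invented for illustration only.
y <- "48=1,148=2,34=3,134=4"

# With the lookbehind: a match must start at the string start or right after a comma.
with_lb <- regmatches(y, gregexpr("(?<=,|^)(?:48|34)=[^,]*", y, perl = TRUE))[[1]]

# Without it: the tail of a longer tag (148, 134) is picked up as well.
without_lb <- regmatches(y, gregexpr("(?:48|34)=[^,]*", y, perl = TRUE))[[1]]

print(with_lb)     # "48=1" "34=3"
print(without_lb)  # "48=1" "48=2" "34=3" "34=4"
```

In the sample data from the question no tag happens to end in 48 or 34, so both patterns would return the same result there; the lookbehind only matters once such tags appear.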