How to parse key-value pairs of a URL string in R with multiple conditions
I have a string in the following format:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
My goal is to end up with a data frame like this:
first_name cust_id start_date end_date performance cust_notes
James(Mr) 98503 2015-05-20 2015-05-20 best ZZW_LG,WGE,zonaire
I ran the code below:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
split_by_comma <- strsplit(a,",")
split_by_equal <- lapply(split_by_comma,strsplit,"=")
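Because cust_id contains extra commas and parentheses, I don't get the desired result.
Note that the parentheses in first_name are real and need to be kept as they are.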
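To see concretely why the plain comma split goes wrong, here is a minimal sketch. The string is written on one line for convenience, and the name pieces is only illustrative.

```r
a <- "first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20, EndDate=2015-05-20, performance=best"

# A plain comma split also cuts inside the parentheses of cust_id
pieces <- strsplit(a, ",")[[1]]

length(pieces)       # 7 pieces instead of the 5 key=value pairs
grepl("=", pieces)   # the 3rd and 4th pieces ("WGE", "zonaire)") have no key
```

The two pieces without an "=" are the fragments of the cust_id notes, which is what breaks the later split on "=".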
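You need to split on this pattern:
,(?![^()]*\))
The negative lookahead stops the split from happening on a , that sits inside (). See the demo:
https://regex101.com/r/uF4oY4/82
To get the desired result, use: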
# In an R string literal the backslash must be doubled: "\\)" is the regex \)
split_by_comma <- strsplit(a, ",(?![^()]*\\))", perl=TRUE)
split_by_equal <- lapply(split_by_comma, strsplit, "=")
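As a rough sketch of how the lookahead split could be carried through to a one-row data frame using base R only: the names pairs, keys, values and df are illustrative, the string is written on one line, and the code assumes every pair contains an "=" and that only cust_id carries a parenthesised note.

```r
a <- "first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20, EndDate=2015-05-20, performance=best"

# Split on commas that are not inside parentheses, then trim whitespace
pairs <- trimws(strsplit(a, ",(?![^()]*\\))", perl = TRUE)[[1]])

# Split each pair at the first "=" only
keys   <- sub("=.*$", "", pairs)
values <- sub("^[^=]*=", "", pairs)

# One-row data frame with one column per key
df <- as.data.frame(as.list(setNames(values, keys)), stringsAsFactors = FALSE)

# Move the parenthesised part of cust_id into its own column
df$cust_notes <- sub("^[^(]*\\((.*)\\)$", "\\1", df$cust_id)
df$cust_id    <- sub("\\(.*$", "", df$cust_id)

df
```

first_name keeps its parentheses because only cust_id is post-processed. The columns keep their original names (StartDate, EndDate); renaming them to start_date and end_date would be a separate step.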
If your string format holds, this could be a quick solution:
library(httr)
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20,
EndDate=2015-05-20, performance=best")
dat <- data.frame(parse_url(sprintf("?%s", gsub(",[[:space:]]+", "&", a)))$query,
stringsAsFactors=FALSE)
library(tidyr)
library(dplyr)
# Parentheses are regex metacharacters, and the backslash must be doubled in R strings
mutate(separate(dat, cust_id, into=c("cust_id", "cust_notes"), sep="\\("),
       cust_notes=gsub("\\)", "", cust_notes))
## first_name cust_id cust_notes StartDate EndDate performance
## 1 James(Mr) 98503 ZZW_LG,WGE,zonaire 2015-05-20 2015-05-20 best
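Broken down:
gsub(",[[:space:]]+", "&", a) makes the parameters look like the components of a URL query string.
sprintf(…) makes it look like an actual query string.
parse_url (from httr) separates the key/value pairs and puts them in a list (named query) inside the returned list.
data.frame does, well, what its name says.
separate splits the cust_id column at the ( into two columns.
mutate removes the ) from the new cust_notes column.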
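The first two steps need only base R, so their intermediate values can be inspected without any packages. A small sketch, with the string written on one line and step1 and step2 as illustrative names:

```r
a <- "first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20, EndDate=2015-05-20, performance=best"

# Only commas followed by whitespace become "&"; the commas inside the
# parentheses have no whitespace after them, so they are left alone
step1 <- gsub(",[[:space:]]+", "&", a)
step1
# "first_name=James(Mr)&cust_id=98503(ZZW_LG,WGE,zonaire)&StartDate=2015-05-20&EndDate=2015-05-20&performance=best"

# Prefix "?" so the result reads as a URL query string
step2 <- sprintf("?%s", step1)
```

This is where the "if your string format holds" condition comes from: the approach relies on separators being a comma plus whitespace and on the commas inside the notes having no whitespace after them.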
Here is the whole thing as a "pipe":
library(httr)
library(tidyr)
library(dplyr)
library(magrittr)
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20,
EndDate=2015-05-20, performance=best")
a %>%
gsub(",[[:space:]]+", "&", .) %>%
sprintf("?%s", .) %>%
parse_url() %>%
extract2("query") %>%
data.frame(stringsAsFactors=FALSE) %>%
separate(cust_id, into=c("cust_id", "cust_notes"), sep="\\(") %>%
mutate(cust_notes=gsub("\\)", "", cust_notes))
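This matches the breakdown above and is (IMO) easier to follow.
Late reply, but I am posting it because it is easy to understand and implement and does not need any extra packages: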
# Assumes no field value contains a comma or "=" (read.csv splits on every comma);
# strip.white drops the space that follows each comma
rawdf <- read.csv("<your file path>", header = FALSE, sep = ",",
                  strip.white = TRUE, stringsAsFactors = FALSE)
# Take the first row of the data frame and transpose it into a single column
first_row <- data.frame(t(rawdf[1, ]))
# Split each "key=value" entry of that column on "=" into one flat vector
parts <- unlist(strsplit(as.character(first_row$X1), "="))
# Odd positions hold the keys (column names), even positions the values
keys <- parts[seq(1, length(parts), 2)]
# Use the extracted keys as the column names of the original data frame
colnames(rawdf) <- keys
# Strip the leading "key=" from every field, leaving only the value
for (i in seq_len(ncol(rawdf))) {
  rawdf[, i] <- sub(paste0(keys[i], "="), "", rawdf[, i], fixed = TRUE)
}
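A self-contained variant of the same idea, reading from inline text rather than a file. The two sample rows and the names txt and keys are made up for illustration, and the values deliberately contain no commas or "=", which this approach would not cope with.

```r
# Hypothetical input: one record per line, "key=value" fields separated by commas
txt <- "first_name=James, cust_id=98503, performance=best
first_name=Anna, cust_id=12345, performance=good"

rawdf <- read.csv(text = txt, header = FALSE, strip.white = TRUE,
                  stringsAsFactors = FALSE)

# Keys come from the first row: everything before the first "="
keys <- unname(sub("=.*$", "", unlist(rawdf[1, ])))

# Values: drop everything up to and including the first "=" in every column
rawdf[] <- lapply(rawdf, function(col) sub("^[^=]*=", "", col))
names(rawdf) <- keys

rawdf
```

Like the loop version, this takes the column names from the first row and assumes every row lists the same keys in the same order.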