Custom function for obtaining URL directory
This seems pretty simple.
Consider the following URLs:
[1] "scripts.iucr.org/cgi-bin/paper?S1600536812045886"
[2] "cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/"
[3] "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014"
[4] "www.google.com.cy/search?q=DNS+traffic&es_..."
[5] "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=..."
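I want to get the part between the first '/' and the token that contains the ?, that is, the directory without the host and without the file name and query string. If the URL has no such part, the result should be 0.
I wrote the function below: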
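get_directory <- function(x){
  # Count, per URL, the '/'-separated tokens that contain a '?'
  dir <- sapply(strsplit(x, '/'), function(i) sum(grepl('\\?', i)))
  # If any token has a '?', drop the first and last tokens and rejoin the rest
  ifelse(dir > 0, sapply(strsplit(x, '/'), function(i) paste(i[-c(1, length(i))], collapse = '/')), 0)
}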
But it fails on URLs [3] and [4].
The expected output should be
"cgi-bin"
"0"
"journal"
"0"
"pede/lobortis/ligula/sit"
Data
dput(df)
structure(list(V1 = c("scripts.iucr.org/cgi-bin/paper?S1600536812045886",
"cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/", "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014",
"www.google.com.cy/search?q=DNS+traffic&es_...", "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=..."
)), .Names = "V1", row.names = c(NA, -5L), class = "data.frame")
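We can use str_extract. Using regex lookarounds, we match zero or more characters (.*) that come after a / and are followed by a /, then one or more characters that are not ? ([^?]+), then a ?. URLs with no such match return NA, which we replace with 0.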
library(stringr)
res <- str_extract(df$V1, "(?<=\\/).*(?=\\/[^?]+[?])")
res[is.na(res)] <- 0
res
#[1] "cgi-bin" "0" "journal"
#[4] "0" "pede/lobortis/ligula/sit"
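If you would rather avoid the stringr dependency, here is a minimal base-R sketch of the same idea. The name get_directory_base is made up for illustration, and this is not guaranteed to agree with the regex on every input (for example URLs with several ? characters). It cuts the query string off first, so any slashes after the ? cannot interfere, and only then drops the host and the file name.

```r
# Hypothetical helper: a base-R sketch, not the stringr approach shown above.
get_directory_base <- function(x) {
  # TRUE for URLs that contain a literal '?'
  has_query <- grepl("?", x, fixed = TRUE)
  # Keep only the text before the first '?', so slashes in the query are ignored
  path <- sub("[?].*$", "", x)
  # Split into '/'-separated tokens, then drop the host (first) and file name (last)
  parts <- strsplit(path, "/", fixed = TRUE)
  dirs <- vapply(parts,
                 function(p) paste(p[-c(1, length(p))], collapse = "/"),
                 character(1))
  # No query, or nothing left between host and file name: return "0"
  ifelse(has_query & nzchar(dirs), dirs, "0")
}

urls <- c("scripts.iucr.org/cgi-bin/paper?S1600536812045886",
          "cpa-seoadvisors.com/cvv/auth/auth/view/pdf/index.html/",
          "www.scirp.org/journal/PaperDownload.aspx?DOI=10.4236/csta.2012.13014",
          "www.google.com.cy/search?q=DNS+traffic&es_...",
          "seesaa.net/pede/lobortis/ligula/sit/amet.png?semper=vitae&est=...")
get_directory_base(urls)
```

On the five sample URLs this gives the same result as the str_extract version. The original get_directory went wrong on [3] because the / inside the query string produced an extra token, and on [4] because nothing is left once the host and the last token are removed, so it returned an empty string instead of 0.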