In R, regex that ignores some punctuation at the end of a URL string
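Is it possible to use a regex to ignore some punctuation (but not "/"s) at the end of a URL string (i.e. where the URL is followed by a space) when extracting it? When extracting URLs, I get periods, parentheses, question marks, and exclamation marks at the end of the extracted strings, for example: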
findURL <- function(x){
m <- gregexpr("http[^[:space:]]+", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!"
findURL(x)
[1] http://bit.ly/SS/VUEr). http://bit.ly/14pwinr)? http://bit.ly/108vJOM!
and
findURL2 <- function(x){
m <- gregexpr("www[^[:space:]]+", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
y <- "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"
findURL2(y)
[1] www.example.com/store/locator. www.example.com/Google/Voice. www.example.com/network.
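Is there a way to modify these functions so that a . ) ? ! or , (or, if possible, a ). )? )! or ),) found at the end of the URL string and followed by a space is not extracted? That is, if a period, parenthesis, question mark, exclamation mark, or comma sits at the end of the URL string and is followed by a space, it should be left out.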
Using a positive lookahead, you can also combine the two functions into one:
findURL <- function(x){
m <- gregexpr("\\b(?:www|http)[^[:space:]]+?(?=[^\\s\\w]*(?:\\s|$))", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!"
y <- "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"
findURL(x)
findURL(y)
# [1] "http://bit.ly/SS/VUEr http://bit.ly/14pwinr http://bit.ly/108vJOM"
# [1] "www.example.com/store/locator www.example.com/Google/Voice www.example.com/network"
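Note that the lookahead `[^\\s\\w]*` treats every trailing non-word character as droppable, so a URL that ends in "/" loses that slash too. If trailing slashes should survive, as the question asks, one alternative is to match greedily first and then trim only the listed punctuation. The following is a minimal sketch of that two-step idea, not the answer's method; the name `findURL3` and the exact set of trimmed characters `. , ! ? )` are assumptions made for illustration.

```r
# Hypothetical helper (not part of the original answer): match first, trim second.
findURL3 <- function(x) {
  # Step 1: grab every run of non-space characters that starts with www or http.
  m <- gregexpr("\\b(?:www|http)[^[:space:]]+", x, perl = TRUE)
  w <- unlist(regmatches(x, m))
  # Step 2: strip trailing . , ! ? ) characters only; a trailing "/" is kept.
  w <- sub("[.,!?)]+$", "", w)
  paste(w, collapse = " ")
}

x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!"
findURL3(x)
findURL3("see http://example.com/docs/). next")
```

On the `x` from the question this should give the same three URLs as the lookahead version, while the second call should keep the slash at the end of `http://example.com/docs/`. Any punctuation not in the character class (a trailing `;` or `]`, for instance) is left attached, so the class would need extending for other inputs.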