How to check if a string is the title of a Wikipedia article with R?
Suppose I have a list of strings:
strings <- c("dog", "cat", "animal", "bird", "birds", "bqpohd", "ohphha", "mqphihpha", "aphhphohpa", "pohha")
I want to check whether each of these strings is the title of a Wikipedia article.
Here is a solution, but I don't think it is the fastest way to do this for a long list:
results.df <- CheckIfAStringIsTheTitleOfAWikipediaArticle(strings)
View(results.df)
CheckIfAStringIsTheTitleOfAWikipediaArticle <- function(strings){
  library(httr)
  library(xml2)
  start_time <- Sys.time()
  Check <- function(string){
    GetPageID <- function(string){
      # Ask the MediaWiki API for the page with this title
      query <- paste0("https://en.wikipedia.org/w/api.php?",
                      "action=query", "&format=xml", "&titles=", string)
      answer <- httr::GET(query)
      page.xml <- xml2::read_xml(answer)
      nodes <- xml_find_all(page.xml, ".//query//pages//page")
      # Existing pages carry a pageid attribute; missing pages do not
      pageid <- xml_attr(nodes, "pageid", ns = character(),
                         default = NA_character_)
      return(pageid)
    }
    IsValidPageName <- function(string){
      pageid <- GetPageID(string)
      if(!is.na(pageid)){return(TRUE)}
      else{return(FALSE)}
    }
    boolean <- IsValidPageName(string)
    return(boolean)
  }
  # One HTTP request per string
  validTitle <- unlist(lapply(strings, Check))
  results.df <- data.frame(strings, validTitle)
  end_time <- Sys.time()
  time <- end_time - start_time
  print(time)
  return(results.df)
}
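One idea for cutting the run time on a long list is batching: as far as I know, the MediaWiki query API accepts several titles in a single request, joined with "|" (typically up to 50 per call), so the number of HTTP round trips could be reduced substantially. Below is a rough, untested sketch of that idea; the helper name CheckTitlesBatched and the use of jsonlite are just for illustration.

library(httr)
library(jsonlite)

CheckTitlesBatched <- function(strings, batch_size = 50) {
  found <- character(0)
  batches <- split(strings, ceiling(seq_along(strings) / batch_size))
  for (batch in batches) {
    resp <- httr::GET("https://en.wikipedia.org/w/api.php",
                      query = list(action = "query", format = "json",
                                   titles = paste(batch, collapse = "|")))
    parsed <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"),
                                 simplifyVector = FALSE)
    pages <- parsed$query$pages
    # Pages that exist come back with a pageid; missing ones do not
    hits <- vapply(Filter(function(p) !is.null(p$pageid), pages),
                   function(p) p$title, character(1))
    # The API normalises titles (e.g. "dog" -> "Dog"); map hits back to the inputs
    for (n in parsed$query$normalized) {
      if (n$to %in% hits) found <- c(found, n$from)
    }
    found <- c(found, intersect(batch, hits))
  }
  data.frame(strings, validTitle = strings %in% found)
}

results.df <- CheckTitlesBatched(strings)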
Thank you very much for your help!
Here is a base R approach.
Download all of the English titles from Wikipedia to a temporary file, then scan them into memory. It is about 1.2 GB.
I assume you don't care about case, so we use tolower to make all of the titles lowercase, and then simply check membership with %in%.
strings <- c("dog", "cat", "animal", "bird", "birds", "bqpohd", "ohphha", "mqphihpha", "aphhphohpa", "pohha")
url <- "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz"
tmp <- tempfile()
download.file(url,tmp)
titles <- scan(gzfile(tmp),character())
titles <- tolower(titles)
strings[strings %in% titles]
[1] "dog" "cat" "animal" "bird" "birds"
#Reasonably fast
system.time(strings[strings %in% titles])
user system elapsed
1.494 0.029 1.525
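One caveat with the dump-based approach: titles in the all-titles file use underscores in place of spaces, so multi-word strings would need the same normalisation before the %in% check. A small sketch, assuming the dump keeps that underscore convention:

# Convert input strings to the dump's title format before matching
strings_norm <- gsub(" ", "_", tolower(strings))
strings[strings_norm %in% titles]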