Scraping multiple webpages using getURIAsynchronous()

I am new to R. I am trying to scrape multiple https webpages using the getURIAsynchronous() function from the RCurl package. However, for every URL the function returned "" as the result.

I tried the url.exists() function from the same package to see whether it returns TRUE or FALSE. To my surprise, it returned FALSE, even though the URLs do exist.

Since these https URLs are specific to my company, I cannot provide an example here for confidentiality reasons. Using readLines() does successfully extract all of the html content from the sites, but it is slow and time-consuming for thousands of URLs. Any idea why getURIAsynchronous() returns "" instead of the html content? My focus here is only to scrape the entire html content; I can parse the data myself.
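One thing worth ruling out first: with https, RCurl often comes back empty when redirects or TLS settings are not handled explicitly. Below is a minimal sketch, assuming the blank results stem from missing curl options; the option names (followlocation, ssl.verifypeer, timeout) are standard RCurl/libcurl options and not something confirmed by the question.

library(RCurl)

urls <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
          "https://cran.r-project.org/web/packages/rvest/index.html")

# pass curl options explicitly: follow redirects and set a timeout;
# ssl.verifypeer controls certificate checking
opts <- curlOptions(followlocation = TRUE,
                    ssl.verifypeer = TRUE,
                    timeout = 60)

res <- getURIAsynchronous(urls, .opts = opts)
nchar(res)   # non-zero lengths mean the html actually came back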

Is there any other package that can help me scrape multiple https websites faster, rather than scraping one page at a time?
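Not part of the original question, but as one possible alternative: the curl package has a multi interface that performs many requests concurrently and handles https out of the box. A hedged sketch, using the CRAN pages from the example further down as stand-in URLs:

library(curl)

urls <- c("https://cran.r-project.org/web/packages/RCurl/index.html",
          "https://cran.r-project.org/web/packages/rvest/index.html")

pool <- new_pool()
results <- list()

make_done <- function(u) {
  force(u)   # fix the url for this callback
  function(res) results[[u]] <<- rawToChar(res$content)
}

for (u in urls) {
  # schedule each request; the done callback stores the body as text
  curl_fetch_multi(u, done = make_done(u),
                   fail = function(msg) message("failed: ", msg),
                   pool = pool)
}

multi_run(pool = pool)   # performs all scheduled requests concurrently
sapply(results, nchar)   # full html of each page, as a character string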

Update: Below is a small example similar to what I have been trying to do. In this case it is only a couple of URLs to scrape, but in my project I have several thousand. When I try to extract the text using code similar to the one below, I get "" for all of the URLs.

library(RCurl)

source_url <- c("https://cran.r-project.org/web/packages/RCurl/index.html", "https://cran.r-project.org/web/packages/rvest/index.html")

multi_urls <- getURIAsynchronous(source_url)
multi_urls <- as.list(multi_urls)

I don't know which specific URLs you are scraping from, but the code below demonstrates how to loop through multiple URLs and scrape data from each one. Maybe you can leverage this code towards your specific goal.

library(rvest)
library(stringr)

#create a master dataframe to store all of the results
complete <- data.frame()

yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015")
#position is not needed since all of the info is stored on the page
#positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s")
positionVector <- c("qb")
for (i in 1:length(yearsVector)) {
    for (j in 1:length(positionVector)) {
        # create a url template 
        URL.base <- "http://www.nfl.com/draft/"
        URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:"
        #create the dataframe with the dynamic values
        URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j])
        #print(URL)

        #read the page - store the page to make debugging easier
        page <- read_html(URL)

        #find records for each player
        playersloc <- str_locate_all(page, "\\{\"personId.*?\\}")[[1]]
        #use the start positions [, 1] and end positions [, 2], trimming the braces
        players <- str_sub(page, playersloc[, 1] + 1, playersloc[, 2] - 1)
        #fix the cases where the players are named Jr.
        players <- gsub(", ", "_", players)

        #split and reshape the data in a data frame
        play2 <- strsplit(gsub("\"", "", players), ',')
        data <- sapply(strsplit(unlist(play2), ":"), FUN = function(x) { x[2] })
        df <- data.frame(matrix(data, ncol = 16, byrow = TRUE))
        #name the column names
        names(df) <- sapply(strsplit(unlist(play2[1]), ":"), FUN = function(x) { x[1] })


        #store the temp values into the master dataframe
        complete <- rbind(complete, df)
    }
}
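Not in the original answer, but since your real task involves thousands of your own URLs rather than URLs built up in a loop, the same read_html() call can be wrapped in lapply() to collect the raw html of every page for later parsing. A sketch, where my_urls is a hypothetical placeholder for your company's URL vector:

library(rvest)

my_urls <- c("https://example.com/page1", "https://example.com/page2")  # placeholder URLs
# fetch each page and keep its full html as a character string
pages <- lapply(my_urls, function(u) as.character(read_html(u)))
names(pages) <- my_urls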

And also . . .

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]


jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    draft <- html_table(draft_table)[[1]]
})

finaldf <- do.call(rbind, dfList)
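One small caveat on the final step (not from the original answer): do.call(rbind, ...) requires every table to have identical column names. If the tables returned for different offsets ever differ, dplyr::bind_rows() is a more forgiving alternative that fills missing columns with NA:

library(dplyr)

# equivalent to do.call(rbind, dfList) when all tables match,
# but tolerant of tables whose columns differ across pages
finaldf <- bind_rows(dfList)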