R Rvest for() and Error server error: (503) Service Unavailable

Question

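I'm new to web scraping, but I'm enjoying using rvest in R. I'm trying to use it to scrape specific data about companies. I wrote a for loop over 171 URLs, and when I run it, it stops at the 6th or 7th URL with this error: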

Error in parse.response(r, parser, encoding = encoding) : 
  server error: (503) Service Unavailable

When I restart the loop from the 7th URL, it runs for two or three more iterations and then stops again with the same error. My loop:

library(rvest)    
thing<-c("http://www.informazione-aziende.it/Azienda_ LA-VIS-S-C-A",                                                                                  
    "http://www.informazione-aziende.it/Azienda_ L-ANGOLO-DEL-DOLCE-DI-OBEROSLER-MARCO",                                                         
    "http://www.informazione-aziende.it/Azienda_ MARCHI-LAURA",                                                                                 
    "http://www.informazione-aziende.it/Azienda_ LAVIS-PIZZA-DI-GASPARETTO-MATTEO",                                                              
    "http://www.informazione-aziende.it/Azienda_ LE-DELIZIE-MOCHENE-DI-OSLER-NICOLA",                                                            
    "http://www.informazione-aziende.it/Azienda_ LE-DELIZIE-S-N-C-DI-GAMBONI-PIETRO-E-PISONI-MAURO-C-IN-SIGLA-LE-DELIZIE-S-N-C",                 
    "http://www.informazione-aziende.it/Azienda_ LE-FONTI-DISTILLATI-DI-COVI-MARCELLO",                                                          
    "http://www.informazione-aziende.it/Azienda_ LE-MIGOLE-DI-MATTEOTTI-LUCA",                                                                   
    "http://www.informazione-aziende.it/Azienda_ LECHTHALER-DI-TOGN-LUIGI-E-C-S-N-C",                                                            
    "http://www.informazione-aziende.it/Azienda_ LETRARI-AZ-AGRICOLA")

    thing<-gsub(" ", "", thing)

    d <- matrix(nrow=10, ncol=4)
    colnames(d)<-c("RAGIONE SOCIALE",'ATTIVITA', 'INDIRIZZO', 'CAP')

    for(i in 1:10) {
            a<-thing[i]

            urls<-html(a)

            d[i,2] <- try({ urls %>% html_node(".span") %>% html_text() }, silent=TRUE)
    }

Is there a way to avoid this error? Any help would be appreciated; thanks in advance.

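Update: with the following code I tried to use repeat() to resume fetching from the last successful iteration, but it loops forever. Any advice would be welcome.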

    for(i in 1:10) {

  a<-thing[i]

  try({d[i,2]<- try({html(a) }, silent=TRUE)  %>%
         html_node(".span") %>%
         html_text() }, silent=TRUE)

  repeat {try({d[i,2]<- try({html(a) }, silent=TRUE)  %>%
                 html_node(".span") %>%
                 html_text() }, silent=TRUE)}
  if (!is.na(d[i,2])) break
}

Or with while():

for(i in 1:10) {

  a<-thing[i]

while (is.na(d[i,2])) {
  try({d[i,2]<-try({html(a) %>%html_node(".span")},silent=TRUE) %>% html_text() },silent=TRUE)
}
}

The while() version works, but not very well, and it is too slow.

Answer 1

It looks like you get a 503 if you hit that site too quickly. I added a Sys.sleep(2) and all 10 iterations worked for me:

library(rvest)    
thing<-c("http://www.informazione-aziende.it/Azienda_ LA-VIS-S-C-A",                                                                                  
         "http://www.informazione-aziende.it/Azienda_ L-ANGOLO-DEL-DOLCE-DI-OBEROSLER-MARCO",                                                         
         "http://www.informazione-aziende.it/Azienda_ MARCHI-LAURA",                                                                                 
         "http://www.informazione-aziende.it/Azienda_ LAVIS-PIZZA-DI-GASPARETTO-MATTEO",                                                              
         "http://www.informazione-aziende.it/Azienda_ LE-DELIZIE-MOCHENE-DI-OSLER-NICOLA",                                                            
         "http://www.informazione-aziende.it/Azienda_ LE-DELIZIE-S-N-C-DI-GAMBONI-PIETRO-E-PISONI-MAURO-C-IN-SIGLA-LE-DELIZIE-S-N-C",                 
         "http://www.informazione-aziende.it/Azienda_ LE-FONTI-DISTILLATI-DI-COVI-MARCELLO",                                                          
         "http://www.informazione-aziende.it/Azienda_ LE-MIGOLE-DI-MATTEOTTI-LUCA",                                                                   
         "http://www.informazione-aziende.it/Azienda_ LECHTHALER-DI-TOGN-LUIGI-E-C-S-N-C",                                                            
         "http://www.informazione-aziende.it/Azienda_ LETRARI-AZ-AGRICOLA")

thing<-gsub(" ", "", thing)

d <- matrix(nrow=10, ncol=4)
colnames(d)<-c("RAGIONE SOCIALE",'ATTIVITA', 'INDIRIZZO', 'CAP')

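for(i in 1:10) {
  print(i)                  # show progress
  a <- thing[i]
  urls <- read_html(a)      # read_html() replaces html(), which is deprecated in newer rvest versions
  d[i,2] <- try({ urls %>% html_node(".span") %>% html_text() }, silent=TRUE)
  Sys.sleep(2)              # pause between requests so the server does not answer with a 503
}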
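
If an occasional 503 still gets through despite the pause, a bounded retry may be safer than the repeat/while attempts in the question. In the repeat version the `break` sits outside the braces of `repeat`, so the loop can never exit. The following is only a sketch: `retry_with_pause` is a hypothetical helper name, not part of rvest, and the attempt limit and pause lengths are assumptions you would tune for this site.

```r
# Hypothetical helper (not part of rvest): call `fetch` until it succeeds
# or `max_tries` attempts have been used, pausing between attempts.
retry_with_pause <- function(fetch, max_tries = 5, pause = 2) {
  for (attempt in seq_len(max_tries)) {
    # Turn any error (such as a 503) into NULL instead of stopping the loop.
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) return(result)          # success: stop retrying
    if (attempt < max_tries) Sys.sleep(pause * attempt)  # wait longer after each failure
  }
  NA_character_  # give up; the caller gets NA instead of an endless loop
}

# Usage sketch with the `thing` and `d` objects defined above (needs network access):
# library(rvest)
# for (i in seq_along(thing)) {
#   d[i, 2] <- retry_with_pause(function() {
#     read_html(thing[i]) %>% html_node(".span") %>% html_text()
#   })
#   Sys.sleep(2)
# }
```

Because the number of attempts is capped, a URL that keeps failing leaves an NA in `d` and the loop moves on to the next one, so those rows can be retried later.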
