Web scraping returns a 403 error after passing different headers
I am trying to scrape a website using a package in R.
When I run the following:
library(idealisto) #https://github.com/hmeleiro/idealisto
get_city("https://www.idealista.com/alquiler-viviendas/madrid-madrid/", "sale")
I get:
Error in read_html.response(.) : Forbidden (HTTP 403).
Looking more closely at the function get_city(), I found that the problem comes from the following part of the code:
desktop_agents <- c("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(`user-agent` = desktop_agents[sample(1:10, 1)]))
which returns the following output:
Response [https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
  Date: 2022-04-04 18:52
  Status: 403
  Content-Type: application/json;charset=utf-8
  Size: 360 B
However, I should get Status: 200. I tried passing some headers manually, but I still get the same Status error:
headers = c(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
x <- GET(url, add_headers(headers))
Any idea how to solve this Status error?
Your syntax for add_headers is wrong. You can't pass a named vector; you have to pass the arguments directly to the function:
library(httr)
headers <- add_headers(
'accept' = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' = 'gzip, deflate, br',
'accept-language' = 'es-ES,es;q=0.9,en;q=0.8',
'cache-control' = 'max-age=0',
'referer' = 'https://www.idealista.com/en/',
'sec-fetch-mode' = 'navigate',
'sec-fetch-site' = 'same-origin',
'sec-fetch-user' = '?1',
'upgrade-insecure-requests' = '1',
'user-agent' = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
)
url = "https://www.idealista.com/en/venta-viviendas/madrid-provincia/"
GET(url, headers)
#> Response [https://www.idealista.com/en/venta-viviendas/madrid-provincia/]
#> Date: 2022-04-04 19:10
#> Status: 200
#> Content-Type: text/html; charset=UTF-8
#> Size: 263 kB
#> <!DOCTYPE html>
#> <html lang="en" env="es" username="" data-userauth="false" class="">
#> <head>
#> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
#> <title>Property for sale in Madrid province, Spain: houses and flats — ...
#> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
#> <meta name="description" content="37,980 houses and flats for sale in Madrid,...
#> <meta name="author" content="idealista.com">
#> <meta http-equiv="cleartype" content="on">
#> <meta name="pragma" content="no-cache"/>
#> ...
Created on 2022-04-04 by the reprex package (v2.0.1)
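If the headers already live in a named character vector, as in the question, httr's add_headers() also documents a `.headers` argument meant for that case. The sketch below is not part of the original answer. It only builds the request config, without touching the network, so the attached headers can be checked before calling GET(). The header list is shortened for illustration, and reading `cfg$headers` relies on the internal layout of httr's request object, so treat that inspection step as an assumption. Whether the site then answers with 200 is up to the server and is not guaranteed.

```r
library(httr)

# Headers kept in a named character vector, as in the question (shortened here).
headers <- c(
  "accept-language" = "es-ES,es;q=0.9,en;q=0.8",
  "referer" = "https://www.idealista.com/en/",
  "user-agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

# Build the request config only; no network call happens here.
cfg <- add_headers(.headers = headers)

# Inspect the header names that would be attached to a request.
print(names(cfg$headers))

# The config can then be passed to GET() in the same way as in the answer:
# resp <- GET("https://www.idealista.com/en/venta-viviendas/madrid-provincia/", cfg)
# status_code(resp)
```

Checking the config first makes it easier to tell a header-passing mistake apart from the site rejecting the request for other reasons.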