net/http automatically redirects a webpage to another language

I'm trying to use open-uri to scrape data from:

https://www.zomato.com/grande-lisboa/fu-hao-massamá

However, the site automatically redirects to:

https://www.zomato.com/pt/grande-lisboa/fu-hao-massamá

I don't want the Portuguese version; I want the English one. How do I tell Ruby to stop doing this?

This is called content negotiation: the web server chooses where to redirect based on your request. pt (Portuguese) appears to be the default, at least from my location:

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=pt; ...
Location: https://www.zomato.com/pt/grande-lisboa/fu-hao-massam%C3%A1

You can request another language by sending an Accept-Language header. Here is the response for Accept-Language: es (Spanish):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: es"
HTTP/1.1 301 Moved Permanently
Set-Cookie: zl=es_cl; ...
Location: https://www.zomato.com/es/grande-lisboa/fu-hao-massam%C3%A1

And here is the response for Accept-Language: en (English):

$ curl -I https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1 -H "Accept-Language: en"
HTTP/1.1 200 OK
Set-Cookie: zl=en; ...

This appears to be the resource you are looking for: a 200 OK with no redirect.

In Ruby, you would use:

require 'nokogiri'
require 'open-uri'

url = 'https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1'
headers = {'Accept-Language' => 'en'}

# Use URI.open: since Ruby 3.0, Kernel#open no longer handles URLs via open-uri.
doc = Nokogiri::HTML(URI.open(url, headers))
doc.at('html')[:lang]
#=> "en"
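
Since the question mentions net/http, here is a minimal sketch of the same check as the curl -I calls above, written with Net::HTTP directly. It may be useful for debugging, because Net::HTTP does not follow redirects on its own, so a 301 and its Location header come back as-is. The helper names build_language_probe and probe_language are hypothetical and not part of any library, and the response you get will depend on the server and on your location.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a HEAD request that carries the desired
# Accept-Language header. Nothing is sent over the network here.
def build_language_probe(url, language)
  uri = URI(url)
  request = Net::HTTP::Head.new(uri)
  request['Accept-Language'] = language
  [uri, request]
end

# Hypothetical helper: send the request and return the status code plus the
# Location header (nil when the server does not redirect).
# Net::HTTP returns a 3xx response as-is instead of following it.
def probe_language(url, language)
  uri, request = build_language_probe(url, language)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  [response.code, response['Location']]
end

# Example call (needs network access):
#   code, location = probe_language('https://www.zomato.com/grande-lisboa/fu-hao-massam%C3%A1', 'en')
```

If the status code is a 3xx, the Location value shows which language variant the server picked for that Accept-Language value.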