pycurl and curl behave differently when requesting the same resource: curl correctly returns a JSON object, PycURL returns an HTML page

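ipinfo.io provides information about the website/server behind an IP address, either when you enter the address on their website or when you send them a request with the curl command-line utility, for example: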

$ curl  https://ipinfo.io/172.217.169.6

The output, formatted as JSON:

{
  "ip": "172.217.169.68",
  "hostname": "lhr48s09-in-f4.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
}

What I ultimately want is to do this in Python and store the result as a JSON object. I believed the code below, using PycURL, should produce the same output:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6")
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()

body = buffer.getvalue()
print(body.decode('iso-8859-1'))

That is, it should write the same JSON string to the buffer.

However, it prints a large amount of HTML instead. I suspect PycURL is receiving the HTML of the actual web page rather than the JSON data. For example:

<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...
    

</html>

Basically, how can I get PycURL to receive this JSON data as well?



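I tried comparing the verbose output of the two, but I can't work out why they behave differently. The only difference I noticed is the content-type response header: "application/json" for curl and "text/html" for PycURL, which explains the different output. At the risk of making this post very long, I have included both outputs below: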

curl (command line) verbose output:

$ curl -v https://ipinfo.io/172.217.169.6
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a887a40e10)
> GET /172.217.169.6 HTTP/2
> Host: ipinfo.io
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: application/json; charset=utf-8
< content-length: 286
< date: Tue, 27 Jul 2021 21:03:50 GMT
< x-envoy-upstream-service-time: 1
< via: 1.1 google
< alt-svc: clear
< 
{
  "ip": "172.217.169.6",
  "hostname": "lhr25s26-in-f6.1e100.net",
  "city": "London",
  "region": "England",
  "country": "GB",
  "loc": "51.5085,-0.1257",
  "org": "AS15169 Google LLC",
  "postal": "EC1A",
  "timezone": "Europe/London",
  "readme": "https://ipinfo.io/missingauth"
* Connection #0 to host ipinfo.io left intact
}

PycURL verbose output:

$ python3 ip_helper.py
*   Trying 34.117.59.81:443...
* TCP_NODELAY set
* Connected to ipinfo.io (34.117.59.81) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ipinfo.io
*  start date: Jul 10 20:18:59 2021 GMT
*  expire date: Oct  8 21:18:59 2021 GMT
*  subjectAltName: host "ipinfo.io" matched cert's "ipinfo.io"
*  issuer: C=US; O=Google Trust Services LLC; CN=GTS CA 1D4
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x19d65c0)
> GET /172.217.169.6 HTTP/2
Host: ipinfo.io
user-agent: PycURL/7.43.0.6 libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
accept: */*

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< access-control-allow-origin: *
< x-frame-options: DENY
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: strict-origin-when-cross-origin
< content-type: text/html; charset=utf-8
< content-length: 44645
< date: Tue, 27 Jul 2021 21:07:50 GMT
< x-envoy-upstream-service-time: 13
< via: 1.1 google
< alt-svc: clear
< 
* Connection #0 to host ipinfo.io left intact
<!DOCTYPE html>
<html>
<head>
    <title>
    172.217.169.6 IP Address Details
 - IPinfo.io</title>
    <meta charset="utf-8">
    <meta name="apple-itunes-app" content="app-id=917634022">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
    <meta name="description" content="
    
        Full IP address details for 172.217.169.6 (AS15169 Google LLC) including geolocation and map, hostname, and API details.
    
">

    <link rel="manifest" href="/static/manifest.json">
    <link rel="icon" sizes="48x48" href="/static/deviceicons/android-icon-48x48.png">


...

</html>

Thanks for your time.

From the docs:

We try to automatically detect when someone wants to call our API versus view our website, and then we send back the appropriate JSON response rather than HTML. We do this based on the user agent for known popular programming languages, tools, and frameworks. However, there are a couple of other ways to force a JSON response when it doesn't happen automatically. One is to add /json to the URL, and the other is to set an Accept header to application/json

So it looks like there are three different ways to get JSON back with pycurl.

  1. Append /json to your URL:
c.setopt(c.URL, "https://ipinfo.io/172.217.169.6/json")
  2. Set your Accept header so that you ask for a JSON reply:
c.setopt(c.HTTPHEADER, ["Accept: application/json"])
  3. Set your User-Agent header so the site thinks it is talking to curl rather than pycurl:
c.setopt(c.HTTPHEADER, ["User-Agent: curl"])
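
Putting the pieces together, here is a minimal sketch of how the request could look end to end. The helper names fetch_ip_info and parse_json_body are made up for illustration and are not part of pycurl or ipinfo.io. The sketch combines the /json suffix with the Accept header, which is more than strictly needed, and it assumes the service labels its JSON replies with an application/json content type, as in the curl output in the question. Note that each call to setopt with HTTPHEADER replaces the previous header list, so if you want several custom headers, pass them in one list.

```python
import json
from io import BytesIO
from typing import Optional


def parse_json_body(body: bytes, content_type: Optional[str]) -> dict:
    """Decode a response body as JSON, refusing anything not labelled as JSON."""
    if not content_type or "application/json" not in content_type.lower():
        raise ValueError(f"expected JSON, got content type {content_type!r}")
    return json.loads(body.decode("utf-8"))


def fetch_ip_info(ip: str) -> dict:
    """Hypothetical helper: ask ipinfo.io for JSON about `ip` via pycurl."""
    # Imported here so parse_json_body can be used without pycurl installed.
    import pycurl

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        # Option 1 from the list above: the /json suffix.
        c.setopt(c.URL, f"https://ipinfo.io/{ip}/json")
        # Option 2 from the list above: an explicit Accept header.
        c.setopt(c.HTTPHEADER, ["Accept: application/json"])
        c.setopt(c.WRITEDATA, buffer)
        c.perform()
        # Read the response content type before the handle is closed.
        content_type = c.getinfo(pycurl.CONTENT_TYPE)
    finally:
        c.close()
    return parse_json_body(buffer.getvalue(), content_type)
```

Calling fetch_ip_info("172.217.169.6") should then give you a dict, so something like info["org"] can be read directly. If the site still answers with HTML, the content-type check raises a ValueError instead of failing later inside json.loads.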