Screen scraping a webserver using its IP address instead of its domain name

Is this possible? It works with baseUrl = "http://mashable.com", but not when I give it an IP address.

<script src='https://raw.github.com/padolsey/jQuery-Plugins/master/cross-domain-ajax/jquery.xdomainajax.js'></script>
<script>$(document).ready(function () {

baseUrl = "https://12.34.56.78:8000/";
$.ajax({
    url: baseUrl,
    type: "get",
    dataType: "",
    success: function (data) {
        alert("Yeah we are om jere");
    }
});
});
</script>

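This will be difficult, because many websites may be hosted on the same server and therefore share the same IP address. It works with a domain name because your client sends that name along with the GET request, in the Host header.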
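
To make the point about shared IPs concrete, here is a minimal sketch of how a server could choose a site from the Host header alone. The site names, the `sites` table and the `respondFor` function are hypothetical, made up for illustration; real servers (Apache, nginx and so on) do this through their own configuration, not through this code.

```javascript
// Hypothetical table of sites sharing one IP address (names are made up).
const sites = {
  "site-a.example": "Welcome to site A",
  "site-b.example": "Welcome to site B",
};

// Pick a response from the Host header alone, the way name-based
// virtual hosting does. A trailing ":port", if present, is ignored.
function respondFor(hostHeader) {
  const name = String(hostHeader || "").toLowerCase().replace(/:\d+$/, "");
  if (Object.prototype.hasOwnProperty.call(sites, name)) {
    return { status: 200, body: sites[name] };
  }
  // A bare IP address in the Host header matches no configured site.
  return { status: 404, body: "Not Found" };
}

console.log(respondFor("site-a.example").status);    // 200
console.log(respondFor("203.0.113.10:8000").status); // 404
```

In this sketch a request that names only the IP address falls through to the 404 branch, because nothing in the request says which of the hosted sites is wanted.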

Look at the curl output for Stack Overflow:

C:\Users\Yeah>curl --head -i -v stackoverflow.com/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to stackoverflow.com (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: stackoverflow.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

You can see that the domain name is passed as a header. Conversely, if I query using the IP address found above, the result is a 404 error:

C:\Users\Yeah>curl --head -i -v 198.252.206.140/
* Hostname was NOT found in DNS cache
*   Trying 198.252.206.140...
* Connected to 198.252.206.140 (198.252.206.140) port 80 (#0)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 198.252.206.140
> Accept: */*
>
< HTTP/1.1 404 Not Found
HTTP/1.1 404 Not Found
< [...]
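
If you control the client, one possible workaround is to connect to the IP address while still naming the site in the Host header. The sketch below uses Node.js, since browsers do not let scripts set the Host header on an XMLHttpRequest, so this cannot be dropped into the jQuery snippet from the question. The functions `optionsFor` and `headViaIp`, the address 203.0.113.10 and the name site-a.example are placeholders assumed for illustration; whether the server then answers as it does for the domain still depends on its configuration.

```javascript
const http = require("http");

// Build request options that connect to `ip` but name `hostname` in the
// Host header, so the server can still tell which site is wanted.
function optionsFor(ip, hostname, path) {
  return {
    host: ip,                    // where the TCP connection goes
    port: 80,
    method: "HEAD",
    path: path || "/",
    headers: { Host: hostname }, // which site we are asking for
  };
}

// Send the request and report the status code through `callback`.
function headViaIp(ip, hostname, callback) {
  const req = http.request(optionsFor(ip, hostname), (res) => {
    res.resume(); // discard any body so the socket is released
    callback(null, res.statusCode);
  });
  req.on("error", callback);
  req.end();
}

// Example call (needs network access; both values are placeholders):
// headViaIp("203.0.113.10", "site-a.example", (err, status) => {
//   console.log(err || status);
// });
```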

As a counter-example, if I try something similar with the Facebook website, I get the following:

C:\Users\Yeah>curl --head -i -v --insecure -L https://www.facebook.com/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

Then, if I try using the IP address above:

C:\Users\Yeah>curl --head -i -v --insecure -L https://31.13.93.3/
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to 31.13.93.3 (31.13.93.3) port 443 (#0)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: 31.13.93.3
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< Location: http://www.facebook.com/
Location: http://www.facebook.com/
< [...]

<
* Connection #0 to host 31.13.93.3 left intact
* Issue another request to this URL: 'http://www.facebook.com/'
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 80 (#1)
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
HTTP/1.1 301 Moved Permanently
< [...]

<
* Connection #1 to host www.facebook.com left intact
* Issue another request to this URL: 'https://www.facebook.com/'
* Found bundle for host www.facebook.com: 0x6097814fe0
* Hostname was NOT found in DNS cache
*   Trying 31.13.93.3...
* Connected to www.facebook.com (31.13.93.3) port 443 (#2)
* [SSL stuff ...]
> HEAD / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: www.facebook.com
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< [...]

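Both -L (follow redirects) and --insecure (accept any certificate) are needed here for curl to eventually reach the Facebook website; following redirects is what ordinary clients (i.e. browsers) do anyway.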
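
As a rough illustration of what following redirects involves, the sketch below replays the redirect chain from the trace above without touching the network. The `recorded` table and the `followRedirects` function are assumptions made for this example, and the `lookup` parameter is a hypothetical stand-in for sending one HEAD request; this is not how curl itself is implemented.

```javascript
// Replay of the redirect chain from the trace above, keyed by request URL.
// A real client would get these answers from the network instead.
const recorded = {
  "https://31.13.93.3/": { status: 301, location: "http://www.facebook.com/" },
  "http://www.facebook.com/": { status: 301, location: "https://www.facebook.com/" },
  "https://www.facebook.com/": { status: 200 },
};

// Follow Location headers until a non-redirect answer or the hop limit.
// `lookup` is a hypothetical stand-in for sending one HEAD request.
function followRedirects(startUrl, lookup, maxHops = 10) {
  const visited = [];
  let url = startUrl;
  for (let hop = 0; hop <= maxHops; hop++) {
    visited.push(url);
    const res = lookup(url);
    const isRedirect = res.status >= 300 && res.status < 400 && res.location;
    if (!isRedirect) return { status: res.status, visited };
    // Location may be relative, so resolve it against the current URL.
    url = new URL(res.location, url).href;
  }
  throw new Error("too many redirects");
}

const result = followRedirects("https://31.13.93.3/", (u) => recorded[u]);
console.log(result.visited.join(" -> "), result.status);
```

The hop limit is there because a misconfigured server can redirect in a loop; a client that follows redirects without a limit would never return.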

So it really depends on the specific website you want to screen scrape and on how its server is configured.