Python Program Scraping Different Text Despite Webpage Not Changing

Question

This code tries to scrape an Amazon listing to check whether the item is available from Amazon itself as the first-party seller.

from lxml import html
from time import sleep
import requests
import time

Amazonurl = raw_input("Item URL: ")

page = requests.get(Amazonurl)
tree = html.fromstring(page.text)

Stock = tree.xpath('//*[@id="merchant-info"]/text()')
IfInstock = ''.join(Stock)


if 'Ships from and sold by Amazon.com.' in IfInstock:
    print 'Instock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")

else:
    print 'Not in Stock'
    print time.strftime("%a, %d %b %Y %H:%M:%S")

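Oddly, when I feed it, say, http://www.amazon.com/New-Nintendo-3DS-XL-Black/dp/B00S1LRX3W/ref=sr_1_1?ie=UTF8&qid=1438413018&sr=8-1&keywords=new+3ds, which has not been out of stock for the past few days, the code sometimes prints "Instock" and other times prints "Not in Stock". I found that this is because the code often scrapes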

[]

while at other times it scrapes the following, as it should.

['\n    \n    \n\n    \n        \n        \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n        Ships from and sold by Amazon.com.\n    \n    \n        \n        \n        \n        \n        \n        \n        Gift-wrap available.\n        \n\n']

The webpage does not appear to change. Does anyone know why my output changes so often, and perhaps how to fix it? Thanks in advance.

Answer 1

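Amazon is refusing to serve you this page.

I added a line to your script that prints the response's status_code, and wrapped the request in a loop, to see what the status is when you get the odd result.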

from lxml import html
from time import sleep
import requests
import time

Amazonurl = "http://www.amazon.com/dp/B00S1LRX3W/?tag=stackoverfl08-20"
intent = 0
while True:
    page = requests.get(Amazonurl)
    tree = html.fromstring(page.text)

    print(page.status_code)

    Stock = tree.xpath('//*[@id="merchant-info"]/text()')
    IfInstock = ''.join(Stock)

    if 'Ships from and sold by Amazon.com.' in IfInstock:
        print('Instock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))

    else:
        print('Not in Stock')
        print(time.strftime("%a, %d %b %Y %H:%M:%S"))

    time.sleep(15)

    if intent>15:
        break
    intent += 1

I ran this script with a 15-second interval between requests, as you said. Here are the results:

200
Instock
Sat, 01 Aug 2015 19:51:27
200
Instock
Sat, 01 Aug 2015 19:51:43
503
Not in Stock
Sat, 01 Aug 2015 19:51:59
200
Instock
Sat, 01 Aug 2015 19:52:15
200
Instock
Sat, 01 Aug 2015 19:52:32
200
Instock
Sat, 01 Aug 2015 19:52:48
200
Instock
Sat, 01 Aug 2015 19:53:05
200
Instock
Sat, 01 Aug 2015 19:53:22
200
Instock
Sat, 01 Aug 2015 19:53:38
200
Instock
Sat, 01 Aug 2015 19:53:55
200
Instock
Sat, 01 Aug 2015 19:54:12
200
Instock
Sat, 01 Aug 2015 19:54:29
200
Instock
Sat, 01 Aug 2015 19:54:45
200
Instock
Sat, 01 Aug 2015 19:55:02
200
Instock
Sat, 01 Aug 2015 19:55:18
200
Instock
Sat, 01 Aug 2015 19:55:35
200
Instock
Sat, 01 Aug 2015 19:55:52

You can see that whenever the result is the odd one, i.e. "Not in Stock", the status_code is 503. http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html defines it as follows:

10.5.4 503 Service Unavailable The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
  Note: The existence of the 503 status code does not imply that a
  server must use it when becoming overloaded. Some servers may wish
  to simply refuse the connection.

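In other words, Amazon is not serving you the page because you made several requests in a short period of time. When that happens, the response body does not contain the merchant-info element, so the XPath returns an empty list and the script reports "Not in Stock". The "short" period is actually not very strict on Amazon's side, which is why you get a 200 status_code most of the time.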
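As a rough sketch of how the script could cope with this, the helpers below retry a request that comes back as 503 and report "Unknown" instead of "Not in Stock" when the page was never served. The names `fetch_with_retry` and `stock_status`, and the `max_attempts`, `base_delay` and `sleep` parameters, are my own and not part of requests. The sketch assumes the response object has `status_code` and `headers` like a requests response, and it only understands a Retry-After given in seconds (an HTTP date falls back to exponential backoff).

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the response is not a 503 or attempts run out.

    `fetch` is any callable returning an object with `status_code` and
    `headers` (for example requests.get). All names here are hypothetical.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 503:
            return response
        # Use Retry-After when it is a plain number of seconds;
        # otherwise wait base_delay, 2*base_delay, 4*base_delay, ...
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        if attempt < max_attempts - 1:
            sleep(delay)
    # Every attempt returned 503; hand the last response back to the caller.
    return response


def stock_status(status_code, merchant_text):
    """Separate 'page not served' from 'item not sold by Amazon'."""
    if status_code != 200:
        return "Unknown"
    if "Ships from and sold by Amazon.com." in merchant_text:
        return "Instock"
    return "Not in Stock"
```

In your script this would be used as `page = fetch_with_retry(requests.get, Amazonurl)`, followed by `stock_status(page.status_code, IfInstock)` once the XPath text has been joined.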

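I hope that answers your question. Now, if you really want to scrape a site like Amazon, I recommend Scrapy, which is easy to use and easy to configure. You can often get away with scraping sites like Amazon by sending a random user-agent. But of course, that is just an addition to your original question.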
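To illustrate the random user-agent idea with plain requests rather than Scrapy, here is a minimal sketch. The `USER_AGENTS` list and the `random_headers` helper are hypothetical names of mine, and the strings in the list are only illustrative values that you would replace with real, current browser strings. Whether this is enough to avoid the 503 responses is something I have not verified.

```python
import random

# Illustrative values only; replace with real, current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers(user_agents=USER_AGENTS, rng=random):
    """Build a headers dict with a User-Agent picked at random from the list."""
    return {"User-Agent": rng.choice(user_agents)}
```

The result can be passed through the `headers` argument of requests, for example `requests.get(Amazonurl, headers=random_headers())`, so that each request picks a new value.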

python

lxml

amazon

web-scraping

python-requests