python lxml.html.parse not reading url

Question

Why does html.parse(url) fail, when fetching the page with requests and then calling html.fromstring works, and html.parse(url2) works as well? I am using lxml 3.4.2.

Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> from lxml import html
>>> url = 'http://www.oddschecker.com'
>>> page = requests.get(url).content
>>> tree = html.fromstring(page)
>>> html.parse(url)

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    html.parse(url)
  File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
>>> url2 = 'http://www.google.com'
>>> html.parse(url2)
<lxml.etree._ElementTree object at 0x00000000033BAF88>

Answer 1

html.parse gives up when the HTTP status is not 200.

Check the status that http://www.oddschecker.com returns.
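One way to follow this advice is to look at the status code before handing the body to lxml. The sketch below is only an illustration: the helper names tree_from_response and fetch_tree are made up for this example, and treating anything other than 200 as an error is an assumption, not lxml's own behaviour.

```python
import requests
from lxml import html


def tree_from_response(response):
    """Parse the body of a requests-style response, but only on HTTP 200."""
    if response.status_code != 200:
        # Surface the real status instead of a generic load failure.
        raise IOError('unexpected HTTP status %d for %s'
                      % (response.status_code, response.url))
    return html.fromstring(response.content)


def fetch_tree(url):
    """Download url with requests and pass the response to the check above."""
    return tree_from_response(requests.get(url))
```

Calling fetch_tree(url) with the URL from the question would then either return the parsed tree or raise an error that names the status code.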

Answer 2

Adding some clarification to @michael_stackof's answer: this particular URL returns a 403 Forbidden status code if no User-Agent header is supplied.

According to lxml's source code, it uses urllib2.urlopen() without supplying a User-Agent header, so the server responds with 403, which surfaces as the failed to load HTTP resource error.

requests, on the other hand, supplies a default User-Agent header if one is not passed explicitly:

>>> requests.get(url).request.headers['User-Agent']
'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'

To prove the point, set the User-Agent header to None and see:

>>> requests.get(url).status_code
200
>>> requests.get(url, headers={'User-Agent': None}).status_code
403
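If html.parse is still preferred over html.fromstring, one possible workaround is to open the URL yourself with an explicit User-Agent and pass the open response to html.parse, which accepts file-like objects as well as filenames and URLs. This is a minimal sketch, not a confirmed fix for this site: the USER_AGENT value and the helper names build_request and parse_with_user_agent are hypothetical, and whether the server accepts that particular header value is an assumption.

```python
from lxml import html

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2, as used in the question

# Hypothetical value; any User-Agent the server accepts would do.
USER_AGENT = 'Mozilla/5.0 (compatible; example-fetcher)'


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def parse_with_user_agent(url, user_agent=USER_AGENT):
    """Fetch url ourselves, then let lxml parse the open response."""
    response = urlopen(build_request(url, user_agent))
    try:
        # html.parse reads from the file-like response object.
        return html.parse(response)
    finally:
        response.close()
```

parse_with_user_agent(url) returns an lxml.etree._ElementTree, the same type html.parse(url2) produced in the question, so code written against html.parse should not need further changes.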
