python lxml.html.parse not reading url
Why does html.parse(url) fail, when fetching with requests and then html.fromstring works, and html.parse(url2) also works? lxml 3.4.2
Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> from lxml import html
>>> url = 'http://www.oddschecker.com'
>>> page = requests.get(url).content
>>> tree = html.fromstring(page)
>>> html.parse(url)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
html.parse(url)
File "C:\program files\Python27\lib\site-packages\lxml\html\__init__.py", line 788, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3301, in lxml.etree.parse (src\lxml\lxml.etree.c:72453)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105915)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106214)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105213)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100163)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94286)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95722)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94754)
IOError: Error reading file 'http://www.oddschecker.com': failed to load HTTP resource
>>> url2 = 'http://www.google.com'
>>> html.parse(url2)
<lxml.etree._ElementTree object at 0x00000000033BAF88>
html.parse bails out when the HTTP status is not 200. Check the return status of http://www.oddschecker.com.
To add some clarification to @michael_stackof's answer: this particular URL returns a 403 Forbidden status code if no User-Agent header is supplied. According to lxml's source code, it uses urllib2.urlopen() without setting a User-Agent header, which results in the 403 and hence the failed to load HTTP resource error.
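The usual workaround, then, is to fetch the page yourself with an explicit User-Agent and hand the bytes to html.fromstring instead of calling html.parse(url). A minimal sketch using Python 3's urllib.request (the successor to the urllib2 module lxml uses here); the User-Agent value is just an illustrative choice:

```python
from urllib.request import Request

# Build a request carrying an explicit User-Agent header; urllib, like
# urllib2, sends no browser-like User-Agent unless you add one yourself.
req = Request('http://www.oddschecker.com',
              headers={'User-Agent': 'Mozilla/5.0'})

# urllib normalizes stored header names to capitalized form.
print(req.get_header('User-agent'))  # Mozilla/5.0

# The actual fetch + parse would then be (needs network access):
# from urllib.request import urlopen
# from lxml import html
# tree = html.fromstring(urlopen(req).read())
```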
requests, on the other hand, supplies a default User-Agent header if one isn't passed explicitly:
>>> requests.get(url).request.headers['User-Agent']
'python-requests/2.3.0 CPython/2.7.6 Darwin/14.1.0'
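That default can be inspected without any network round-trip by preparing a request instead of sending it; a small sketch (the URL is just the one from the question):

```python
import requests

# Prepare (but don't send) a request; the session merges in the default
# headers requests would attach, including its User-Agent.
prepared = requests.Session().prepare_request(
    requests.Request('GET', 'http://www.oddschecker.com'))
print(prepared.headers['User-Agent'])  # e.g. 'python-requests/2.31.0'
```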
To prove the point, set the User-Agent header to None and compare:
>>> requests.get(url).status_code
200
>>> requests.get(url, headers={'User-Agent': None}).status_code
403