etree raises an error when used with urllib
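I am trying to parse an HTML table into Python (2.7) using the solutions in this post.
When I try either of the first two solutions with a string (as in the example), it works fine.
However, when I call etree.XML on an HTML page that I read with urllib, I get an error. I checked this for each solution, and the variable I pass is a str in every case.
For the following code: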
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
    table = etree.XML(s)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48
For this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
    table = ET.XML(s)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
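Although they look like the same kind of markup, HTML is not as strict as XML. XML must be well-formed and follow the markup rules (matching opening/closing nodes, escaped entities, etc.), so content that is acceptable as HTML may be rejected by an XML parser. Both of your errors point at line 8 of the page, where a <link> element inside <head> has no closing tag: that is valid HTML but not well-formed XML.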
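To make the difference concrete, here is a minimal sketch that uses only the standard library (Python 3). The `snippet` string, the `parses_as_xml` helper, and the `TagCollector` class are made up for illustration; the snippet only imitates the kind of unclosed `<link>` tag that the tracebacks complain about and is not taken from the real page.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical markup: <link> is a void element in HTML and has no closing tag.
snippet = (
    '<html><head><link rel="stylesheet" href="style.css"></head>'
    '<body><p>ok</p></body></html>'
)


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag seen by the lenient HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The strict XML parser rejects the unclosed <link>.
xml_ok = parses_as_xml(snippet)

# The same text with a self-closed <link/> is well-formed XML.
xml_ok_when_closed = parses_as_xml(snippet.replace('.css">', '.css"/>'))

# The HTML parser reads the original text without complaint.
collector = TagCollector()
collector.feed(snippet)
collector.close()

print(xml_ok, xml_ok_when_closed, collector.tags)
```

The XML parser fails on the original snippet for the same reason as in the question (the closing `</head>` arrives while `<link>` is still open), while the HTML parser simply treats `<link>` as a tag that needs no closing counterpart.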
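Therefore, consider parsing the page with lxml's etree.HTML() function instead. You can then use XPath to target the specific region you intend to extract or use. Below is an example (written for Python 3, hence urllib.request) that attempts to pull the main table of the page. Note that the page uses quite a few nested tables.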
from lxml import etree
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))
# PARSE PAGE
htmlpage = etree.HTML(s)
# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print(row)
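The XPath predicate above can be tried offline before running it against the live site. The following is only a sketch under stated assumptions: the `sample` markup is invented to imitate a layout table wrapping a data table whose header cell contains `font/a/b` with the text `Rank`, and the real page may be structured differently. It requires lxml to be installed.

```python
from lxml import etree

# Hypothetical page: an outer layout table that wraps the data table.
sample = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<table><tr><td>
  <table>
    <tr>
      <td><font><a href="#"><b>Rank</b></a></font></td>
      <td><font><a href="#"><b>Title</b></a></font></td>
    </tr>
    <tr><td>1</td><td>Example Film</td></tr>
  </table>
</td></tr></table>
</body></html>
"""

# etree.HTML tolerates the unclosed <link> that etree.XML would reject.
page = etree.HTML(sample)

# Select only tables whose own rows contain the 'Rank' header cell.
tables = page.xpath("//table[tr/td/font/a/b='Rank']")

# Collect the text of each cell, row by row, instead of a flat text list.
rows = [
    ["".join(td.itertext()).strip() for td in tr.xpath("./td")]
    for tr in tables[0].xpath("./tr")
]

for row in rows:
    print(row)
```

Because the predicate starts at `tr` directly under each candidate `table`, the outer layout table is not selected: its only cell holds another table rather than a `font` element. Iterating over `tr` and `td` keeps the row structure, which is usually easier to work with than the flat list returned by `//text()`.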