Python XML parsing, lxml, urllib.request
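I'm having some trouble parsing an XML file retrieved from a URL. My goal is to turn this XML file into a well-structured object so that I can easily retrieve its data. My current code results in the following error: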
>>> tree = etree.parse(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72421)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105883)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106182)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105181)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100131)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94254)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95690)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94722)
OSError: Error reading file '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
Code:
(scraper) gmf:scr gmf$ python3
Python 3.4.2 (default, Jan 2 2015, 20:14:16)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> from lxml import etree
>>>
>>> opener = urllib.request.build_opener()
>>> f = opener.open('https://nordfront.se/feed')
>>> data = f.read()
>>> f.close()
>>> tree = etree.parse(data)
Thanks a lot for your help.
According to the docstring (see help(ET.parse)), ET.parse expects its first argument to be one of the following:
a file name/path
import lxml.etree as ET
tree = ET.parse(filename)
a file object
with open('data.xml') as f:
    tree = ET.parse(f)
a file-like object
import io
tree = ET.parse(io.BytesIO(data))
a URL using the HTTP or FTP protocol
import urllib.request
opener = urllib.request.build_opener()
tree = ET.parse(opener.open(url))
This last option, passing opener.open(url) directly to ET.parse instead of first defining data = f.read(), is probably the one you want.
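As a rough sketch of the file-like options above: the sample bytes below are a made-up stand-in for what f.read() returned in the question, and parse_url is a hypothetical helper name. It uses urllib.request.urlopen rather than an explicit opener, which is just a convenience here. The try/except import is also an assumption, added so the sketch still runs with the standard library's ElementTree if lxml is not installed.

```python
import io
import urllib.request

try:
    from lxml import etree as ET
except ImportError:
    # Assumption: fall back to the standard library, which has the same
    # parse()/fromstring() entry points used here.
    import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by f.read() in the question.
data = b'<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel/></rss>'

# ET.parse(data) would treat these bytes as a file name, which is what
# the "Error reading file" message in the question's traceback reflects.
# Wrapping the bytes in a file-like object parses the content instead.
tree = ET.parse(io.BytesIO(data))


def parse_url(url):
    """Open url and pass the response (a file-like object) to ET.parse."""
    with urllib.request.urlopen(url) as response:
        return ET.parse(response)
```

With the feed from the question, this would presumably be called as tree = parse_url('https://nordfront.se/feed'), with no intermediate data = f.read() step.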
Alternatively, when you already have the XML in a string data, you can use ET.fromstring:
root = ET.fromstring(data)
Note, however, that parse returns an ElementTree, whereas fromstring returns an Element.
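To illustrate the difference in return types, here is a small sketch. The sample document is invented for the example, and the fallback import is an assumption so it also runs without lxml. Calling getroot() on the parsed tree gives the same kind of object that fromstring returns directly.

```python
import io

try:
    from lxml import etree as ET
except ImportError:
    # Assumption: the standard library module behaves the same way here.
    import xml.etree.ElementTree as ET

# Invented sample feed, standing in for the downloaded data.
data = (b'<?xml version="1.0" encoding="UTF-8"?>'
        b'<rss version="2.0"><channel><title>Example feed</title></channel></rss>')

tree = ET.parse(io.BytesIO(data))   # an ElementTree wrapping the whole document
root_from_tree = tree.getroot()     # the root Element inside that tree
root = ET.fromstring(data)          # the root Element, returned directly

title = root.findtext('channel/title')
print(root_from_tree.tag, root.tag, title)  # rss rss Example feed
```

So if you need an Element after calling parse, call tree.getroot() first.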