Parsing bad XHTML
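My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource. I want to process its text data and load it into a database for use on another, simpler website that I will create.
My only problem is the awful XHTML formatting. The W3C XHTML validation raises 318 errors and 54 warnings. Even an HTML Tidier I found can't fix everything.
I'm using Python 3.6.7, and the page I'm parsing is ASP. I've tried both LXML and the Python XML module, and both failed.
Can anyone recommend any other tidiers or modules? Or will I have to fall back on some kind of raw text manipulation (yuck!)?
My code:
LXML: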
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
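lxml is probably treating your input as XML, and etree.parse uses a strict XML parser that rejects things like a bare "&" or an entity without its ";".
Try the lenient HTML parser instead, like this: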
from lxml import html

# doc.cssselect() relies on the separate cssselect package being installed
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also, instead of using "HTML Tidiers", you can just open the page in Chrome and copy the HTML from the Elements panel.
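If installing lxml is not an option, the same idea (use an HTML parser rather than an XML one) can be sketched with only the standard library. This is a minimal sketch, not a drop-in solution: SAMPLE is made-up markup that imitates the two errors from the question (an HTML-only entity and a bare "&" in a URL), and TextCollector, xml_parse_fails and extract_text are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample imitating the question's problems:
# "&nbsp;" is undefined in plain XML, and the href contains a bare "&".
SAMPLE = (
    "<html><head><title>Glossary</title></head><body>"
    '<p>Allegro&nbsp;&amp; Andante <a href="x.asp?a=1&b=2">link</a></p>'
    "</body></html>"
)


class TextCollector(HTMLParser):
    """Collect text nodes; the default convert_charrefs=True decodes entities."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_parse_fails(markup):
    """Return True if the strict XML parser rejects the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return True
    return False


def extract_text(markup):
    """Return all text content using the lenient HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    print(xml_parse_fails(SAMPLE))
    print(repr(extract_text(SAMPLE)))
```

html.parser only gives you a stream of events rather than a tree, so for anything beyond simple text extraction the lxml.html approach above is likely more convenient.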