How can I parse a URL when I get the error "xml.parsers.expat.ExpatError: mismatched tag"?
I want to extract all the links inside the DOCUMENT elements of this web page:
import urllib.request
url = 'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index-headers.html'
ob=urllib.request.urlopen(url).read()
from xml.dom import minidom
xmldoc = minidom.parseString(ob)
This fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: mismatched tag: line 876, column 23
Maybe it is a malformed XML file. How can I load it with minidom?
I don't know what this file is, but it is not XML, and it cannot be parsed with an XML parser.
Right, it is not an XML file. Parse it as HTML with lxml.html instead, and select all the URLs with an XPath expression.
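import urllib.request
import lxml.html  # third-party package: pip install lxml

url = 'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index-headers.html'
ob = urllib.request.urlopen(url).read()
# The HTML parser tolerates unclosed tags that make the XML parser fail
doc = lxml.html.fromstring(ob)
# Select every <a> that is a direct child of a <pre> element
links = doc.xpath('//pre/a')
for link in links:
    print(link.attrib['href'])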
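If installing lxml is not an option, the same idea can be sketched with only the standard library. This is a swapped-in technique, not the method from the answer above: it uses html.parser instead of lxml.html. The SAMPLE string, the PreLinkCollector class and the two helper functions are hypothetical names made up for illustration, and the sample is not the real SEC page. It only assumes the page contains unclosed tags inside a <pre> block, which would explain the "mismatched tag" error.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample (an assumption, not the real SEC page):
# <TYPE> is never closed, so the markup is not well-formed XML.
SAMPLE = """<html><body><pre>
<TYPE>10-K
<a href="/Archives/a.htm">a.htm</a>
<a href="/Archives/b.htm">b.htm</a>
</pre></body></html>"""


class PreLinkCollector(HTMLParser):
    """Collect href values of <a> tags that appear inside a <pre> block."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0  # how many <pre> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "a" and self.pre_depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1


def xml_parse_fails(text):
    """Return True if minidom rejects the text as XML."""
    try:
        minidom.parseString(text)
    except ExpatError:
        return True
    return False


def extract_links(text):
    """Return the hrefs of all <a> tags found inside <pre> blocks."""
    parser = PreLinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


print(xml_parse_fails(SAMPLE))
print(extract_links(SAMPLE))
```

Unlike the XPath '//pre/a', which matches only direct children of <pre>, this sketch collects every <a> nested anywhere inside a <pre>. For a real page, decode the bytes returned by urlopen(...).read() to str before passing them to extract_links.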