Web Scraping Hits an XML Parsing Error: How to Fix It?
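I have been running a web scraping application for about a year without any real problems. This morning I ran the program and got a mismatched tag error from xml.etree. It had never happened before today, so I am a bit confused about why it suddenly started. Here is my code: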
import requests
from xml.etree import ElementTree as ET
import json
import datetime as dt
import time
from dateutil import parser
from bs4 import BeautifulSoup
from xml.parsers import expat

url = 'https://www5.fdic.gov/cra/WebServices/DBService.asmx/callWS'
r = requests.post(url, data={"functionName":"SearchCRA","parmsJSON":"{\"Appl_Number\":\"\",\"Appl_Type\":\"\",\"PSTALP\":\"\",\"SUPRV_FDICDBS\":\"09\",\"BANK_NAME\":\"\"}"})
soup = BeautifulSoup(r.text, 'html.parser')
root = ET.fromstring(r.content)
data = json.loads(root.text)
today = dt.date.today()
lastweek = today - dt.timedelta(7)
date = lastweek.strftime("%m/%d/%Y")  # one week before today, as mm/dd/yyyy
mylist = []
for result in data['Result']:
    d = parser.parse(result['Appl_Recd_YMD'])
    f_d = d.strftime("%m/%d/%Y")
    if f_d >= date:
        new_status = "***NEW***"
    else:
        new_status = " "
    if 'INTERIM' not in result['Instname'] and result['Inst_Rle_Cde'] == '1' and result['Appl_Type'] in ('REORG ', 'MERGER', 'RELMO', 'FDINEW'):
        output4 = 'Date: {} Application Number: {} Institution: {} State: {} Type: {} Link: https://www5.fdic.gov/cra/cram03.aspx?inApplNb={}&inApplType={} {}'.format(f_d, result['Appl_Number'], result['Instname'].strip(),result['Pstalp'], result['Appl_Type'],result['Appl_Number'], result['Appl_Type'], new_status)
        item = output4
        mylist.append(item)
        slist = sorted(mylist)
        print(len(mylist), end="")
        print('.)', end="")
        print(output4)
        global slist2
        slist2 = slist
This is the error I get:
  File "C:/Users/d1rjr03/PycharmProjects/Discovery/FDIC.py", line 16, in <module>
    root = ET.fromstring(r.content)
  File "C:\Python37\lib\xml\etree\ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: mismatched tag: line 7, column 2
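Is this error a problem with the website, or a problem with how I am accessing it?
I don't have much experience with xml.etree, so I'm not sure where to start fixing this. Any idea why it would suddenly start happening?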
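Look at the value of r.content before you pass it to ET.fromstring: is it what you expect? Print it to the screen, write it to a log file, or run the code under a debugger and inspect the value. The error suggests the value is not what you think it is.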
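One way to make that inspection routine is to wrap the parse in a small helper that reports what actually arrived. This is only a sketch: parse_xml_payload is a hypothetical name, not part of requests or xml.etree, and the checks assume the service normally answers with HTTP 200 and a non-HTML content type. A "mismatched tag" error is what you would typically see if an HTML page (for example an error or redirect page) were fed to the XML parser, though that is not the only possible cause.

```python
from xml.etree import ElementTree as ET


def parse_xml_payload(body: bytes, status_code: int = 200, content_type: str = ""):
    """Parse body as XML, or raise ValueError describing what actually arrived.

    Hypothetical helper: the name and the specific checks are assumptions
    made for illustration, not an API of requests or xml.etree.
    """
    # Keep a short, safely decoded preview of the body for error messages.
    preview = body[:200].decode("utf-8", errors="replace")

    if status_code != 200:
        raise ValueError(
            f"unexpected HTTP status {status_code}; body starts with: {preview!r}"
        )
    if "html" in content_type.lower():
        raise ValueError(
            f"got {content_type!r} instead of XML; body starts with: {preview!r}"
        )
    try:
        return ET.fromstring(body)
    except ET.ParseError as exc:
        raise ValueError(
            f"body is not well-formed XML ({exc}); body starts with: {preview!r}"
        ) from exc
```

With the code from the question, the call would look like root = parse_xml_payload(r.content, r.status_code, r.headers.get("Content-Type", "")), so a failure shows the first 200 characters of the response instead of only a line and column number.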
It turned out the request URL had changed from 'www5' to 'www7'. A very simple answer, and frankly I should have found it before posting.
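A minimal sketch of the corrected request setup, assuming that only the host changed and that the path and form fields from the question stayed the same (the thread confirms the www5 to www7 change, nothing more). CRA_URL and build_search_form are hypothetical names introduced here; keeping the host in one constant means a future change touches a single line.

```python
import json

# Host updated from www5 to www7 as described above. The path is copied from
# the question and is assumed to be unchanged.
CRA_URL = "https://www7.fdic.gov/cra/WebServices/DBService.asmx/callWS"


def build_search_form(supervisor_code: str = "09") -> dict:
    """Build the form fields for the SearchCRA call used in the question.

    Hypothetical helper: json.dumps replaces the hand-escaped JSON string so
    the quoting cannot drift out of sync with the field names.
    """
    params = {
        "Appl_Number": "",
        "Appl_Type": "",
        "PSTALP": "",
        "SUPRV_FDICDBS": supervisor_code,
        "BANK_NAME": "",
    }
    return {"functionName": "SearchCRA", "parmsJSON": json.dumps(params)}
```

The request itself would then be r = requests.post(CRA_URL, data=build_search_form()), with the rest of the script unchanged.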