AttributeError: 'HTMLParser' object has no attribute 'unescape'
I'm trying to extract an HTML table, but it returns some errors and I don't know why.
I'd really appreciate some help.
Code:
from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows
# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")
# Blank out any tables nested inside the main one
for tag in tabela.find_all('table'):
    _ = tag.replace_with('')
soup_tr = tabela.find_all("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]
# Rebuild a single flat <table> from the remaining rows
s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
# Strip HTML comments
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))
The error output is as follows:
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
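This doesn't really fix your error, but there are easier ways to parse a table from a website than the approach you've started with.
Here's one of them: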
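import pandas as pd
import requests
from io import StringIO
page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
# read_html returns a list of DataFrames, one per <table> it finds
df = pd.read_html(StringIO(page.text), flavor="bs4")
print(df)
# to_csv returns None when given a path, so its result is not assigned
pd.concat(df).to_csv("your_magnificent_table.csv", index=False)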
Output:
[ Concurso Data Sorteio 1ª Dezena ... Rateio_Quadra Acumulado Valor_Acumulado
0 1 11/03/1996 4 ... 33021 SIM 1.714.65023
1 2 18/03/1996 9 ... 20891 NÃO 750.04891
2 3 25/03/1996 10 ... 15301 NÃO 000
3 4 01/04/1996 1 ... 18048 SIM 717.08075
4 5 08/04/1996 1 ... 9653 SIM 1.342.48885
.. ... ... ... ... ... ... ...
397 398 21/09/2002 28 ... 14129 NÃO 000
398 399 25/09/2002 59 ... 22501 SIM 5.676.17141
399 400 28/09/2002 29 ... 20314 SIM 6.869.04791
400 401 02/10/2002 50 ... 28818 SIM 7.859.38989
401 402 05/10/2002 27 ... 14808 SIM 9.248.37354
[402 rows x 16 columns]]
Or, if you prefer, here's a .csv file (actually, part of it):
By the way, parsing HTML with regular expressions is rather frowned upon and considered a poor choice. Here's more on the topic.
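As for the AttributeError itself: the last frame of the traceback shows the installed rows package calling HTMLParser().unescape at import time. That method was deprecated in Python 3.4 and removed in Python 3.9; the standard-library replacement is html.unescape. Below is a minimal sketch of a workaround, assuming you are on Python 3.9 or later and cannot change the installed rows version. Putting the attribute back by hand is my own suggestion, not something the rows project documents, and it has to run before "import rows". Checking whether a newer rows release already avoids the removed method is probably the cleaner route.

```python
import html
from html.parser import HTMLParser

# html.unescape is the supported way to decode HTML entities.
decoded = html.unescape("Sorteio &amp; Dezena")
print(decoded)

# Assumed workaround (not part of rows): restore the removed method so that
# old code calling HTMLParser().unescape keeps working on Python 3.9+.
if not hasattr(HTMLParser, "unescape"):
    HTMLParser.unescape = staticmethod(html.unescape)

# After this point, "import rows" should no longer fail on that line.
```

On Python versions older than 3.9 the hasattr check is true and the patch is skipped, so the snippet is harmless there.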