AttributeError: 'HTMLParser' object has no attribute 'unescape'

Question

I am trying to extract an HTML table, but the script fails with an error and I don't know why.

I could really use some help.

Code:

from bs4 import BeautifulSoup
from io import BytesIO
import requests
import datetime
import re
import rows


# date = datetime.datetime.strptime("2013-1-25", '%Y-%m-%d').strftime('%m/%d/%y')
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM'

response = requests.get(url)
html = response.content


soup = BeautifulSoup(html, 'lxml')
tabela = soup.find("table")

for tag in tabela.find_all('table'):
    _ = tag.replaceWith('')


soup_tr = tabela.findAll("tr")
lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]


s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)


table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))

The error output is:

  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\megasena.py", line 6, in <module>
    import rows
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\__init__.py", line 22, in <module>
    import rows.plugins as plugins
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module>
    from . import plugin_html as html
  File "C:\Users\atendimentopcp300_01\Desktop\Antony\Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module>
    unescape = HTMLParser().unescape
AttributeError: 'HTMLParser' object has no attribute 'unescape'
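
A note on the last frame of the traceback: `HTMLParser.unescape` was deprecated in Python 3.4 and removed in Python 3.9, and `html.unescape` is the standard-library replacement. The sketch below only illustrates that replacement. It assumes the interpreter is Python 3.9 or newer (the traceback does not state the version), and the sample string is made up for the example.

```python
import html
from html.parser import HTMLParser

# On Python 3.9+ the old method is gone, which is what the traceback reports.
print(hasattr(HTMLParser(), "unescape"))

# html.unescape does the same job as the removed method.
unescape = html.unescape

# Hypothetical sample text, not taken from the lottery page.
decoded = unescape("S&atilde;o Paulo &amp; 1&ordf; Dezena")
print(decoded)  # São Paulo & 1ª Dezena
```

Whether a given release of `rows` already uses `html.unescape` is not something the traceback shows, so check the installed version before assuming an upgrade fixes it.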

Answer 1

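This doesn't actually fix your error, but there are easier ways to parse a table from a website than the approach you've started with.

Here is one of them: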

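from io import StringIO

import pandas as pd
import requests

page = requests.get("http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_MEGA.HTM")
# read_html returns a list of DataFrames, one per <table> it finds.
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in recent pandas.
tables = pd.read_html(StringIO(page.text), flavor="bs4")
print(tables)
# to_csv returns None when given a path, so its result is not assigned.
pd.concat(tables).to_csv("your_magnificent_table.csv", index=False)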

Output:

[     Concurso Data Sorteio  1ª Dezena  ...  Rateio_Quadra  Acumulado  Valor_Acumulado
0           1   11/03/1996          4  ...          33021        SIM      1.714.65023
1           2   18/03/1996          9  ...          20891        NÃO        750.04891
2           3   25/03/1996         10  ...          15301        NÃO              000
3           4   01/04/1996          1  ...          18048        SIM        717.08075
4           5   08/04/1996          1  ...           9653        SIM      1.342.48885
..        ...          ...        ...  ...            ...        ...              ...
397       398   21/09/2002         28  ...          14129        NÃO              000
398       399   25/09/2002         59  ...          22501        SIM      5.676.17141
399       400   28/09/2002         29  ...          20314        SIM      6.869.04791
400       401   02/10/2002         50  ...          28818        SIM      7.859.38989
401       402   05/10/2002         27  ...          14808        SIM      9.248.37354

[402 rows x 16 columns]]

Or, if you prefer, here is a .csv file (just part of it):

By the way, parsing HTML with regular expressions is generally frowned upon and considered a poor choice. Here's more on the topic.
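
To illustrate the point about regular expressions: the comment-stripping step that the question does with `re.sub` can be handed to the parser instead. This is a minimal sketch, assuming BeautifulSoup 4 with the built-in `html.parser` backend. The helper name `strip_comments` and the sample markup are made up for the example.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(markup: str) -> str:
    """Return the markup with all HTML comments removed (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # Comment is a NavigableString subclass, so a string filter can select it.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)


# Hypothetical sample, not the real lottery page.
sample = "<table><!-- note --><tr><td>1</td></tr></table>"
cleaned = strip_comments(sample)
print(cleaned)  # <table><tr><td>1</td></tr></table>
```

Because the parser knows where a comment starts and ends, this does not depend on the comment's contents the way a hand-written pattern does.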

python

rows

beautifulsoup