Some problems with parsing HTML pages

Question

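I'm trying to build a program that parses an HTML page. In this case I can't get the URL directly, so I ask the user to download the HTML page and I work with the saved file.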

# -*- coding: UTF-8 -*-

from re import findall
from bs4 import BeautifulSoup


# INPUT
def inside(html_path, errors='ignore'):
    with open(html_path, errors=errors) as fp:
        soup = BeautifulSoup(fp, features='lxml')
    return soup


def pairing(html_path, errors='ignore') -> dict:
    use_dict = {}

    soup = inside(html_path=html_path, errors=errors)

    for pair in zip(
            soup.find_all('div', {"class": "audio_row__performers"}),
            soup.find_all('span', {"class": "audio_row__title_inner _audio_row__title_inner"}),
            soup.find_all('span', {"class": "audio_row__title_inner_subtitle _audio_row__title_inner_subtitle"})
    ):
        """
        pair[0] - musician(-s), 
        pair[1] - track_name, 
        pair[2] - subtitle for track(if any)
        """

        track_author = pair[0].find('a').text

        pair_2_str = str(pair[2])

        regex = "(?<=>).*?(?=<)"
        add_meta = findall(regex, pair_2_str)[0]

        track_name = pair[1].text + f" {add_meta}"

        use_dict.update({track_author: track_name})
    return use_dict

After running it with errors='replace', I get this result:

('The Offspring', 'Dividing By Zero ')
('ACDC', 'Hightway to Hell ')
('����(�.�. ���)', '������ ����� �� ������ ')
('Haddaway', "What is love, baby don't hurt me. ")
("Guns N' Roses", 'Catcher in the rye ')
('Queen', 'Dont stop me now (�������� � ������)  ')
('The Subways', 'Rock & Roll Queen ')
('Fetty Wap', 'Trap Queen ')

I thought I had the wrong page, so I looked into the page's markup and, unfortunately, found this:

      <div class="audio_row__performer_title">
        <div onmouseover="setTitle(this)" class="audio_row__performers"><a href="https://vk.com/audio?performer=1&amp;q=%D0%9A%D0%B8%D0%BD%D0%BE%28%D0%92.%D0%A0.%20%D0%A6%D0%BE%D0%B9%29">Кино(В.Р. Цой)</a></div>
        <div class="audio_row__title _audio_row__title" onmouseover="setTitle(this)">
          <span class="audio_row__title_inner _audio_row__title_inner">Группа крови на рукаве</span>
          <span class="audio_row__title_inner_subtitle _audio_row__title_inner_subtitle"></span>
        </div>
      </div>
      <div class="audio_row__info _audio_row__info"><div class="audio_row__duration audio_row__duration-s _audio_row__duration">3:59</div></div>
    </div>

This means I have the right page, but the decode function in bs4 can't handle these symbols (I get this message when I run it with errors='strict'):

Traceback (most recent call last):
  File "/home/roman/VKMusic/ParserEXE.py", line 5, in <module>
    use_dict = pairing(html_path=html_path,errors='strict')
  File "/home/roman/VKMusic/Main1.py", line 17, in pairing
    soup = inside(html_path=html_path, errors=errors)
  File "/home/roman/VKMusic/Main1.py", line 10, in inside
    soup = BeautifulSoup(fp, features='lxml')
  File "/home/roman/anaconda3/lib/python3.8/site-packages/bs4/__init__.py", line 306, in __init__
    markup = markup.read()
  File "/home/roman/anaconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 457: invalid continuation byte

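The HTML page in this case was downloaded by me with Google Chrome on Linux (Ubuntu) and by my friend on Windows 7. Both give the same result. I also tried Firefox and ran into the same error.
I need my code to parse the whole HTML, including the Cyrillic symbols.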

Link to an example HTML page: https://drive.google.com/file/d/1FKhTlVErjAKI9L2iedJtmHaXpoyCBMdl/view?usp=sharing

Answer 1

In the function 'inside', replace

with open(html_path, errors=errors) as fp:

with

with open(html_path, errors=errors, encoding='cp1251') as fp:
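
To illustrate why this change helps, here is a minimal sketch, assuming the saved page stores its text as Windows-1251 (cp1251) bytes. Without an explicit encoding, open() falls back to the locale default, which the traceback in the question shows to be UTF-8 on that machine. The sample string comes from the markup in the question; the helper name try_strict_utf8 is made up for this example.

```python
# Sample text taken from the markup shown in the question.
original = "Кино(В.Р. Цой)"

# Assumption: the saved page holds this text as cp1251 bytes.
raw = original.encode("cp1251")


def try_strict_utf8(data: bytes) -> str:
    """Return the decoder's error reason, or an empty string on success."""
    try:
        data.decode("utf-8")  # strict is the default error handler
    except UnicodeDecodeError as exc:
        return exc.reason
    return ""


# errors='replace' swaps every undecodable byte for U+FFFD,
# which is what shows up as the question-mark boxes in the output.
mangled = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in restores the text.
recovered = raw.decode("cp1251")

print(try_strict_utf8(raw))
print(mangled)
print(recovered)
```

The same bytes are valid in both runs; only the codec used to interpret them differs, which is why passing encoding='cp1251' to open() is enough.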

Output after this change:

...

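'Black Lake': 'In 6 Days', 'Black Lakes and Maria Nemtseva': 'Let Go', 'Charles Gounod': 'Opera Faust, Walpurgis Night - Antique Dance ["Masks" ' 'opera - 2] ', 'Chopin': 'Etude in C minor "Revolutionary" No. 12', 'hyperborea': 'Forerunner', '♫Shaman King': 'Look around, look back, the spirits want to contact ' 'you. The world is not what it seems, every wonder ' 'is surrounded by... Everything around is subject to the eye, you must make your choice ' 'yourself, meet your destiny - to become a shaman. King, ' 'of all shamans, king if q'}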
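
If hardcoding cp1251 feels too specific, one alternative is to read the file as bytes and use the charset the page declares about itself. The following is a rough sketch using only the standard library. It assumes the saved page carries a meta tag with a charset (not verified for the linked file), and the names META_CHARSET, sniff_declared_encoding and read_html are invented for this example.

```python
import codecs
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="content-type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the codec name declared in the first 4 KiB, else the default."""
    match = META_CHARSET.search(raw[:4096])
    if not match:
        return default
    name = match.group(1).decode("ascii")
    try:
        # Normalise aliases such as 'windows-1251' to Python's codec name.
        return codecs.lookup(name).name
    except LookupError:
        return default


def read_html(raw: bytes) -> str:
    """Decode raw page bytes using the declared (or default) encoding."""
    return raw.decode(sniff_declared_encoding(raw))


# Hypothetical page fragment, encoded the way the real file is assumed to be.
sample = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=windows-1251"></head>'
    "<body>Группа крови на рукаве</body></html>"
).encode("cp1251")

print(sniff_declared_encoding(sample))
print(read_html(sample))
```

The decoded string can then be handed to BeautifulSoup as usual. BeautifulSoup also accepts raw bytes and a from_encoding argument, so opening the file in 'rb' mode and letting it handle detection may work as well.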

python

unicode

parsing

beautifulsoup

html-parsing