Some problems with parsing HTML pages

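I'm trying to build a program that parses an HTML page. In this case I can't fetch the page by URL directly, so I ask the user to download the HTML page and then work with that file.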

# -*- coding: UTF-8 -*-

from re import findall
from bs4 import BeautifulSoup


# INPUT
def inside(html_path, errors='ignore'):
    with open(html_path, errors=errors) as fp:
        soup = BeautifulSoup(fp, features='lxml')
    return soup


def pairing(html_path, errors='ignore') -> dict:
    use_dict = {}

    soup = inside(html_path=html_path, errors=errors)

    for pair in zip(
            soup.find_all('div', {"class": "audio_row__performers"}),
            soup.find_all('span', {"class": "audio_row__title_inner _audio_row__title_inner"}),
            soup.find_all('span', {"class": "audio_row__title_inner_subtitle _audio_row__title_inner_subtitle"})
    ):
        """
        pair[0] - musician(-s), 
        pair[1] - track_name, 
        pair[2] - subtitle for track(if any)
        """

        track_author = pair[0].find('a').text

        pair_2_str = str(pair[2])

        regex = "(?<=>).*?(?=<)"
        add_meta = findall(regex, pair_2_str)[0]

        track_name = pair[1].text + f" {add_meta}"

        use_dict.update({track_author: track_name})
    return use_dict

After running it with errors='replace', I get this result:

('The Offspring', 'Dividing By Zero ')
('ACDC', 'Hightway to Hell ')
('����(�.�. ���)', '������ ����� �� ������ ')
('Haddaway', "What is love, baby don't hurt me. ")
("Guns N' Roses", 'Catcher in the rye ')
('Queen', 'Dont stop me now (�������� � ������)  ')
('The Subways', 'Rock & Roll Queen ')
('Fetty Wap', 'Trap Queen ')

I thought I had saved the wrong page, so I looked at the page source and, unfortunately, found this:

      <div class="audio_row__performer_title">
        <div onmouseover="setTitle(this)" class="audio_row__performers"><a href="https://vk.com/audio?performer=1&amp;q=%D0%9A%D0%B8%D0%BD%D0%BE%28%D0%92.%D0%A0.%20%D0%A6%D0%BE%D0%B9%29">Кино(В.Р. Цой)</a></div>
        <div class="audio_row__title _audio_row__title" onmouseover="setTitle(this)">
          <span class="audio_row__title_inner _audio_row__title_inner">Группа крови на рукаве</span>
          <span class="audio_row__title_inner_subtitle _audio_row__title_inner_subtitle"></span>
        </div>
      </div>
      <div class="audio_row__info _audio_row__info"><div class="audio_row__duration audio_row__duration-s _audio_row__duration">3:59</div></div>
    </div>

This means I have the right page, but the decode function in bs4 cannot handle these characters (I get this message when I run it with errors='strict'):

Traceback (most recent call last):
  File "/home/roman/VKMusic/ParserEXE.py", line 5, in <module>
    use_dict = pairing(html_path=html_path,errors='strict')
  File "/home/roman/VKMusic/Main1.py", line 17, in pairing
    soup = inside(html_path=html_path, errors=errors)
  File "/home/roman/VKMusic/Main1.py", line 10, in inside
    soup = BeautifulSoup(fp, features='lxml')
  File "/home/roman/anaconda3/lib/python3.8/site-packages/bs4/__init__.py", line 306, in __init__
    markup = markup.read()
  File "/home/roman/anaconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 457: invalid continuation byte
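
The traceback and the replacement characters in the earlier output are both consistent with the file being stored in a single-byte Cyrillic encoding while being read as UTF-8. The sketch below reproduces that behaviour on a four-letter sample. It assumes the encoding is cp1251 (Windows-1251), which the post does not confirm at this point, and the helper name `decode_as_utf8` is made up for illustration.

```python
# Assumption: the saved page is encoded as cp1251 (Windows-1251).
# "Кино" is the performer name from the HTML fragment shown earlier.
CP1251_BYTES = "Кино".encode("cp1251")  # four bytes, one per letter


def decode_as_utf8(raw: bytes, errors: str) -> str:
    """Decode raw bytes as UTF-8 with the given error handler (hypothetical helper)."""
    return raw.decode("utf-8", errors=errors)


# errors='replace' turns every undecodable byte into U+FFFD,
# which is the character printed as a question-mark diamond.
replaced = decode_as_utf8(CP1251_BYTES, "replace")

# errors='strict' raises instead of substituting.
try:
    decode_as_utf8(CP1251_BYTES, "strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Decoding with the encoding the bytes were written in recovers the text.
recovered = CP1251_BYTES.decode("cp1251")
```

Here `replaced` holds one U+FFFD per Cyrillic letter, and `strict_error.reason` is the same "invalid continuation byte" message as in the traceback. In cp1251 the byte 0xcc from the traceback is the letter "М".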
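
The HTML pages in this example were downloaded with Google Chrome, by me on Linux (Ubuntu) and by a friend on Windows 7. Both give the same result. I also tried Firefox and ran into the same error.
I need my code to parse the whole HTML, including the Cyrillic characters.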

Link to a sample HTML page: https://drive.google.com/file/d/1FKhTlVErjAKI9L2iedJtmHaXpoyCBMdl/view?usp=sharing

In the function 'inside', replace

with open(html_path, errors=errors) as fp:

with

with open(html_path, errors=errors, encoding='cp1251') as fp:
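
Hard-coding cp1251 works for this page, but a saved page from another site may well be UTF-8. A minimal sketch of one way to handle both cases is to read the file as bytes and try a short list of candidate encodings in order. The function name `read_html_text` and the candidate list are assumptions for illustration, not part of the original code.

```python
from pathlib import Path


def read_html_text(html_path, encodings=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes without errors.

    Hypothetical helper: UTF-8 is tried first because it rejects most
    non-UTF-8 input, while cp1251 accepts almost any byte sequence.
    """
    raw = Path(html_path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings!r} could decode {html_path!r}")
```

The returned text can be passed straight to `BeautifulSoup(text, features='lxml')`. Another option is to open the file in binary mode and pass `from_encoding='cp1251'` to `BeautifulSoup`, so that the parser does the decoding itself.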

Output:

...

'黑湖': '6 天后', 'Black Lakes and Maria Nemtseva':'放手', 'Charles Gounod':'Opera Faust',Walpurgis Night - Antique Dance [“面具” '歌剧 - 2"] ', '肖邦':'C小调练习曲“革命”第12号', 'hyperborea':'先行者', '♫通灵王':'环顾四周,回头看,灵魂与你联系' '想。世界不是它看起来的样子,每一个奇迹 '被...围绕。周围的一切都受制于眼睛,做出你的选择' '必须你自己,迎接你的命运 - 成为一名萨满。国王, ' '所有萨满,国王如果q'}