Cyrillic Encoding in Urllib Python 3.5

Question

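I am using Python 3.5 with Anaconda 2.4.0 and trying to parse a site with urllib and BeautifulSoup. I wrote some simple code, but it shows Cyrillic characters in the wrong encoding (the HTML page is encoded in windows-1251), so the output looks like this: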

[<td align="center" widh="30"><a href="/registration/"><img alt="\xd0\xa0\xd0\xb5\xd0\xb3\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8f" border="0" src="/images/pers.png"/></a></td>] and so on.

I have tried many ways of encoding this, but all of them failed. Can you help me?

Thanks in advance.

import urllib.request
from bs4 import BeautifulSoup



def get_html(url):
    response = urllib.request.urlopen(url)
    return response.read()


def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find('table')


    for row in table.find_all('tr')[1:]:
        cols=row.find_all('td')
        print(str(cols).encode('utf-8'))


def main():
    parse(get_html('http://www.prof-volos.ru/hair/shampoo/damaged/sale/1/'))


if __name__ == '__main__':
    main()

Answer 1

This works for me on Python 3.4:

import urllib.request
from bs4 import BeautifulSoup

def get_html(url):
    response = urllib.request.urlopen(url)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')

    for row in table.find_all('tr')[1:]:
        for col in row.find_all('td'):
            print(col.text)

def main():
    parse(get_html('http://www.prof-volos.ru/hair/shampoo/damaged/sale/1/'))

if __name__ == '__main__':
    main()

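Why does this work?

I think the problem is that you converted the whole Beautiful Soup result list to a string and encoded it right away, instead of separating the tags from their text. Printing the resulting bytes object shows escaped UTF-8 bytes rather than readable characters.

Beautiful Soup runs its UnicodeDammit encoding detection on the raw bytes, which detects CP1251 here and converts the document to Unicode automatically.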
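If the automatic detection ever guesses wrong, one option is to decode the response bytes yourself before handing them to the parser. The following is only a minimal sketch using the standard library: `decode_body` and `get_html_text` are hypothetical helper names, and the `windows-1251` fallback is an assumption taken from the question's description of the page.

```python
import urllib.request


def decode_body(raw, charset=None, fallback='windows-1251'):
    """Decode raw response bytes to str.

    `charset` is whatever the server declared (may be None). `fallback`
    is an assumption based on the question, which says the page is
    encoded in windows-1251.
    """
    return raw.decode(charset or fallback, errors='replace')


def get_html_text(url):
    # Hypothetical variant of get_html() that returns str instead of bytes.
    with urllib.request.urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None when the server does not declare one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline check with bytes encoded the way the page is said to be encoded.
sample = 'Регистрация'.encode('windows-1251')
print(decode_body(sample) == 'Регистрация')  # True
```

The decoded string can then be passed to BeautifulSoup in place of the raw bytes, so no detection step is involved.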

I made 2 changes:

Use html.parser instead of lxml (I am not sure whether this matters).

Print the text of each col instead of using the row.find_all result directly.
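The effect of the second change can be reproduced without Beautiful Soup. In this small sketch the variable `text` is a stand-in for a string the parser has already decoded correctly. Encoding it and printing the result shows the same escaped bytes as the `alt` attribute in the question, which suggests the text itself was fine and only the final `.encode('utf-8')` made it unreadable.

```python
# Stand-in for text that Beautiful Soup has already decoded to Unicode.
text = 'Регистрация'

encoded = text.encode('utf-8')

# print() on a bytes object shows its repr, with non-ASCII bytes escaped:
# b'\xd0\xa0\xd0\xb5\xd0\xb3\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8f'
print(encoded)

# Decoding the bytes gives back the original string.
print(encoded.decode('utf-8') == text)  # True
```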

Answer 2

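If this answer works for the original poster, I will delete my other answer. I now suspect an encoding interaction between the Python script (which uses UTF-8) and Windows (which uses some other encoding). Suggested solution: write the output to a file.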

import urllib.request
from bs4 import BeautifulSoup

def get_html(url):
    response = urllib.request.urlopen(url)
    return response.read()

def parse(html):
    lines = []
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')

    for row in table.find_all('tr')[1:]:
        for col in row.find_all('td'):
            lines.append(col.text)
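    return lines

def main():
    url = 'http://www.prof-volos.ru/hair/shampoo/damaged/sale/1/'
    # Write UTF-8 explicitly so the result does not depend on the
    # platform's default encoding.
    with open('ThisFileWillBeBlindlyOverwritten.txt', 'w', encoding='utf-8') as f:
        for line in parse(get_html(url)):
            # One cell per line.
            f.write(line + '\n')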

if __name__ == '__main__':
    main()
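To check whether the console encoding is really the culprit, it may help to test which encodings can represent the text at all. This is a rough sketch, not a diagnosis of the original poster's machine: `printable_in` is a hypothetical helper, and `cp1252` is used only as an example of a Windows code page without Cyrillic letters.

```python
import sys


def printable_in(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


word = 'Регистрация'
print(printable_in(word, 'utf-8'))         # True
print(printable_in(word, 'windows-1251'))  # True
print(printable_in(word, 'cp1252'))        # False

# The encoding print() uses for this console; it may differ from UTF-8.
print(sys.stdout.encoding)
```

If the value reported by `sys.stdout.encoding` cannot represent Cyrillic, printing will fail or be garbled even though the parsed text is correct, and writing to a file with an explicit encoding sidesteps the console.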

