Cleaning/decoding scraped text in beautifulsoup

Question

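I have a problem converting text to csv after scraping. The thing is, some French letters in the .csv file end up looking like "©", "Ă‰", etc. How can I decode them so that they appear as English letters? Or get them scraped into the file properly?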

from urllib import request
from bs4 import BeautifulSoup as BS
import pandas as pd

html = request.urlopen("https://en.wikipedia.org/wiki/Jean_Dieudonn%C3%A9")
bs = BS(html.read(), 'html.parser')

# Fall back to an empty string when the element is missing
try:
    name = bs.find('h1').text
except AttributeError:
    name = ''
try:
    known = bs.select_one('th:contains("Known")').next_sibling.get_text('\n').split('\n') #ends up with even more weird signs
except AttributeError:
    known = ''

x = {'name': name, 'known for': known}
# Build the one-row frame directly (DataFrame.append was removed in pandas 2.0)
data = pd.DataFrame([x], columns=['name', 'known for'])
data.to_csv('files.csv', sep=",", index=True)

Thanks for any ideas.

Answer 1

You can simply encode your data with utf-8-sig:

data.to_csv('files.csv', sep=",", index=True, encoding='utf-8-sig')
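
To see where the odd characters come from, it can help to reproduce them. The following is only a sketch under an assumption: that the CSV was written as UTF-8 and then opened by a program that guessed a legacy Windows code page. The question does not say which program or code page was involved; cp1250 is used here purely because it reproduces the "Ă‰" shown in the question. The helper names garble and ungarble are made up for this illustration.

```python
def garble(text, wrong_codec="cp1250"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def ungarble(text, wrong_codec="cp1250"):
    """Reverse garble(): recover the UTF-8 bytes and decode them correctly."""
    return text.encode(wrong_codec).decode("utf-8")


# Each accented letter is two bytes in UTF-8, so it shows up as two characters
print(garble("é"))
print(garble("É"))
print(ungarble(garble("Jean Dieudonné")))
```

The data in the file is not damaged in this scenario; it is only being displayed with the wrong codec, which is why writing the UTF-8 signature so the viewer can detect the encoding is enough to fix it.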

From "What is the difference between utf-8 and utf-8-sig?":

"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file). Using utf-8-sig to read a file will treat BOM (Byte order mark) as file info. instead of a string.
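
The quoted behaviour can be checked with the standard library alone. This is a small sketch, not part of the original answer; the sample string is just an example value.

```python
import codecs

text = "Jean Dieudonné"

plain = text.encode("utf-8")
signed = text.encode("utf-8-sig")

# utf-8-sig is plain UTF-8 with the three BOM bytes in front
starts_with_bom = signed.startswith(codecs.BOM_UTF8)
same_payload = signed[len(codecs.BOM_UTF8):] == plain

# Decoding as plain utf-8 keeps the BOM as a leading U+FEFF character,
# while utf-8-sig treats it as a signature and drops it
read_as_utf8 = signed.decode("utf-8")
read_as_sig = signed.decode("utf-8-sig")

print(starts_with_bom, same_payload)
print(repr(read_as_utf8))
print(repr(read_as_sig))
```

When reading such a file back in Python, passing encoding='utf-8-sig' avoids a stray U+FEFF at the start of the first field.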

python

screen-scraping

decoding