Cleaning/decoding scraped text in BeautifulSoup
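I have a problem converting text to CSV after scraping. The .csv file contains some French letters that end up as "©", "É", etc. How can I decode the text so that they appear as English letters, or so that they are written to the file correctly?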
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
from urllib import request
import pandas as pd
import os
import re

html = request.urlopen("https://en.wikipedia.org/wiki/Jean_Dieudonn%C3%A9")
bs = BS(html.read(), 'html.parser')
data = pd.DataFrame({'name': [], 'known for': []})
try:
    name = bs.find('h1').text
except Exception:
    name = ''
try:
    known = bs.select_one('th:contains("Known")').next_sibling.get_text('\n').split('\n')  # ends up with even more weird signs
except Exception:
    known = ''
x = {'name': name, 'known for': known}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([data, pd.DataFrame([x])], ignore_index=True)
data.to_csv('files.csv', sep=",", index=True)
Thanks for any ideas.
You can simply encode your data with utf-8-sig when writing the file:
data.to_csv('files.csv', sep=",", index=True, encoding='utf-8-sig')
From "What is the difference between utf-8 and utf-8-sig?":
"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file). Using utf-8-sig to read a file will treat BOM (Byte order mark) as file info. instead of a string.
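To make the quoted explanation concrete, here is a minimal sketch using only the standard library. It assumes the garbled letters come from a program reading the UTF-8 file with a legacy code page (cp1252 is used here purely as an illustration; the question does not say which program opened the file). The sample string and variable names are hypothetical.

```python
import codecs

text = "Jean Dieudonné"

# Likely cause of the garbling (an assumption): the UTF-8 bytes are
# decoded with a legacy single-byte code page such as cp1252.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")

# utf-8-sig produces the same bytes, prefixed with the 3-byte BOM.
sig_bytes = text.encode("utf-8-sig")
has_bom = sig_bytes.startswith(codecs.BOM_UTF8)
same_payload = sig_bytes[len(codecs.BOM_UTF8):] == utf8_bytes

# Decoding with utf-8-sig strips the BOM; plain utf-8 keeps it
# as a leading U+FEFF character in the string.
decoded_sig = sig_bytes.decode("utf-8-sig")
decoded_plain = sig_bytes.decode("utf-8")

print(repr(garbled))
print(has_bom, same_payload)
print(repr(decoded_sig), repr(decoded_plain))
```

The BOM does not change the text itself. It only gives a reader that looks for it a hint that the file is UTF-8, which is why writing with encoding='utf-8-sig' can fix the display without touching the scraped data.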