Python scraping encoding issues
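I am trying to scrape a website using BeautifulSoup. I am mostly successful, but I have two issues.

First, after getting the data from the website, I print it to the screen and also write it to a CSV file. The website has a price field that contains the rupee symbol along with the actual amount (sample structure of the price field: ₹ 10000). When I print the amount to the console, it prints fine with no issues. When I try to write it to the Excel sheet, I get the error "UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 28". I print the other fields to the console as well, and the Excel issue shows up for only two fields: one with the currency symbol and the other with a * symbol.

Second, I have a loop running to get all the pages of a specific search. The search results span about 344 pages, but the loop stops at around page 43, with only an HTTP error 500 as the error message.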

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as Soup

filename = "data.csv"
f = open(filename,"w")
headers = "phone_name, phone_price, phone_rating,number_of_ratings, memory, display, camera, battery, processor, Warrenty, security, OS\n"
f.write(headers)
for i in range(2): # Number of pages minus one
    my_url = 'https://www.flipkart.com/search?as=off&as-show=on&otracker=start&page={}&q=cell+phones&viewType=list'.format(i+1)
    print(my_url)
    uClient=uReq(my_url)
    page_html=uClient.read()
    page_soup = Soup(page_html,"html.parser")
    containers=page_soup.findAll("a", {"class":"_1UoZlX"})
    for container in containers:
        phone_name = container.find("div",{"class":"_3wU53n"}).text
        try:
            phone_price = container.find("div",{"class":"_1vC4OE _2rQ-NK"}).text
        except:
            phone_price = 'No Data'

Any help is much appreciated!
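
When writing a .CSV file for Excel, use the utf-8-sig encoding so that any Unicode character is supported correctly. utf-8-sig writes a byte order mark (BOM) at the start of the file, which Excel uses to recognize the file as UTF-8. If plain utf8 is used, Excel on Windows assumes the localized ANSI encoding and displays the characters incorrectly.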
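
To illustrate the two points above, here is a minimal sketch that runs without any network access. It assumes cp1252 as the Windows default code page (the 'charmap' codec in the error message; the actual default depends on the machine's locale), and `csv_bytes` is a hypothetical helper introduced only for this demonstration.

```python
import codecs
import csv
import io


def csv_bytes(rows, encoding):
    """Encode rows as CSV bytes, as open(..., 'w', newline='', encoding=...) would."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding=encoding, newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    return raw.getvalue()


rows = [["Asus Zenfone 3 Laser (Gold, 32 GB)", "\u20b99,999"]]

# Assumption: cp1252 stands in for the Windows default encoding.
# It has no mapping for the rupee sign, so encoding fails.
try:
    csv_bytes(rows, "cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# utf-8-sig succeeds and starts the output with the UTF-8 byte order mark.
data = csv_bytes(rows, "utf-8-sig")
print(data.startswith(codecs.BOM_UTF8))
print(data[len(codecs.BOM_UTF8):].decode("utf-8"))
```

The question's `open(filename,"w")` passes no encoding, so Python falls back to the locale default, which is why the same string prints fine on the console but fails when written to the file.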

#!python3
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as Soup

filename = "data.csv"
# newline='' lets the csv module control line endings;
# utf-8-sig writes the BOM that Excel needs to detect UTF-8.
with open(filename,'w',newline='',encoding='utf-8-sig') as f:
    w = csv.writer(f)
    headers = 'phone_name phone_price phone_rating number_of_ratings memory display camera battery processor Warrenty security OS'
    w.writerow(headers.split())
    for i in range(2): # Fetches pages 1 and 2
        my_url = 'https://www.flipkart.com/search?as=off&as-show=on&otracker=start&page={}&q=cell+phones&viewType=list'.format(i+1)
        print(my_url)
        uClient=uReq(my_url)
        page_html=uClient.read()
        page_soup = Soup(page_html,"html.parser")
        containers=page_soup.findAll("a", {"class":"_1UoZlX"})
        for container in containers:
            phone_name = container.find("div",{"class":"_3wU53n"}).text
            try:
                phone_price = container.find("div",{"class":"_1vC4OE _2rQ-NK"}).text
            except AttributeError: # find() returned None: no price element
                phone_price = 'No Data'
            # Only the first two columns are filled in this example.
            w.writerow([phone_name,phone_price])

Output:
phone_name,phone_price,phone_rating,number_of_ratings,memory,display,camera,battery,processor,Warrenty,security,OS
"Asus Zenfone 3 Laser (Gold, 32 GB)","₹9,999"
"Intex Aqua Style III (Champagne/Champ, 16 GB)","₹3,999"
"iVooMi i1s (Platinum Gold, 32 GB)","₹7,499"
"Xolo ERA 3X (Posh Black, 16 GB)","₹6,999"
"iVooMi Me1 (Sunshine Gold, 8 GB)","₹3,599"
"Panasonic Eluga A4 (Mocha Gold, 32 GB)","₹9,790"
Samsung Metro 313 Dual Sim,"₹2,025"
"Samsung Galaxy J3 Pro (Gold, 16 GB)","₹6,990"
Samsung Guru Music 2,"₹1,625"
"Panasonic Eluga A4 (Marine Blue, 32 GB)","₹9,640"
"Asus Zenfone 4 Selfie (Black, 32 GB)","₹9,999"
Swipe Elite 3- 4G with VoLTE,"₹3,999"
"Asus Zenfone Max (Black, 16 GB)","₹7,486"
Swipe Elite 3- 4G with VoLTE,"₹3,999"
"Swipe Elite Power (Space Grey, 16 GB)","₹5,499"
"Celkon Diamond Mega (Grey, 16 GB)","₹5,499"
"Asus Zenfone Max (Black, 32 GB)","₹7,999"
"Swipe Elite Power (Champagne Gold, 16 GB)","₹5,499"
"Asus Zenfone 4 Selfie (Gold, 32 GB)","₹9,999"
"Karbonn Aura (Champagne, 8 GB)","₹3,199"
"Infinix Note 4 (Ice Blue, 32 GB)","₹8,999"
"Infinix Note 4 (Milan Black, 32 GB)","₹8,999"
"Moto G5s Plus (Blush Gold, 64 GB)","₹15,990"
"Moto G5s Plus (Lunar Grey, 64 GB)","₹15,940"
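
The answer above only covers the encoding error. For the second problem in the question (the loop stopping around page 43 with an HTTP 500), the cause cannot be determined from the post. Assuming the error is transient, one possible approach is to retry with a growing pause and skip the page if it keeps failing. The sketch below is not from the original answer; `fetch_page` and its `opener`, `retries`, and `delay` parameters are hypothetical names chosen for illustration.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_page(url, opener=urlopen, retries=3, delay=2.0):
    """Return the page body, or None if every attempt fails with a 5xx error.

    `opener` defaults to urlopen; it is a parameter only so the function
    can be exercised without network access (an assumption of this sketch).
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except HTTPError as err:
            if err.code < 500:
                raise  # 4xx errors are unlikely to succeed on retry
            print("HTTP {} for {} (attempt {}/{})".format(
                err.code, url, attempt, retries))
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer after each failure
    return None
```

In the scraping loop, `page_html = fetch_page(my_url)` would replace the `uReq(my_url)` and `read()` calls, followed by `if page_html is None: continue` so that one failing page does not end the whole run. If the server returns 500 deliberately for pages past some limit, retrying will not help.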