Python Web Scraping Encoding
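I can't seem to get my program to recognize u'\xe9' (that is, "é"). It appears to be reading the page as ASCII, which may be the problem, so it can't print "coupé" correctly. Is there a fix?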
from lxml import html
import requests

new_list = []
page = requests.get('http://www.carfolio.com/specifications/models/?man=557')
tree = html.fromstring(page.text)
model_name = tree.xpath('//span[@class="model name"]/text()'.encode('utf-8'))
for elem in model_name:
    new_list.append(elem)
    if u'\xe9' in elem:
        u'\xe9'.encode('latin-1')
    print(elem)
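I've never dealt with encodings before. I could easily drop the elements that contain the troublesome byte, but that would delete data I need. If I switch encodings, I get even stranger results.
*Python 3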
Replace
print(elem)
with
for char in elem:
    print(bytes(char, 'latin-1').decode('latin-1'), end='')
print('')
or
print(bytes(elem, 'latin-1').decode('latin-1'), end='')
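Note that encoding a string to Latin-1 and decoding it with the same codec returns the original string, so the round trip by itself should not change what is printed. If the actual symptom is mojibake (for example "coupÃ©" instead of "coupé"), one possible cause is UTF-8 bytes that were decoded as Latin-1. The following is only a minimal sketch under that assumption; `repair_mojibake` is a hypothetical helper name, not part of requests or lxml.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (assumed cause)."""
    try:
        # Recover the original bytes, then decode them with the intended codec.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; leave it untouched.
        return text


raw = 'coupé'.encode('utf-8')        # bytes as a UTF-8 server would send them
garbled = raw.decode('latin-1')      # wrong codec produces 'coupÃ©'
fixed = repair_mojibake(garbled)     # back to 'coupé'

# The Latin-1 round trip from the answer leaves the string unchanged.
same = bytes('coupé', 'latin-1').decode('latin-1')

print(garbled, fixed, same)
```

If this is the cause, decoding the response bytes with the right codec in the first place avoids the repair step entirely.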
from lxml import html
import requests

new_list = []
page = requests.get('http://www.carfolio.com/specifications/models/?man=557')
tree = html.fromstring(page.text)
model_name = tree.xpath('//span[@class="model name"]/text()'.encode('utf-8'))
print(len(model_name))
for elem in model_name:
    for char in elem:
        if "é" not in char:
            print(char, end='')
    print(' ')
This at least keeps the same number of elements; it just skips the é, that troublesome beast.
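Skipping the character still loses data ("coupé" becomes "coup"). If the underlying failure is a UnicodeEncodeError because the output encoding cannot represent é (an assumption, since the question does not show a traceback), an alternative is to escape what cannot be encoded, or to write through a stream whose encoding is set explicitly. A minimal sketch; `to_printable` is a hypothetical helper name.

```python
import io


def to_printable(text, encoding):
    """Escape characters the target encoding cannot represent; keep the rest."""
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


ascii_safe = to_printable('coupé', 'ascii')   # é is escaped, not dropped
utf8_safe = to_printable('coupé', 'utf-8')    # UTF-8 can represent é as-is

# Writing through an explicitly UTF-8 text stream keeps the character intact.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding='utf-8', newline='\n')
print('coupé', file=stream)
stream.flush()
written = buffer.getvalue()

print(ascii_safe, utf8_safe, written)
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding='utf-8')` applies the same idea to standard output, provided the terminal can actually display UTF-8.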