Error writing data to CSV due to ascii error in Python
import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2

base_url = 'http://www.baseball-reference.com'
data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml")
soup = BeautifulSoup(data.content)
outfile = open("./Balpbp.csv", "wb")
writer = csv.writer(outfile)

# Collect the URL of every "boxscore" link on the schedule page.
url = []
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue
    if link.get_text() != 'boxscore':
        continue
    url.append(base_url + link['href'])

# Scrape the play-by-play table of each game and write its rows to the CSV.
for list in url:
    response = requests.get(list)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'id': 'play_by_play'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    writer.writerows(list_of_rows)
u'G.\xa0Holland', u'N.\xa0Cruz'...
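The error message is as follows:
Traceback (most recent call last):
  File "try.py", line 40, in <module>
    writer.writerows(list_of_rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 57: ordinal not in range(128)
When I write the data to the CSV, I end up with values that contain \x... sequences, and these prevent the data from being written. How can I change the data to remove these parts, or otherwise work around the problem?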
You cannot use unicode with the csv module in Python 2; you need to encode the strings. From the csv module documentation:
Note
This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
text = cell.text.replace(' ', '').encode("utf-8")
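To see why the original write fails and why this change helps, here is a minimal sketch using one of the values from the question. The variable names are illustrative and do not come from the original script.

```python
# One of the scraped values from the question; \xa0 is a non-breaking space.
name = u'G.\xa0Holland'

# Encoding to ASCII fails, because \xa0 (code point 160) is outside 0-127.
# This is the same UnicodeEncodeError that the traceback reports.
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent the character (as the two bytes \xc2\xa0), so the
# encoded byte string is safe to hand to Python 2's csv.writer.
utf8_bytes = name.encode('utf-8')
```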
Output of the script after encoding:
"Top of the 1st, Red Sox Batting, Tied 0-0, Orioles' Chris Tillman facing 1-2-3
"
t1,0-0,0,---,"7,(2-2) CBBFFFX",O,BOS,D. Nava,C. Tillman,2%,52%,Groundout: P-1B (P's Right)
t1,0-0,1,---,"4,(1-2) BCFX",,BOS,D. Pedroia,C. Tillman,-2%,50%,Single to RF (Line Drive to Short RF)
t1,0-0,1,1--,"5,(1-2) CFBFT",O,BOS,D. Ortiz,C. Tillman,3%,52%,Strikeout Swinging
t1,0-0,2,1--,"4,(0-2) C1CFS",O,BOS,M. Napoli,C. Tillman,2%,55%,Strikeout Swinging
,,,,,,,,,"0 runs, 1 hit, 0 errors, 1 LOB. Red Sox 0, Orioles 0."
"Bottom of the 1st, Orioles Batting, Tied 0-0, Red Sox' Jon Lester facing 1-2-3
"
b1,0-0,0,---,"4,(1-2) CBFX",O,BAL,N. Markakis,J. Lester,-2%,52%,Groundout: 3B-1B (Weak 3B)
b1,0-0,1,---,"6,(3-2) BBFFBX",,BAL,J. Hardy,J. Lester,2%,55%,Single to LF (Line Drive)
b1,0-0,1,1--,"4,(1-2) FBSX",O,BAL,A. Jones,J. Lester,-3%,52%,Popfly: SS (Deep SS)
b1,0-0,2,1--,"5,(1-2) FFBFS",O,BAL,C. Davis,J. Lester,-2%,50%,Strikeout Swinging
....................................
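The question also asks how to remove the offending character instead. One possible approach, sketched below, is to swap each non-breaking space for an ordinary space before the value is encoded. `clean_cell` is a hypothetical helper, not part of the original script or of any library.

```python
# Non-breaking space: the character that &nbsp; becomes once HTML is parsed.
NBSP = u'\xa0'


def clean_cell(text):
    """Return text with every non-breaking space replaced by a plain space."""
    return text.replace(NBSP, u' ')


# Sample values taken from the question.
row = [u'G.\xa0Holland', u'N.\xa0Cruz']
cleaned = [clean_cell(cell) for cell in row]
```

In the script, this call would take the place of `cell.text.replace(...)`. Values that contain other non-ASCII characters, such as accented names, would still need `.encode("utf-8")` under Python 2.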