utf-16-le BOM csv files
I'm downloading some CSV files from the Play Store (stats, etc.) and want to process them with Python.
cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
As you can see, they are utf-16le.
I have some code on Python 2.7 that works on some files and not on others:
import codecs
.
.
fp = codecs.open(dir_n + '/' + file_n, 'r', "utf-16")
for line in fp:
    # write to mysql db
This works until:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)
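What is the proper way to do this? I've seen "re-encode" approaches that go through the csv module and so on, but the csv module does not handle encoding by itself, so that seems like overkill for just dumping to a database.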
The proper way is to use Python 3, where Unicode support is much more sane.
As a workaround, if you are allergic to Python 3 for some reason, the best compromise is to wrap csv.reader(), like this:
import codecs
import csv

def to_utf8(fp):
    # csv.reader() in Python 2 wants byte strings, so feed it UTF-8
    for line in fp:
        yield line.encode("utf-8")

def from_utf8(fp):
    # decode every parsed column back to unicode
    for line in fp:
        yield [column.decode('utf-8') for column in line]

with codecs.open('utf16le.csv', 'r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        # "line" is a list of unicode strings
        # write to mysql db
        print line
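For comparison, here is a rough sketch of what the Python 3 route might look like. It is only an illustration under some assumptions: the helper name read_utf16_csv, the sample columns, and the temporary file are made up for the demo and are not from the question. The only library calls used are the built-in open() with encoding= and newline=, and csv.reader().

```python
import csv
import os
import tempfile

def read_utf16_csv(path):
    """Return all rows of a UTF-16 CSV file as lists of str (hypothetical helper)."""
    # encoding="utf-16" reads the BOM to pick the byte order and drops it;
    # newline="" leaves line-ending handling to the csv module.
    with open(path, "r", encoding="utf-16", newline="") as fp:
        return list(csv.reader(fp))

# Demo with made-up data: a little-endian UTF-16 file that starts with a BOM
# and contains a non-ASCII character.
sample = "Date,Package,Country\r\n2014-01-01,com.example.app,Panam\u00e1\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\xff\xfe" + sample.encode("utf-16-le"))

rows = read_utf16_csv(tmp.name)
os.remove(tmp.name)
print(rows)
```

Every cell comes back as a regular str, so there is no separate encode/decode step around csv.reader(). Whatever database driver you use still has to be told which charset the connection uses.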
Have you tried codecs.EncodedFile?
import codecs
import csv

with open('x.csv', 'rb') as f:
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]
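One thing to watch with files that start with a BOM: as far as I know, decoding with the explicit 'utf-16le' codec keeps the BOM as a U+FEFF character at the start of the first cell, while the generic 'utf-16' codec consumes it. A small sketch to check this yourself; the sample bytes and the strip_bom helper are made up for illustration:

```python
import codecs

# Made-up sample: a BOM followed by little-endian UTF-16 text.
raw = codecs.BOM_UTF16_LE + u"name,city".encode("utf-16-le")

with_bom = raw.decode("utf-16-le")   # explicit byte order: BOM stays as U+FEFF
without_bom = raw.decode("utf-16")   # generic codec: BOM is detected and dropped

def strip_bom(text):
    """Drop a leading U+FEFF if present (hypothetical helper)."""
    return text[1:] if text.startswith(u"\ufeff") else text

print(repr(with_bom))
print(repr(without_bom))
print(strip_bom(with_bom) == without_bom)
```

If the first header in your parsed rows does not compare equal to the name you expect, a leftover U+FEFF is a likely cause.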