Python / Pandas: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 133: invalid continuation byte
I'm trying to build a routine that imports several kinds of CSV or Excel files and standardizes them. Everything ran smoothly until a particular CSV showed up and threw this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 133: invalid continuation byte
I'm building a chain of try/excepts to cover the variations in the data, but I don't know how to guard against this one.
if csv_or_excel_path[-3:] == 'csv':
    try:
        table = pd.read_csv(csv_or_excel_path)
    except:
        try:
            table = pd.read_csv(csv_or_excel_path, sep=';')
        except:
            try:
                table = pd.read_csv(csv_or_excel_path, sep='\t')
            except:
                try:
                    table = pd.read_csv(csv_or_excel_path, encoding='utf-8')
                except:
                    try:
                        table = pd.read_csv(csv_or_excel_path, encoding='utf-8', sep=';')
                    except:
                        table = pd.read_csv(csv_or_excel_path, encoding='utf-8', sep='\t')
By the way, the file's separator is ";".
So:
a) I understand it would be easier to track the problem down if I could identify the character at "position 133", but I'm not sure how to find it. Any suggestions?
b) Does anyone have a suggestion for what to include in the try/except chain to get past this?
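One way to answer (a) is to read the file as raw bytes and look at what sits at the reported offset. A minimal sketch; the file name and contents here are made up to make it self-contained, so point the inspection at the real CSV and use position 133 instead:

```python
# Create a stand-in file containing a Latin-1 'Í' (byte 0xcd), which is
# exactly the kind of byte the utf-8 decoder rejects.
with open("sample.csv", "wb") as f:
    f.write("col1;col2\nÍtem;123\n".encode("iso-8859-1"))

# Read the raw bytes, bypassing any decoding.
with open("sample.csv", "rb") as f:
    raw = f.read()

pos = raw.index(0xcd)                    # with the real file, use the position from the error
print(hex(raw[pos]))                     # the offending byte: 0xcd
print(raw[max(0, pos - 10):pos + 10])    # surrounding context, often enough to spot the word
print(bytes([raw[pos]]).decode("iso-8859-1"))  # 0xcd happens to be 'Í' in ISO-8859-1
```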
Thanks @woblers and @FHTMitchell for the support. The problem was that the CSV had an unusual encoding: ISO-8859-1.
I fixed it by adding a few lines to the try/except chain. Here is the full version:
if csv_or_excel_path[-3:] == 'csv':
    try:
        table = pd.read_csv(csv_or_excel_path)
    except:
        try:
            table = pd.read_csv(csv_or_excel_path, sep=';')
        except:
            try:
                table = pd.read_csv(csv_or_excel_path, sep='\t')
            except:
                try:
                    table = pd.read_csv(csv_or_excel_path, encoding='utf-8')
                except:
                    try:
                        table = pd.read_csv(csv_or_excel_path, encoding='utf-8', sep=';')
                    except:
                        try:
                            table = pd.read_csv(csv_or_excel_path, encoding='utf-8', sep='\t')
                        except:
                            try:
                                table = pd.read_csv(csv_or_excel_path, encoding="ISO-8859-1")
                            except:
                                try:
                                    table = pd.read_csv(csv_or_excel_path, encoding="ISO-8859-1", sep=";")
                                except:
                                    table = pd.read_csv(csv_or_excel_path, encoding="ISO-8859-1", sep="\t")
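Instead of guessing branch by branch, the encoding can also be probed up front. A small stdlib-only sketch (sniff_encoding is a made-up helper, not from the answers in this thread; third-party packages like chardet do this more thoroughly):

```python
def sniff_encoding(path, candidates=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file cleanly.

    Note: ISO-8859-1 maps every possible byte to some character, so as the
    last candidate it acts as a catch-all fallback.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of {} can decode {!r}".format(candidates, path))

# Demo on a made-up Latin-1 file:
with open("demo.csv", "wb") as f:
    f.write("name;value\nÍtem;1\n".encode("iso-8859-1"))

print(sniff_encoding("demo.csv"))  # → ISO-8859-1
```

The detected encoding can then be passed straight through, e.g. pd.read_csv(path, encoding=sniff_encoding(path), sep=";").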
For the record, this is probably cleaner than multiple nested try/excepts:
import os
import pandas as pd

def read_csv(filepath):
    if os.path.splitext(filepath)[1] != '.csv':
        return  # or whatever
    seps = [',', ';', '\t']  # ',' is default
    encodings = [None, 'utf-8', 'ISO-8859-1']  # None is default
    for sep in seps:
        for encoding in encodings:
            try:
                return pd.read_csv(filepath, encoding=encoding, sep=sep)
            except Exception:  # should really be more specific
                pass
    raise ValueError("{!r} has no encoding in {} or separator in {}"
                     .format(filepath, encodings, seps))
Another possibility is:
with open(path_to_file, encoding="utf8", errors="ignore") as f:
    table = pd.read_csv(f, sep=";")
Here, errors="ignore" will silently drop problematic byte sequences coming from the read() calls. You can also have such byte sequences replaced with a placeholder character instead. Either way, this should remove the need for a lot of painful error handling and nested try/excepts.
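The difference between errors="ignore" and errors="replace" is easy to see on the offending byte itself (a small illustration, not tied to any specific file):

```python
bad = b"\xcdtem"  # 0xcd is not valid utf-8 on its own

print(bad.decode("utf-8", errors="ignore"))   # → 'tem'  (bad byte silently dropped)
print(bad.decode("utf-8", errors="replace"))  # → '�tem' (bad byte becomes U+FFFD)
print(bad.decode("iso-8859-1"))               # → 'Ítem' (the intended text)
```

So "ignore" loses data quietly, while "replace" at least leaves a visible marker where the undecodable bytes were.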