'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
I am trying to read a dataset into df1, but it does not work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a long traceback, but this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
The data is indeed not encoded as UTF-8; everything is ASCII except for a single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
sep=";", encoding='cp1252')
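As a rough check of the reasoning above, the same decoding can be tried on the quoted bytes alone, without pandas or a network request. This is only a sketch: the variable names are made up for illustration, and the byte string is the one sample quoted in this answer, not the whole file.

```python
# The sample bytes quoted above: plain ASCII plus a single 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."

# Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte
    print(exc)

# Decoding as Windows codepage 1252 maps 0x92 to a right single quote.
text = raw.decode("cp1252")
print(text)
```

The position reported in the exception is the offset of the 0x92 byte within the data being decoded, which is why the error message mentions position 18.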
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 \
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5
3 American Samoa .. .. .. .. .. .. .. .. .. ..
4 Andorra .. .. .. .. .. .. .. .. .. ..
2010 2011 2012 2013 Unnamed: 15 2014 2015
0 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 .. .. .. .. NaN .. ..
4 .. .. .. .. NaN .. ..
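However, I noticed a problem when loading from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data: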
>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
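The kind of re-coding described here can be reproduced with plain strings. The sketch below assumes the text was encoded to UTF-8 somewhere along the way and then decoded a second time as cp1252; that is one plausible reading of the behaviour, not a confirmed description of what pandas does internally. The variable names are illustrative.

```python
# The correctly decoded value, with U+2019 (right single quotation mark).
good = "Korea, Dem. People\u2019s Rep."

# Encoding to UTF-8 and decoding as cp1252 turns the one quote character
# into three characters (classic mojibake).
mojibake = good.encode("utf-8").decode("cp1252")
print(repr(mojibake))

# Reversing the two steps recovers the original text, which is what the
# .encode('cp1252').decode('utf8') call above relies on.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repr(repaired))
```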
This is a known bug in Pandas. You can work around it by using urllib.request to load the URL and passing the response to pd.read_csv():
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
It turned out that a CSV created on macOS was being parsed on a Windows machine, and I got the UnicodeDecodeError.
To get rid of this error, try passing encoding='mac_roman' to the read_csv method of the pandas library.
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()
Output:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Unnamed: 15 2014 2015
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 American Samoa .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
4 Andorra .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
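Since several encodings have been suggested in this thread, it may help to compare how each one treats the problem byte. The sketch below uses only the sample bytes quoted earlier; the helper name decode_with and the list of candidates are chosen for illustration.

```python
# Sample bytes from the file: ASCII plus one 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."


def decode_with(data: bytes, encoding: str) -> str:
    """Decode the bytes with the given encoding name."""
    return data.decode(encoding)


# Single-byte encodings mentioned in the answers, plus Latin-1 for reference.
candidates = ["cp1252", "mac_roman", "latin-1"]
results = {enc: decode_with(raw, enc) for enc in candidates}

for enc, text in results.items():
    print(f"{enc:10} {text!r}")
```

All three decode without raising, but they disagree on the character: cp1252 gives a right single quote, mac_roman gives í, and latin-1 gives a C1 control character. The absence of an error therefore does not show that an encoding is the right one; checking a known value such as this country name is a more reliable test.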
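This problem occurs because your file contains characters that are not valid in the encoding being used to read it.
For example, a file that you are reading as UTF-8 contains some Windows-1250 characters.
You should remove or replace those characters to solve your problem.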
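If replacing or dropping the offending bytes is acceptable, Python's decoder can do it through its error handlers. The sketch below works on the sample bytes only, and the variable names are illustrative. Note that both options lose information, unlike decoding with the correct encoding. Recent pandas versions (1.3 and later) also accept an encoding_errors argument to read_csv that takes the same handler names.

```python
# Sample bytes from the file: ASCII plus one 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."

# errors="replace" substitutes U+FFFD (the replacement character)
# for each byte that cannot be decoded.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the undecodable byte.
dropped = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(dropped))
```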
This works:
df = pd.read_csv(inputfile, engine = 'python')