Encoding is not changing when using Python pandas.read_csv on Azure
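I am reading a CSV file with Python pandas and trying to change the encoding because the file contains some German letters. It seems that Azure always keeps the same encoding (presumably the default).
No matter what I do, I always get the same error in the Azure portal:
'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
I get the same error even when I set utf-16, latin1, cp1252, etc.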
import pandas as pd
import pysftp

# host, username, password and cnopts are defined elsewhere
with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')
By the way, when I test it locally on a Windows machine, it works fine.
Full error:
Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
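For anyone puzzling over the message itself, here is a small sketch of what it means. Byte 0xc4 is the letter "Ä" in cp1252 (and latin1), but in UTF-8 it can only be the first byte of a two-byte sequence, so a plain ASCII letter after it triggers "invalid continuation byte". The sample word and the helper name `try_decode` are made up for illustration and are not from the question.

```python
# Hypothetical sample: the German word "Ärger" encoded as cp1252.
raw = "\u00c4rger".encode("cp1252")  # first byte is 0xC4


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


utf8_result = try_decode(raw, "utf-8")      # fails on byte 0xc4
cp1252_result = try_decode(raw, "cp1252")   # decodes cleanly
print(utf8_result)
print(cp1252_result)
```

So the traceback suggests the bytes were still being decoded as UTF-8 somewhere, regardless of the encoding argument passed to read_csv.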
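You can specify the encoding like this:
pd.read_csv('file', encoding="ISO-8859-1")
Alternatively, if you want to detect the file's own encoding and pass it to read_csv, you can add something like this:
import chardet
with open(r'C:\test.csv', 'rb') as f:  # chardet needs raw bytes
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])
Reference: read_csv in the Python pandas documentation.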
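If installing chardet is not an option, a cruder alternative (not detection, just trial and error) is to try a short list of candidate encodings in order. This is only a sketch using the standard library; the function name `decode_with_fallback`, the candidate list, and the sample bytes are assumptions for illustration.

```python
def decode_with_fallback(data, encodings=("utf-8-sig", "cp1252")):
    """Try each encoding in order; return (text, encoding) for the first that works."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


# Hypothetical file contents with German letters, encoded as cp1252.
raw = "Gr\u00f6\u00dfe;Stra\u00dfe\n1;2\n".encode("cp1252")
text, used = decode_with_fallback(raw)
print(used)
```

Order matters here: latin1 accepts every possible byte, so if it is in the list it should come last, otherwise nothing after it is ever tried. The decoded text can then be wrapped in io.StringIO and handed to read_csv.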
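I found a solution. Basically, sftp.open keeps utf-8 by default. Why the encoding passed to read_csv has no effect on Azure Linux remains an open question.
Using sftp.getfo to read the file into an in-memory object and then parsing that into a DataFrame works fine: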
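import io

ba = io.BytesIO()
sftp.getfo(i.filename, ba)  # download the remote file as raw bytes
ba.seek(0)                  # rewind the buffer before reading it
f = io.TextIOWrapper(ba, encoding='cp1252')  # decode explicitly as cp1252
# error_bad_lines is deprecated since pandas 1.3 in favor of on_bad_lines='skip'
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)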
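The same idea can be tried without an SFTP server. The sketch below swaps the download for made-up sample bytes (the column names and values are hypothetical) and passes the binary buffer straight to read_csv, relying on pandas to decode a bytes buffer with the requested encoding. It is a simplified variant of the code above, not a replacement for it.

```python
import io

import pandas as pd

# Hypothetical file contents: a semicolon-delimited CSV encoded as cp1252.
raw = "Name;Stadt\nJ\u00e4ger;K\u00f6ln\n".encode("cp1252")

# Stand-in for: ba = io.BytesIO(); sftp.getfo(i.filename, ba); ba.seek(0)
ba = io.BytesIO(raw)

# Given a binary buffer, read_csv applies the requested encoding itself.
df = pd.read_csv(ba, delimiter=";", encoding="cp1252", dtype=str)
print(df)
```

Working from raw bytes keeps the decoding step in one place, so the result should not depend on how the SFTP file object presents itself to pandas.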