Encoding is not changing when using Python pandas.read_csv on Azure

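I am reading a CSV file with Python pandas and trying to change the encoding because the file contains some German letters. It seems Azure always keeps the same encoding (presumably the default).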

No matter what I do, the Azure portal always shows the same error: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The same error appears even if I set the encoding to utf-16, latin1, cp1252, etc.

import pandas as pd
import pysftp

# host, username, password and cnopts are defined elsewhere
with pysftp.Connection(host, username=username, password=password, cnopts=cnopts) as sftp:
    for i in sftp.listdir_attr():
        with sftp.open(i.filename) as f:
            df = pd.read_csv(f, delimiter=';', encoding='cp1252')

By the way, when I test locally on a Windows machine, it works fine.

Full error:

Result: Failure
Exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte
Stack:
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 355, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure_functions_worker/dispatcher.py", line 542, in __run_sync_func
    return func(**params)
  File "/home/site/wwwroot/ce_etl/etl_main.py", line 141, in main
    df = pd.read_csv(f, delimiter=';', encoding=r"utf-8-sig", error_bad_lines=False)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 488, in _read
    return parser.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 223, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 880, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1080, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1204, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas/_libs/parsers.pyx", line 1217, in pandas._libs.parsers.TextReader._string_convert
  File "pandas/_libs/parsers.pyx", line 1396, in pandas._libs.parsers._string_box_utf8
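
For context on the message itself: byte 0xc4 is the letter Ä in cp1252 and latin1, while in UTF-8 it is only valid as the first byte of a two-byte sequence. The sketch below uses a made-up sample row (an assumption, since the real file is not shown) and the hypothetical helper `try_decode` to reproduce the same message with the standard library alone. It only illustrates that the failure comes from decoding these bytes as UTF-8; it does not show where in the Azure setup that decoding happens.

```python
# Hypothetical sample row; the real file's contents are not shown in the post.
# The first byte, 0xc4, is the cp1252/latin1 encoding of the German letter A-umlaut.
raw = b"\xc4rger;1\n"

def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"

utf8_result = try_decode(raw, "utf-8")      # fails: 0xc4 must be followed by a continuation byte
cp1252_result = try_decode(raw, "cp1252")   # succeeds
latin1_result = try_decode(raw, "latin1")   # succeeds, same text as cp1252 for this byte

print(utf8_result)
```

The printed message matches the one in the traceback, which is consistent with the bytes being decoded as UTF-8 regardless of the `encoding` argument passed to `read_csv`.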

You can specify the encoding like this:

pd.read_csv('file', encoding="ISO-8859-1")

Alternatively, if you want to detect the file's own encoding and pass it to read_csv, you can add something like this:

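import chardet
with open(r'C:\test.csv', 'rb') as f:  # chardet needs raw bytes, so open in binary mode
    result = chardet.detect(f.read())  # or f.readline() if the file is large
df = pd.read_csv(r'C:\test.csv', encoding=result['encoding'])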

Reference: read_csv in the Python pandas documentation
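
Detection with chardet is a statistical guess, so it can be wrong on short inputs. When only a few encodings are plausible, a simpler alternative (not what the answer above proposes) is to try them in order on the raw bytes. This is a minimal sketch; the function name `decode_with_fallback` and the candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8-sig", "cp1252", "latin1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")

# Bytes as they would appear in a cp1252 file starting with A-umlaut (0xc4).
text, used = decode_with_fallback(b"\xc4rger;1\n")
print(used)
```

Order matters here: latin1 accepts every byte value and never fails, so it belongs last. The returned encoding name could then be passed as `encoding=` to `pd.read_csv` when reading the same bytes.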

I found the solution. Basically, sftp.open keeps utf-8 by default. Why the encoding cannot be changed in the read_csv method on Azure Linux is still an open question.

Using sftp.getfo to read the file into an in-memory object and then parsing that into a DataFrame works fine:

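import io

# Download the raw bytes into memory instead of using sftp.open()
ba = io.BytesIO()
sftp.getfo(i.filename, ba)
ba.seek(0)

# Decode explicitly as cp1252 before handing the stream to pandas
f = io.TextIOWrapper(ba, encoding='cp1252')
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0 (use on_bad_lines='skip' there)
df = pd.read_csv(f, delimiter=';', encoding='cp1252', dtype=str,
                 error_bad_lines=False)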
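
The same pattern can be tried without an SFTP server. The sketch below is an illustration under stated assumptions: the bytes literal stands in for whatever sftp.getfo would write into the buffer, the helper name `rows_from_bytes` is made up, and the standard-library csv module replaces pandas so the example runs without third-party packages.

```python
import csv
import io

def rows_from_bytes(buffer: io.BytesIO, encoding: str = "cp1252", delimiter: str = ";"):
    """Decode a bytes buffer with an explicit encoding, then parse it as delimited text."""
    buffer.seek(0)  # rewind, as is needed after sftp.getfo() has filled the buffer
    text = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    return list(csv.reader(text, delimiter=delimiter))

# Stand-in for the downloaded file; 0xf6 is o-umlaut in cp1252.
ba = io.BytesIO(b"Name;Stadt\nJ\xf6rg;K\xf6ln\n")
rows = rows_from_bytes(ba)
```

The point of the design is that decoding happens in one place, on raw bytes, with an encoding you chose yourself, so nothing upstream gets a chance to decode the data as UTF-8 first.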