Python notebook - Importing data file with delimiter with two characters causes an error

Question

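Question: We are getting the following error when loading a data file (with a two-character delimiter) into Azure SQL Db. What might we be doing wrong, and how can we fix it?

Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL Db. The delimiter in the data file has two characters, ~*. With the following code, we get the error shown below: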

pandas dataframe low memory not supported with the 'python' engine

Code:

import sqlalchemy as sq
import pandas as pd

data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer' , encoding='cp1252')
.............
.............

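Remark: If we remove the low_memory option, we get the following error instead. However, we do not get this error for other data files that are larger than this one but use a single-character delimiter.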

ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

Answer 1

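Your file may be too large for the dataframe to fit in memory. Can you try processing it in pieces? I.e., read 1000 lines, create a dataframe from them, push it to SQL, then read the next 1000 lines, and so on?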

The nrows and skiprows arguments of read_csv can be used for this.
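A minimal sketch of the piecewise approach, under a few assumptions: it uses read_csv's chunksize argument rather than tracking nrows and skiprows by hand, an in-memory string stands in for the real data file, and sqlite3 stands in for Azure SQL Db. The table name my_table and the sample rows are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for the real data file (hypothetical contents).
raw = "id~*name\n1~*alpha\n2~*beta\n3~*gamma\n"

# sqlite3 stands in for Azure SQL Db here; a SQLAlchemy engine would be
# passed to to_sql() in the same position.
conn = sqlite3.connect(":memory:")

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk is held in memory at a time. The separator is escaped because
# multi-character separators are treated as regular expressions.
reader = pd.read_csv(io.StringIO(raw), sep=r"~\*", engine="python", chunksize=2)

rows_written = 0
for chunk in reader:
    # Append each chunk to the same table instead of building one big frame.
    chunk.to_sql("my_table", conn, if_exists="append", index=False)
    rows_written += len(chunk)

row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
```

With a real file, the chunk size would be something like the 1000 lines suggested above, and the path and encoding from the question would replace the StringIO object.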

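A possible workaround: preprocess the file with sed 's/~\*/;/g' (assuming ; does not otherwise occur in the data), after which you can use the C engine, which has a smaller memory footprint.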
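If sed is not convenient inside the notebook, the same preprocessing could be sketched in Python. This is only an illustration: replace_delimiter is a hypothetical helper, the sample data is made up, and it assumes that ; never appears in the data and that ~* never appears inside a quoted field.

```python
import io

import pandas as pd


def replace_delimiter(src, dst, old="~*", new=";"):
    """Copy lines from src to dst, swapping the two-character delimiter."""
    # Plain string replacement, so '*' needs no escaping here.
    for line in src:
        dst.write(line.replace(old, new))


# In-memory stand-ins for the input and output files (hypothetical contents).
src = io.StringIO("id~*name\n1~*alpha\n2~*beta\n")
dst = io.StringIO()

replace_delimiter(src, dst)
dst.seek(0)

# A single-character separator lets pandas use its default C engine.
df = pd.read_csv(dst, sep=";")
```

Because the helper works line by line, it never holds the whole file in memory; with real files, src and dst would be opened with open() using the file's encoding.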

Answer 2

From the documentation for pandas.read_csv():

In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

Because the separator is interpreted as a regular expression, and * has a special meaning in regular expressions, you need to escape it. Use sep=r'~\*'
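A small sketch of the difference, using Python's re module directly and then read_csv on a made-up in-memory sample. The sample rows are hypothetical. The low_memory option is left out, since the question shows it is rejected by the python engine.

```python
import io
import re

import pandas as pd

line = "a~*b"

# Unescaped: '~*' means "zero or more tildes", so the pattern does not
# split the line into the two intended fields.
unescaped = re.split(r"~*", line)

# Escaped: a literal '~' followed by a literal '*'.
escaped = re.split(r"~\*", line)

# The same escaped pattern passed to read_csv (hypothetical sample data).
raw = "id~*name\n1~*alpha\n2~*beta\n"
df = pd.read_csv(io.StringIO(raw), sep=r"~\*", engine="python")
```

Here escaped is ['a', 'b'], while unescaped contains extra pieces, which is consistent with the unescaped separator producing many more fields than intended.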

python

azure

pandas

azure-sql-database