How to speed up pandas to_sql
I am trying to upload data to an MS Azure SQL database using pandas to_sql, and it takes a very long time. I often have to run it before I go to bed, and when I wake up in the morning it has finished, but it has taken several hours, and if an error comes up I am not able to address it. Here is my code:
import urllib.parse
from sqlalchemy import create_engine, event

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany for batched INSERTs
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

connection = engine.connect()
I then run this command for the SQL ingestion:

master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi', index=False)

I have played with the chunksize, and the sweet spot seems to be 100, which is not fast enough considering I am usually trying to upload 800,000 to 2,000,000 records at a time. If I increase it beyond that, I get an error which seems to be related only to the chunk size:
OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')
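One constraint worth checking here: with method='multi', pandas renders each chunk as a single INSERT with one bind parameter per cell, and SQL Server allows at most 2100 parameters per statement, so chunksize multiplied by the column count has to stay under that cap. It may or may not be the cause of the 08S01 error above, but it does explain why very small chunk sizes are the only ones that work. A minimal sketch of deriving a chunk size from that limit (reusing master_data and engine from the question; 2100 is SQL Server's documented cap):

# With method='multi', each chunk becomes one INSERT carrying
# chunksize * n_columns bind parameters; SQL Server caps a single
# statement at 2100 parameters, so size chunks from that limit.
MAX_PARAMS = 2100
safe_chunksize = MAX_PARAMS // len(master_data.columns)
master_data.to_sql('table_name', engine, chunksize=safe_chunksize,
                   if_exists='append', method='multi', index=False)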
Not sure whether your issue has been resolved, but I did want to provide an answer here with Azure SQL Database libraries for Python specific information and some useful resources to investigate and resolve this issue, as applicable.
An example of querying an Azure SQL database directly with pyodbc:
Quickstart: Use Python to query Azure SQL Database Single Instance & Managed Instance
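In case the linked quickstart is unavailable, a minimal direct-pyodbc sketch along the same lines (the server name, credentials, and query below are placeholders, not values from this thread):

import pyodbc

# Placeholder connection values; substitute your own server and credentials.
conn = pyodbc.connect(
    'Driver={ODBC Driver 17 for SQL Server};'
    'Server=tcp:yourserver.database.windows.net,1433;'
    'Database=yourdb;Uid=youruser;Pwd=yourpassword;'
    'Encrypt=yes;TrustServerCertificate=no;'
)
cursor = conn.cursor()
cursor.execute('SELECT TOP 5 name FROM sys.tables')  # any test query works
for row in cursor.fetchall():
    print(row.name)
conn.close()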
An example using a Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe
main.py

"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# 1. Constants
AZUREUID = 'myuserid'                                   # Azure SQL database userid
AZUREPWD = '************'                               # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net'  # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                      # Azure SQL database name
TABLE = 'DataTable'                                     # Azure SQL database table name (pandas creates it if it does not exist)
DRIVER = 'ODBC Driver 13 for SQL Server'                # ODBC Driver

def main():
    """Main function"""

    # 2. Build a connection string
    connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
        uid=AZUREUID,
        password=AZUREPWD,
        server=AZURESRV,
        database=AZUREDB,
        driver=DRIVER.replace(' ', '+'))

    # 3. Read dummy data into a dataframe
    df = pd.read_csv('./data/data.csv')

    # 4. Create the SQLAlchemy engine and write data to SQL
    engn = create_engine(connectionstring)
    df.to_sql(TABLE, engn, if_exists='append')

    # 5. Read data from SQL into a dataframe
    query = 'SELECT * FROM {table}'.format(table=TABLE)
    dfsql = pd.read_sql(query, engn)

    print(dfsql.head())

if __name__ == "__main__":
    main()
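One note on the sample above: it writes with a plain to_sql call and no fast_executemany. On SQLAlchemy 1.3 and later, the mssql+pyodbc dialect accepts a fast_executemany flag directly on create_engine, which replaces the event-listener workaround from the question; a sketch under that version assumption:

# Assumes SQLAlchemy 1.3+: the mssql+pyodbc dialect forwards
# fast_executemany to pyodbc, so no event listener is needed.
engn = create_engine(connectionstring, fast_executemany=True)

# Leave method=None (the default) so pandas uses executemany,
# which is the code path fast_executemany accelerates.
df.to_sql(TABLE, engn, if_exists='append', index=False)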
Finally, the following resources should help in comparing specific implementations and performance issues. The Stack Overflow thread is likely the best resource, but the monitoring and performance tuning documentation is useful for investigating and mitigating any server-side performance issues, etc.
Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance
Regards,
Mike
import urllib.parse
from sqlalchemy import create_engine, event

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)
conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # Enable pyodbc's fast_executemany for batched INSERTs
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()

connection = engine.connect()
The following line completes the database ingestion. I previously ran into issues with the chunksize, but fixed them by adding the method and index parameters.

ingest_data.to_sql('db_table_name', engine, if_exists='append', chunksize=100000, method=None, index=False)
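The combination in that line is what matters: method=None makes pandas issue a plain executemany, which is exactly the path the fast_executemany flag accelerates, whereas method='multi' builds large multi-row INSERT statements that bypass it. A rough way to check throughput on your own data (the timing wrapper is illustrative; ingest_data and the table name come from this answer):

import time

# Time the ingestion to compare chunk sizes and settings.
start = time.time()
ingest_data.to_sql('db_table_name', engine, if_exists='append',
                   chunksize=100000, method=None, index=False)
print('Loaded %d rows in %.1f seconds' % (len(ingest_data), time.time() - start))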