How to speed up pandas to_sql

I'm trying to upload data to an Azure SQL database with pandas to_sql, and it takes a very long time. I often have to run it before going to bed; it has finished by the time I wake up, but it takes several hours, and if an error occurs I'm not there to deal with it. Here is my code:

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
)

conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
        
connection = engine.connect()
connection

Then I run this command for the SQL ingestion:

master_data.to_sql('table_name', engine, chunksize=500, if_exists='append', method='multi', index=False)

I've experimented with the chunk size; the sweet spot seems to be 100, which isn't fast enough considering I usually try to upload 800,000-2,000,000 records at a time. If I increase it beyond that, I get an error that seems to be related only to the chunk size:

OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure (0) (SQLExecDirectW)')

Not sure whether your issue has been resolved, but I did want to provide an answer here with information specific to the Azure SQL Database libraries for Python, plus some useful resources to investigate and mitigate this issue, as applicable.

An example of using pyodbc to query an Azure SQL database directly: Quickstart: Use Python to query Azure SQL Database Single Instance &amp; Managed Instance

An example using a Pandas dataframe: How to read and write to an Azure SQL database from a Pandas dataframe

main.py

"""Read write to Azure SQL database from pandas"""
import pyodbc
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

# 1. Constants
AZUREUID = 'myuserid'                                    # Azure SQL database userid
AZUREPWD = '************'                                # Azure SQL database password
AZURESRV = 'shareddatabaseserver.database.windows.net'   # Azure SQL database server name (fully qualified)
AZUREDB = 'Pandas'                                       # Azure SQL database name
TABLE = 'DataTable'                                      # Azure SQL database table name (pandas creates it if it does not exist)
DRIVER = 'ODBC Driver 13 for SQL Server'                 # ODBC Driver


def main():
    """Main function"""

    # 2. Build a connection string
    connectionstring = 'mssql+pyodbc://{uid}:{password}@{server}:1433/{database}?driver={driver}'.format(
        uid=AZUREUID,
        password=AZUREPWD,
        server=AZURESRV,
        database=AZUREDB,
        driver=DRIVER.replace(' ', '+'))

    # 3. Read dummy data into a dataframe
    df = pd.read_csv('./data/data.csv')

    # 4. Create SQLAlchemy engine and write data to SQL
    engn = create_engine(connectionstring)
    df.to_sql(TABLE, engn, if_exists='append')

    # 5. Read data from SQL into a dataframe
    query = 'SELECT * FROM {table}'.format(table=TABLE)
    dfsql = pd.read_sql(query, engn)

    print(dfsql.head())


if __name__ == "__main__":
    main()
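For frames in the 800,000-2,000,000 row range mentioned in the question, passing `chunksize` to `to_sql` keeps each INSERT batch bounded instead of sending everything in one statement. A minimal sketch of that pattern, using an in-memory SQLite engine and a generated dummy frame as stand-ins for the Azure SQL connection string and CSV data above:

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine; in practice this would be the mssql+pyodbc
# connection string built earlier in this answer.
engine = create_engine('sqlite://')

# Dummy frame standing in for the CSV data.
df = pd.DataFrame({'a': np.arange(10_000), 'b': np.random.rand(10_000)})

# chunksize bounds how many rows go into each INSERT batch, so memory
# use stays flat and a failure loses at most one batch, not the whole run.
df.to_sql('DataTable', engine, if_exists='append', index=False, chunksize=1_000)

rows = pd.read_sql('SELECT COUNT(*) AS n FROM DataTable', engine)
print(int(rows['n'][0]))  # prints 10000
```

The same `chunksize` argument works unchanged against the `mssql+pyodbc` engine; only the connection string differs.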

Finally, the following resources should help when comparing specific implementations and performance issues; the Stack Overflow thread is probably the best resource, while the monitoring and performance tuning documentation is useful for investigating and mitigating any server-side performance issues, etc.

Monitoring and performance tuning in Azure SQL Database and Azure SQL Managed Instance

Regards, Mike

params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=%s,1433;' % server +
    'Database=%s;' % database +
    'Uid=%s;' % username +
    'Pwd={%s};' % password +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;'
    )


conn_str = 'mssql+pyodbc:///?odbc_connect=' + params
engine = create_engine(conn_str)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True
        cursor.commit()
        
connection = engine.connect()
connection

The next line completes the database ingestion. I previously ran into problems with chunksize, but fixed them by adding the method and index arguments.

ingest_data.to_sql('db_table_name', engine, if_exists='append', chunksize=100000, method=None, index=False)
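Since SQLAlchemy 1.3, the `before_cursor_execute` event listener above can be replaced by passing `fast_executemany=True` directly to `create_engine` for the pyodbc dialect, which batches parameter sets in the driver instead of issuing one round trip per row. A sketch under that assumption, with hypothetical placeholder credentials (it cannot actually connect without an Azure SQL instance and pyodbc installed):

```python
import urllib.parse
from sqlalchemy import create_engine

# Hypothetical placeholder credentials; substitute real values.
driver = 'ODBC Driver 17 for SQL Server'
server = 'myserver.database.windows.net'
database = 'mydb'
username = 'myuser'
password = 'mypassword'

# Same ODBC connection string as in the question, URL-encoded
# so it can be embedded in the SQLAlchemy URL.
params = urllib.parse.quote_plus(
    f'Driver={driver};Server={server},1433;Database={database};'
    f'Uid={username};Pwd={{{password}}};Encrypt=yes;TrustServerCertificate=no;'
)


def make_engine():
    # fast_executemany=True (SQLAlchemy 1.3+, pyodbc dialect only)
    # replaces the event-listener approach shown above.
    return create_engine(
        'mssql+pyodbc:///?odbc_connect=' + params,
        fast_executemany=True,
    )
```

With the flag set on the engine, `to_sql` can then be called with the default `method=None`, since the batching happens at the driver level rather than in the generated SQL.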