Sending API JSON data to Exasol DB table

I am working with fake JSON data from a dummy JSON site, which looks like this:

[
  {
    "postId": 1,
    "id": 1,
    "name": "id labore ex et quam laborum",
    "email": "Eliseo@gardner.biz",
    "body": "laudantium enim quasi est quidem magnam voluptate ipsam eos\ntempora quo necessitatibus\ndolor quam autem quasi\nreiciendis et nam sapiente accusantium"
  },
  {
    "postId": 1,
    "id": 2,
    "name": "quo vero reiciendis velit similique earum",
    "email": "Jayne_Kuhic@sydney.com",
    "body": "est natus enim nihil est dolore omnis voluptatem numquam\net omnis occaecati quod ullam at\nvoluptatem error expedita pariatur\nnihil sint nostrum voluptatem reiciendis et"
  }
]

I read the API data with the requests library and then send it on to an Exasol database table. See the code below:

import requests
import pyexasol

def get_api_data():
    r = requests.get("http://jsonplaceholder.typicode.com/comments")
    data = r.json()
    return data
    
def connection():
    session = pyexasol.connect_local_config('my_exasol')
    return session

def send_api_data():
    s = connection()
    data = get_api_data()
    for row in data:
        s.execute("""INSERT INTO TESTBED.TEST_API(postId, id, name, email, body) VALUES ({postId}, {id},{name},
        {email}, {body})""", {'postId': row['postId'], 'id': row['id'], 'name': row['name'], 'email': row['email'],
        'body': row['body']})

send_api_data()

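This works, but it is very slow: inserting 500 records takes almost 2 minutes. I know there must be a more efficient way to do this. In practice, I will be pulling data from an API that returns thousands of records, and I want to send those records to the database table.

Any ideas on a better approach?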

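Executing single INSERT statements is slow in Exasol because it is a column-oriented database. You should use IMPORT instead. Make sure to read the best practices for pyexasol as well. Also consider enabling compression.

For your example, try the following. In my case, importing the data took 0.7 seconds.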
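As a rough illustration of the compression suggestion, here is a minimal sketch. It assumes that pyexasol.connect_local_config forwards extra keyword arguments to the connection and that compression=True is the relevant connection option; the helper name open_compressed_connection is hypothetical. Check both assumptions against the pyexasol documentation for your version.

```python
def open_compressed_connection(config_section="my_exasol"):
    """Open a pyexasol connection with compression requested (sketch).

    Assumption: connect_local_config accepts extra keyword arguments
    and passes them through, and `compression` is one of them.
    """
    # Imported here so the helper can be defined even where pyexasol
    # is not installed; the import only runs when a connection is opened.
    import pyexasol

    return pyexasol.connect_local_config(config_section, compression=True)
```

Compression trades CPU time for less network traffic, so it is most likely to help when the client and the database are not on the same fast local network.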

import requests
import pyexasol
import time

def get_api_data():
    r = requests.get("http://jsonplaceholder.typicode.com/comments")
    data = r.json()
    return data
    
def connection():
    session = pyexasol.connect_local_config('my_exasol')
    return session

def send_api_data():
    s = connection()
    data = get_api_data()

    data_for_import = [(row['postId'], row['id'], row['name'], row['email'], row['body']) for row in data]
    start = time.time()
    s.import_from_iterable(data_for_import, ("TESTBED","TEST_API"))
    print("Finished import after ", time.time() - start, " seconds")

send_api_data()
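
For responses with many thousands of records, one option is to hand import_from_iterable a generator instead of a fully built list, and to spell out the column order in one place. The sketch below is not part of the original answer: COLUMNS and to_import_rows are hypothetical names, and it assumes the columns of TESTBED.TEST_API are in the order postId, id, name, email, body.

```python
# Assumed column order of TESTBED.TEST_API (hypothetical; adjust to the real table).
COLUMNS = ("postId", "id", "name", "email", "body")

def to_import_rows(records, columns=COLUMNS):
    """Lazily turn JSON objects (dicts) into tuples in a fixed column order."""
    for record in records:
        # Keys not listed in `columns` are ignored; a missing key raises KeyError.
        yield tuple(record[col] for col in columns)

def send_api_data(session, records):
    """Bulk-load records through an already open pyexasol connection."""
    # The tuples are produced one at a time while the import is running.
    session.import_from_iterable(to_import_rows(records), ("TESTBED", "TEST_API"))
```

Usage would be along the lines of send_api_data(connection(), get_api_data()), reusing the two functions from the example above. A KeyError on a missing field is deliberate here: it surfaces a malformed record instead of silently loading a shifted row.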