Pymongo vs psycopg3: orders of magnitude difference for writes?

I have an application that collects cryptocurrency prices from decentralized exchanges (non-commercial; the goal is mostly to learn some database technologies with real data). I got it working with both MongoDB and PostgreSQL. However, I'm seeing a huge difference in write performance, in favor of MongoDB. I'm talking about several orders of magnitude. I know MongoDB is a NoSQL database and is advertised as handling this kind of workload more efficiently, but I'm wondering whether my Postgres implementation is missing something. I'll describe below how I implemented the logic, and I'll try to provide comparable metrics (as best I can).

To talk to the databases, I use psycopg3 for the Postgres implementation and pymongo for the MongoDB implementation.

Here is the data structure I write to the databases:

from __future__ import annotations

from typing import NamedTuple

import pendulum


class PricingInfo(NamedTuple):
    blockchain_name: str
    dex_name: str
    pair_address: str
    token0_symbol: str
    token1_symbol: str
    token0_address: str
    token1_address: str
    raw_reserve0: int
    raw_reserve1: int
    reserve0: float
    reserve1: float
    mid_price: float
    datetime: pendulum.DateTime
    inverted: bool = False

    @classmethod
    def from_dict(cls, doc: dict) -> PricingInfo:
        doc.pop("_id", None)
        return cls(**doc)

    def to_dict(self) -> dict:
        return self._asdict()
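
For reference, the round trip through to_dict/from_dict is lossless, since from_dict only strips the _id field Mongo adds. A quick example; every value below is made up purely for illustration:

# All values are placeholders, not real on-chain data.
info = PricingInfo(
    blockchain_name="avalanche",
    dex_name="pangolin",
    pair_address="0xPAIR",        # placeholder address
    token0_symbol="WAVAX",
    token1_symbol="USDC",
    token0_address="0xTOKEN0",    # placeholder address
    token1_address="0xTOKEN1",    # placeholder address
    raw_reserve0=10**18,
    raw_reserve1=25 * 10**6,
    reserve0=1.0,
    reserve1=25.0,
    mid_price=25.0,
    datetime=pendulum.now("UTC"),
)

assert PricingInfo.from_dict(info.to_dict()) == info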

This is identical for both implementations. I have a few hundred of these to write to the database every second. Here is what I do for Postgres:

def register_prices(self, prices: list[PricingInfo]) -> None:
    query = """
    insert into prices (
        blockchain_name,
        dex_name,
        pair_address,
        token0_address,
        token1_address,
        raw_reserve0,
        raw_reserve1,
        reserve0,
        reserve1,
        mid_price,
        datetime
    )
    values (
        %(blockchain_name)s,
        %(dex_name)s,
        %(pair_address)s,
        %(token0_address)s,
        %(token1_address)s,
        %(raw_reserve0)s,
        %(raw_reserve1)s,
        %(reserve0)s,
        %(reserve1)s,
        %(mid_price)s,
        %(datetime)s
    )
    """

    keys_to_keep = {
        "blockchain_name",
        "dex_name",
        "pair_address",
        "token0_address",
        "token1_address",
        "raw_reserve0",
        "raw_reserve1",
        "reserve0",
        "reserve1",
        "mid_price",
        "datetime",
    }

    with psycopg.connect(self.db_uri) as conn:
        with conn.cursor() as cur:
            start = time.perf_counter()
            if len(prices) == 1:
                cur.execute(
                    query,
                    {
                        k: v
                        for k, v in prices[0].to_dict().items()
                        if k in keys_to_keep
                    },
                )
            elif len(prices) > 1:
                cur.executemany(
                    query,
                    [
                        {k: v for k, v in p.to_dict().items() if k in keys_to_keep}
                        for p in prices
                    ],
                )

        conn.commit()
    delta = time.perf_counter() - start

    if self.dex_name in {"pangolin", "trader_joe"}:
        logger.warning(f"Inserting {len(prices)}")
        logger.warning(f"Inserting prices took {delta} seconds")

Here is my table definition:

create table prices (
    id serial primary key,
    blockchain_name varchar(100) not null,
    dex_name varchar(100) not null,
    raw_reserve0 decimal not null,
    raw_reserve1 decimal not null,
    reserve0 decimal not null,
    reserve1 decimal not null,
    mid_price decimal not null,
    datetime timestamp with time zone not null,
    pair_address varchar(50) not null,
    token0_address varchar(50) not null,
    token1_address varchar(50) not null,
    foreign key (blockchain_name, dex_name, pair_address) references pairs (blockchain_name, dex_name, pair_address),
    foreign key (blockchain_name, dex_name, token0_address) references tokens (blockchain_name, dex_name, address),
    foreign key (blockchain_name, dex_name, token1_address) references tokens (blockchain_name, dex_name, address)
);
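
Note that every insert also has to validate three composite foreign keys against pairs and tokens, and every commit waits for a WAL flush. A rough way to see how much of the gap these account for is to time the same batch against an unlogged clone of the table; this is a diagnostic sketch only, and prices_bench is a made-up scratch table:

import time

import psycopg

def time_bench_insert(db_uri: str, query: str, rows: list[dict]) -> float:
    """Time one executemany batch against an unlogged, FK-free clone of prices."""
    with psycopg.connect(db_uri) as conn:
        with conn.cursor() as cur:
            # LIKE copies the columns, not-null constraints and (with
            # INCLUDING DEFAULTS) the id sequence default, but never the
            # foreign keys; UNLOGGED additionally skips WAL for this table.
            cur.execute(
                "create unlogged table if not exists prices_bench"
                " (like prices including defaults)"
            )
            start = time.perf_counter()
            # Same INSERT as register_prices, retargeted at the clone.
            cur.executemany(query.replace("into prices", "into prices_bench"), rows)
            conn.commit()
            return time.perf_counter() - start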

And for MongoDB:

def register_prices(self, prices: list[PricingInfo]) -> None:

    start = time.perf_counter()

    prices_table = self._db["prices"]
    prices_table.insert_many(price.to_dict() for price in prices)

    delta = time.perf_counter() - start

    if self.dex_name in {"pangolin", "trader_joe"}:
        logger.warning(f"Inserting {len(prices)}")
        logger.warning(f"Inserting prices took {delta} seconds")

The application runs in exactly the same way against both databases. There is one small difference in the Postgres path, where the data dicts need slight reshaping to fit the schema (I do some normalization there), but since I only have ~600 dicts to reshape per write, I don't believe that is the bottleneck. In both cases, 8 processes write to the database concurrently.

For Postgres, I get these metrics:

Inserting 587
Inserting prices took 1.175270811014343 seconds
Inserting 611
Inserting prices took 0.3126116280036513 seconds

And for Mongo:

Inserting 588
Inserting prices took 0.03671051503624767 seconds
Inserting 612
Inserting prices took 0.032324473024345934 seconds

These timings are fairly stable: around 1 s and 300 ms for Postgres, and around 30 ms for Mongo. It is quite strange that Postgres shows two distinct write times for roughly the same amount of data. Still, even in its best case, Postgres is 10x slower than Mongo.
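
Part of the gap may be durability semantics rather than driver overhead: a Postgres commit waits for the WAL to be flushed to disk, while, as far as I understand, MongoDB's default write concern acknowledges insert_many once the primary has applied the writes in memory, with journaling happening in the background. A sketch for a more like-for-like comparison (session-scoped setting; a crash can lose the most recent transactions, but cannot corrupt the database):

import psycopg

def register_prices_relaxed(db_uri: str, query: str, rows: list[dict]) -> None:
    with psycopg.connect(db_uri) as conn:
        with conn.cursor() as cur:
            # Commits no longer wait for the WAL fsync, which is closer to
            # MongoDB's default acknowledgment behaviour.
            cur.execute("set synchronous_commit to off")
            cur.executemany(query, rows)
        conn.commit()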

Additional notes:

Am I doing something wrong with Postgres?

The fastest way to bulk-load data into Postgres is usually the COPY command. In psycopg3 this is available via the procedures shown here: psycopg3 COPY. One caveat is that COPY is all-or-nothing: either all of the data is imported, or an error means none of it is.
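
A minimal sketch of that approach with psycopg3's cursor.copy() (the column list matches the table above; write_row handles quoting and encoding):

import psycopg

COLUMNS = (
    "blockchain_name", "dex_name", "pair_address", "token0_address",
    "token1_address", "raw_reserve0", "raw_reserve1", "reserve0",
    "reserve1", "mid_price", "datetime",
)

def copy_prices(db_uri: str, rows: list[dict]) -> None:
    stmt = f"copy prices ({', '.join(COLUMNS)}) from stdin"
    with psycopg.connect(db_uri) as conn:
        with conn.cursor() as cur:
            # cursor.copy() streams all rows over a single COPY command:
            # one round trip for the whole batch instead of one per row.
            with cur.copy(stmt) as copy:
                for row in rows:
                    copy.write_row(tuple(row[c] for c in COLUMNS))
        conn.commit()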