Proper choice for a database consisting of a dictionary

Question

I have a large dictionary in the following format:

dict["randomKey"]=[dict1,dict2,int,string]

There will be roughly tens of thousands of keys. dict1 itself has ~100 keys.

The question is: I need to store this dictionary on one server so that multiple machines can read it. What is the best format for that?

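I am currently using shelve, which is very easy to use. However, to find the entries where some key of dict1 or dict2 has a certain value, I have to go through all the keys of the main dictionary (dict), and that takes a while. I am worried about what happens when the dictionary grows, because with 50k keys this will take a really long time. I have read about sqlite3 and it seems like a good option, but I don't know whether it fits my needs.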
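
For illustration, the lookup described above amounts to something like the following sketch. The helper name `find_keys`, the file name `demo_shelf`, and the sample data are made up for this example; the real entries are larger.

```python
import os
import shelve
import tempfile


def find_keys(db, inner_key, wanted):
    """Return the main keys whose dict1 or dict2 maps inner_key to wanted."""
    matches = []
    for key in db:  # visits every key in the shelf
        dict1, dict2, _int_val, _str_val = db[key]  # unpickles the whole entry
        if dict1.get(inner_key) == wanted or dict2.get(inner_key) == wanted:
            matches.append(key)
    return matches


# Small made-up data set in the same [dict1, dict2, int, string] layout.
path = os.path.join(tempfile.mkdtemp(), "demo_shelf")
with shelve.open(path) as db:
    for i in range(1, 201):
        db[str(i)] = [
            {f"k{v}": f"v{v + i}" for v in range(1, 11)},
            {f"k{v}": f"w{v + i}" for v in range(1, 11)},
            i,
            str(i),
        ]
    result = find_keys(db, "k6", "v134")

print(result)
```

Every lookup has to load and unpickle each entry, so the cost grows linearly with the number of main keys.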

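I don't really need the database to be accessible from anything other than Python programs (although that would be nice), but I do need it to be fast, stable, and readable by many computers at the same time. Thanks!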

Answer 1

I would choose a database with native JSON support that can search efficiently inside JSON dictionaries. I like PostgreSQL:

A table for your data:

create table dict (
  key text primary key,
  dict1 jsonb not null default '{}',
  dict2 jsonb not null default '{}',
  intval integer not null,
  strval text not null
);

Fill it with some sample values:

insert into dict
select
  i::text,
  (select
    jsonb_object(
      array_agg('k'||v::text),
      array_agg('v'||(v+i)::text)
    ) from generate_series(1,1000) as v
  ),
  (select
    jsonb_object(
      array_agg('k'||v::text),
      array_agg('v'||(v+i)::text)
    ) from generate_series(1,1000) as v
  ),
  i,
  i::text
from generate_series(1,10000) as i;

Get the keys of the entries where key k6 in dict1 has the value v134:

select key from dict where dict1 @> '{"k6":"v134"}';
 key 
-----
 128
(1 row)

Time: 232.843 ms
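
From Python, the same lookup could be issued roughly as follows. This is only a sketch: it assumes a DB-API driver that uses `%s` placeholders (psycopg2 is one such driver), and the helper names `containment_query` and `find_keys` as well as the connection string in the comment are hypothetical.

```python
import json


def containment_query(column, inner_key, value):
    """Build SQL text and parameters for a jsonb containment (@>) lookup."""
    # Column names cannot be passed as query parameters, so whitelist them.
    if column not in ("dict1", "dict2"):
        raise ValueError(f"unexpected column: {column!r}")
    sql = f"select key from dict where {column} @> %s::jsonb"
    # The JSON document is sent as a parameter and cast to jsonb by the server.
    return sql, (json.dumps({inner_key: value}),)


def find_keys(conn, column, inner_key, value):
    """Run the lookup on an open DB-API connection and return matching keys."""
    sql, params = containment_query(column, inner_key, value)
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return [row[0] for row in cur.fetchall()]


# Usage against a running server (not executed here):
#   import psycopg2
#   conn = psycopg2.connect("dbname=test")  # hypothetical connection string
#   print(find_keys(conn, "dict1", "k6", "v134"))
```

Passing the JSON document as a parameter keeps quoting out of the SQL text, and every machine that needs to read the data can open its own connection to the server.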

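If your table is very large, you can even index the dictionary columns to speed up searches. These indexes will be bigger than the table itself, though, and the database may decide that it is safer not to use them anyway: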

create index dict_dict1_idx on dict using gin(dict1);
create index dict_dict2_idx on dict using gin(dict2);

You can force the database to use the indexes if you know it is beneficial:

set enable_seqscan=off;
select key from dict where dict1 @> '{"k6":"v134"}';
 key 
-----
 128
(1 row)

Time: 8.955 ms
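
Before relying on that setting, it can help to look at the plan the database actually picks. The sketch below is one way to do that from Python, under stated assumptions: a DB-API connection that is not in autocommit mode (the psycopg2 default), so that `set local` only lasts until the transaction ends. The helper name `explain_with_index` is hypothetical.

```python
def explain_with_index(conn, sql, params=None):
    """Return the query plan for sql with sequential scans discouraged."""
    with conn.cursor() as cur:
        # "set local" applies only to the current transaction.
        cur.execute("set local enable_seqscan = off")
        cur.execute("explain " + sql, params)
        plan = [row[0] for row in cur.fetchall()]
    conn.rollback()  # end the transaction so the setting does not leak
    return plan


# Usage against a running server (not executed here):
#   for line in explain_with_index(
#       conn,
#       "select key from dict where dict1 @> %s::jsonb",
#       ('{"k6": "v134"}',),
#   ):
#       print(line)
```

Scoping the setting to one transaction avoids turning off sequential scans for every other query on the same connection.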

python

sqlite

dictionary

shelve