How do I insert data into an on-disk table quickly in KDB?

Question

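I have five years of minute bars for roughly 10,000 symbols, stored as CSV files. In total that is about 50GB of text. I have 32GB of RAM.

I'm trying to load all of this data into a KDB table so that it is easy to query.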

symbols: `$(-4_')(string') key `:/home/chris/sync/us_equities

f: {`$(":/home/chris/sync/us_equities/", x, ".csv")}

load_symbol: {(0!(update "P"$-1_/:t, s: x from flip `t`o`h`l`c`v!("*FFFFI";",")0: f(string x))) }

({`:/home/chris/sync/price_data insert (load_symbol x)}) each symbols

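Should I be using a straightforward table, or should I be using partitions/splays?
I'm adding the ticker as an extra column of symbol type; is that right?
The last line, the insert, is very slow. It looks like it will take about a day to process, maybe longer. How can I optimise it? I tried peach but that was slower. It seems to start out very fast and then get slower as the each progresses.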

Thanks!

Answer 1

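Given the size of the data and the frequency of updates, a flat file is not recommended here. Every insert has to rewrite the file from scratch, so the insert time grows linearly with the total number of rows.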

q)t:([]a:til 10000000;b:til 10000000)
q)`:t set t
`:t
q)\t `:t insert t
305
q)\t `:t insert t
365
q)\t `:t insert t
574
q)\t `:t insert t
809
q)\t `:t insert t
1236
q)\t `:t insert t
2687
q)\t `:t insert t
3200

Compare this with a splayed table, where each column of the new data is appended to its corresponding file, which gives constant insert times.

q)t:([]a:til 10000000;b:til 10000000)
q)`:t/ set t
`:t/
q)\t `:t insert t
166
q)\t `:t insert t
101
q)\t `:t insert t
97
q)\t `:t insert t
100
q)\t `:t insert t
111
q)\t `:t insert t
113

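If the symbol isn't already in the file, then adding it to the table is a good idea. However, I would recommend naming the column sym rather than s. This is simply because sym is the convention in kdb, and some built-in functions assume this name.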

By my estimate this table is too large for a simple splayed table. I would partition it by date or by month, depending on the kind of queries you run.
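As a rough sketch of what writing a date-partitioned table could look like, the functions below append each day's rows to its own partition directory. The database root, the table name bars, and the helper names writeDay and writeSymbol are all hypothetical, and the sketch assumes the loaded table has a timestamp column named time and a symbol column named sym. Symbol columns are enumerated with .Q.en because a splayed table on disk cannot store raw symbols.

```q
/ Assumptions, not from the original post: the database root and the table
/ name bars are hypothetical; data has columns sym,time,o,h,l,c,v with time
/ a timestamp and sym a symbol.
db:`:/home/chris/sync/db

/ append the rows of data that fall on date d to db/<d>/bars/
writeDay:{[data;d]
  part:` sv db,(`$string d),`bars`;
  part upsert .Q.en[db] select from data where d=`date$time}

/ split one symbol's table by date and append each slice to its partition
writeSymbol:{[data] writeDay[data] each distinct `date$data`time}
```

If the question's loader were adjusted to name its columns time and sym, the load loop might become something like `{writeSymbol load_symbol x} each symbols`. Because the files are loaded one symbol at a time, the rows for each sym should stay contiguous within every partition.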

Sorting by sym and applying the parted attribute is a must if your queries will often select a subset of the syms. Sorting by time bucket within each sym is required in order to use an asof join, because it uses a binary search.

The following code will do this, but since your files are already separated by sym, you should be able to skip the sort on sym.

/ to sort a table in memory and apply parted attribute
update `p#sym from `sym`time xasc data 
/ to sort a table on disk and apply parted attribute
`sym`time xasc `:path/to/partition
@[`:path/to/partition;`sym;`p#]

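If your queries are better suited to selecting a specific time window across all symbols, you would be better off sorting by time bucket only and applying the sorted attribute to that column.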
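For that time-window case, a minimal sketch along the same lines as the parted example might look like this. It assumes the time column is named time, and the path is the same placeholder used above.

```q
/ to sort a table in memory by time and apply the sorted attribute
update `s#time from `time xasc data
/ to sort a table on disk by time and apply the sorted attribute
`time xasc `:path/to/partition
@[`:path/to/partition;`time;`s#]
```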

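Additionally, you may want to consider streaming the CSV files with .Q.fs or .Q.fsn to reduce the memory usage of any single load. This would allow you to load the data using multiple threads or other processes with the same memory overhead.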
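A minimal sketch of a streamed load with .Q.fs is shown below. It reuses f and symbols from the question and keeps the question's parse spec, which assumes the CSVs have no header row. The database root, the table name bars, and the helper names colNames, loadChunk and loadSymbol are hypothetical. For brevity the destination here is a single splayed directory; it could be swapped for a per-partition path.

```q
/ Assumptions, not from the original post: the database root and the table
/ name bars are hypothetical, and the CSV files have no header row.
db:`:/home/chris/sync/db
colNames:`time`o`h`l`c`v

/ parse one chunk of CSV lines, tag it with symbol s, append it to disk
loadChunk:{[s;x]
  t:flip colNames!("*FFFFI";",") 0: x;
  t:update time:"P"$-1_/:time, sym:s from t;
  (` sv db,`bars`) upsert .Q.en[db;t]}

/ stream one symbol's file through loadChunk, a chunk at a time
loadSymbol:{[s] .Q.fs[loadChunk s] f string s}

loadSymbol each symbols
```

.Q.fsn works the same way but takes the chunk size in bytes as a third argument, which gives more direct control over how much of a file is held in memory at once.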
