How to work around "Too many open files error" when writing arrow dataset with pyarrow?

import pyarrow as pa
import pyarrow.dataset

f = 'my_partitioned_big_dataset'
ds = pa.dataset.dataset(f, format='parquet', partitioning='hive')
s = ds.scanner()
pa.dataset.write_dataset(s.head(827981), 'here', format="arrow", partitioning=ds.partitioning)  # is ok
pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)  # fails
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-54-9160d6de8c45> in <module>
----> 1 pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)
...
OSError: [Errno 24] Failed to open local file '...'. Detail: [errno 24] Too many open files

I'm on Linux (Ubuntu). My ulimit seems fine?

$ ulimit -Hn
524288
$ ulimit -Sn
1024
$ cat /proc/sys/fs/file-max
9223372036854775807

ulimit -Ha
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 128085
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 128085
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any ideas on how to work around this? I have a feeling my ulimit is already set fairly high, but maybe I can tune it further. Does pyarrow perhaps have some functionality to release open files on the fly?

There is currently no way to control this from the code. The feature (max_open_files) was recently added to the C++ library, and ARROW-13703 tracks exposing it in the Python library. I'm not sure whether it will make the cutoff for 6.0 (which should be released fairly soon).
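Once that option is available from Python (newer pyarrow releases expose it as a max_open_files keyword on write_dataset), the call would look roughly like the sketch below. The value 512 is just an illustrative choice; the point is to keep the number of simultaneously open output files below your soft descriptor limit.

import pyarrow as pa
import pyarrow.dataset

ds = pa.dataset.dataset('my_partitioned_big_dataset', format='parquet', partitioning='hive')
s = ds.scanner()

# max_open_files caps how many partition files the writer keeps open at once;
# when the cap is hit, the least recently used file is closed so the total
# stays under the OS file-descriptor limit.
pa.dataset.write_dataset(
    s.head(827982),
    'here',
    format="arrow",
    partitioning=ds.partitioning,
    max_open_files=512,  # keep this comfortably below `ulimit -Sn`
)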

In the meantime, your open files limit ((-n) 1024) is the default and fairly conservative. You should be able to raise that limit by a few thousand quite safely. See this question for more discussion.
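As a sketch of that workaround: you can run ulimit -n 8192 in the shell before launching Python, or raise the soft limit from inside the process with the standard resource module (the hard limit of 524288 shown by ulimit -Hn is the ceiling a non-root process can reach). The 8192 value here is just an example.

import resource

# Current soft/hard limits for open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)  # e.g. 1024 524288

# Raise the soft limit; a few thousand is plenty for this write,
# and it must not exceed the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))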