How to work around "Too many open files error" when writing arrow dataset with pyarrow?
import pyarrow as pa
import pyarrow.dataset as dataset  # needed for the dataset.dataset(...) call below

f = 'my_partitioned_big_dataset'
ds = dataset.dataset(f, format='parquet', partitioning='hive')
s = ds.scanner()
pa.dataset.write_dataset(s.head(827981), 'here', format="arrow", partitioning=ds.partitioning)  # is ok
pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)  # fails
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-54-9160d6de8c45> in <module>
----> 1 pa.dataset.write_dataset(s.head(827982), 'here', format="arrow", partitioning=ds.partitioning)
...
OSError: [Errno 24] Failed to open local file '...'. Detail: [errno 24] Too many open files
I am on Linux (Ubuntu). My ulimit settings look fine?
$ ulimit -Hn
524288
$ ulimit -Sn
1024
$ cat /proc/sys/fs/file-max
9223372036854775807
$ ulimit -Ha
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128085
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 128085
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128085
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 128085
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Any ideas on how to work around this? I have a feeling my ulimits are already set fairly high, but maybe I can tweak them further. Perhaps pyarrow has some feature for releasing open files on the fly?
There is currently no way to control this from code. This feature (max_open_files) was recently added to the C++ library, and ARROW-13703 tracks adding it to the Python library. I'm not sure whether it will make the cutoff for 6.0 (which should be released fairly soon).
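Once that lands, the write call should accept the option directly. A minimal sketch, assuming the Python binding exposes the parameter under the same max_open_files name as the C++ option (the final keyword name may differ):

# Sketch only: requires a pyarrow version where write_dataset exposes
# max_open_files (tracked by ARROW-13703).
import pyarrow as pa
import pyarrow.dataset as dataset

ds = dataset.dataset('my_partitioned_big_dataset', format='parquet', partitioning='hive')
pa.dataset.write_dataset(
    ds.scanner().head(827982),
    'here',
    format="arrow",
    partitioning=ds.partitioning,
    max_open_files=512,  # keep the writer well under the 1024 soft limit
)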
In the meantime, your open files limit ((-n) 1024) is the default and fairly conservative. You should be able to raise that limit by a few thousand quite safely. See this question for more discussion.
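For example, the soft limit can be raised for the current process (up to the 524288 hard limit shown above) before calling write_dataset. A minimal sketch using the standard-library resource module:

# Raise the soft open-files limit for this process only; the hard limit
# (524288 here) is the ceiling a non-root process may raise it to.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(8192, hard), hard))

Equivalently, running ulimit -n 8192 in the shell before launching Python raises the soft limit for that session.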