Convert single-column txt with 300 million rows to NumPy array faster

Question

I have a txt file with more than 300 million rows and 1 column. I am trying to read it and convert it to a NumPy array. So far I have tried

label = np.loadtxt('/path/to/file')

and

for lines in fileinput.input('/path/to/file'):
    do_something_with(lines)

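np.loadtxt seems slightly faster, but reading the single txt file still takes hours. The file has more than 300 million rows and 1 column, yet it is only about 950 MB. I suspect np.loadtxt also reads the file line by line, which would explain the long processing time.

Is there any way to speed up the reading and conversion while preserving the order of the rows?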

Answer 1

It sounds like the file is simple enough that readlines will work.

Make a small sample file:

In [2]: arr = np.random.random((100,1))
In [4]: np.savetxt('test.txt', arr, fmt='%f')
In [6]: !head test.txt
0.872225
0.365394
0.802365
0.140455
0.041390
0.531483
0.415459
0.906439
0.789604
0.493369

A straightforward loadtxt:

In [8]: arr1 = np.loadtxt('test.txt')
In [9]: arr1.shape
Out[9]: (100,)

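I think there is a parameter (ndmin) that forces a (100,1) shape, but I'll leave the result as it is.

Let's try readlines, using np.array to convert the list of strings to a float array: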
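For reference, here is a minimal sketch of how the column shape could be kept. It assumes np.loadtxt's ndmin argument; the file name ndmin_demo.txt and the three sample values are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Illustrative data only; the real file would have ~300 million rows.
values = np.array([0.25, 0.5, 0.75])

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "ndmin_demo.txt")  # hypothetical name
    np.savetxt(sample_path, values, fmt="%f")

    flat = np.loadtxt(sample_path)             # default: 1-D, shape (3,)
    column = np.loadtxt(sample_path, ndmin=2)  # keeps the column, shape (3, 1)

# Reshaping the 1-D result afterwards gives the same (n, 1) layout.
reshaped = flat.reshape(-1, 1)

print(flat.shape, column.shape, reshaped.shape)
```

Reshaping after the fact costs essentially nothing, so the 1-D result is usually fine to keep until a 2-D shape is actually needed.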

In [11]: arr2 = np.array(open('test.txt').readlines(), dtype=float)
In [12]: arr2.shape
Out[12]: (100,)
In [13]: np.allclose(arr1,arr2)
Out[13]: True

Comparing times:

In [14]: timeit arr2 = np.array(open('test.txt').readlines(), dtype=float)
77.5 µs ± 961 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [15]: timeit arr1 = np.loadtxt('test.txt')
605 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

readlines is much faster, roughly 8x on this 100-line sample.

Another approach is fromfile:
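One caveat for the real file: readlines builds a Python string for every line, which for 300 million lines may need far more memory than the 950 MB file itself. A possible variation, sketched here under that assumption, converts the file in fixed-size chunks and concatenates the pieces in order. The function name read_column_in_chunks and the chunk_lines default are my own choices, not something from NumPy, and the sketch assumes the file contains no blank lines.

```python
from itertools import islice

import numpy as np


def read_column_in_chunks(path, chunk_lines=1_000_000):
    """Read a one-number-per-line text file into a 1-D float array.

    Lines are converted chunk by chunk, so only `chunk_lines` strings
    exist at any time. Row order is preserved because chunks are read
    and concatenated sequentially.
    """
    parts = []
    with open(path) as f:
        while True:
            # Take the next block of lines from the file iterator.
            lines = list(islice(f, chunk_lines))
            if not lines:
                break
            # Same conversion as the readlines example, on one chunk.
            parts.append(np.array(lines, dtype=float))
    if not parts:
        return np.empty(0, dtype=float)
    return np.concatenate(parts)
```

Calling read_column_in_chunks('test.txt') should give the same array as the readlines version; I have not timed it on a file of the size in the question.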

In [18]: arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
In [19]: arr3.shape
Out[19]: (100,)
In [20]: np.allclose(arr1,arr3)
Out[20]: True
In [21]: timeit arr3 = np.fromfile('test.txt',dtype=float, sep=' ')
118 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

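Not quite as fast, but still better than loadtxt. And yes, loadtxt reads the file line by line.

genfromtxt is a bit better than loadtxt.

pandas is supposed to have a fast csv reader, but that doesn't seem to be the case here: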
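The timings above come from a 100-line file, where fixed per-call overhead may dominate, so the ranking could look different on a much larger file. Below is a rough sketch for rerunning the same three readers on a bigger sample. The helper names (make_sample, time_readers, READERS) and the 100,000-row size are assumptions for illustration, and no particular result is implied.

```python
import os
import tempfile
import time

import numpy as np


def load_with_loadtxt(path):
    return np.loadtxt(path)


def load_with_readlines(path):
    # Same idea as the readlines example, with the file closed explicitly.
    with open(path) as f:
        return np.array(f.readlines(), dtype=float)


def load_with_fromfile(path):
    # sep=' ' puts fromfile in text mode; whitespace such as newlines matches.
    return np.fromfile(path, dtype=float, sep=' ')


READERS = {
    "loadtxt": load_with_loadtxt,
    "readlines": load_with_readlines,
    "fromfile": load_with_fromfile,
}


def make_sample(path, n_rows):
    """Write n_rows random floats, one per line, like the test file above."""
    np.savetxt(path, np.random.random(n_rows), fmt='%f')


def time_readers(path, readers=READERS):
    """Return {name: (seconds, array)} for a single run of each reader."""
    results = {}
    for name, reader in readers.items():
        start = time.perf_counter()
        arr = reader(path)
        results[name] = (time.perf_counter() - start, arr)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.txt")
        make_sample(sample_path, 100_000)
        for name, (seconds, arr) in time_readers(sample_path).items():
            print(f"{name}: {seconds:.3f} s, shape {arr.shape}")
```

A single perf_counter run is noisier than timeit, so treat the numbers as a rough guide only.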

In [33]: timeit arr4=pd.read_csv('test.txt',header=None).to_numpy()
1.12 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Tags: python, performance, file-io, numpy