Numpy string arrays - Strange behavior calling tobytes() on numpy string array

Question

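I am trying to use Numpy to vectorize an operation to parse a text file containing lines of numbers and convert the data into a numpy array. The data in the text file looks like this: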

*** .txt file ***

1 0 0 0 0
2 1 0 0 0
3 1 1 0 0
4 0 1 0 0
5 0 0 1 0
6 1 0 1 0
7 1 1 1 0
8 0 1 1 0
9 0.5 0.5 0 0
10 0.5 0.5 1 0
11 0.5 0 0.5 0
12 1 0.5 0.5 0
13 0.5 1 0.5 0
14 0 0.5 0.5 0

*** /.txt file ***

My approach is to read the lines in using file.readlines(), then convert that list of line strings into a numpy array as follows - file.readlines() part omitted for testing.

import numpy as np

short_list = ['1 0 0 0 0\n',
              '2 1 0 0 0\n',
              '3 1 1 0 0\n']

long_list = ['1 0 0 0 0\n',
             '2 1 0 0 0\n',
             '3 1 1 0 0\n',
             '4 0 1 0 0\n',
             '5 0 0 1 0\n',
             '6 1 0 1 0\n',
             '7 1 1 1 0\n',
             '8 0 1 1 0\n',
             '9 0.5 0.5 0 0\n',
             '10 0.5 0.5 1 0\n',
             '11 0.5 0 0.5 0\n',
             '12 1 0.5 0.5 0\n',
             '13 0.5 1 0.5 0\n',
             '14 0 0.5 0.5 0\n']


def lines_to_npy(lines):
    n_lines = len(lines)
    lines_array = np.array(lines).astype('S')
    tmp = lines_array.tobytes().decode('ascii')
    print(repr(tmp))
    print(lines_array.dtype)
    print(np.array(tmp.split(), dtype=np.int32).reshape(n_lines, -1))

lines_to_npy(short_list)
lines_to_npy(long_list)

Calling the function with short_list produces the following output:

'1 0 0 0 0\n2 1 0 0 0\n3 1 1 0 0\n'
|S10
[[1 0 0 0 0]
 [2 1 0 0 0]
 [3 1 1 0 0]]

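This is the desired result (from reading around I gather that "|S10" means that each element in the array is a 10 character string for which the endianness doesn't matter). However, calling with the long list inserts several null characters \x00 at the end of each string which makes it harder to parse.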

'1 0 0 0 0\n\x00\x00\x00\x00\x002 1 0 0 0\n\x00\x00\x00\x00\x003 1 1 0 0\n\x00\x00\x00\x00\x004 0 1 0 0\n\x00\x00\x00\x00\x005 0 0 1 0\n\x00\x00\x00\x00\x006 1 0 1 0\n\x00\x00\x00\x00\x007 1 1 1 0\n\x00\x00\x00\x00\x008 0 1 1 0\n\x00\x00\x00\x00\x009 0.5 0.5 0 0\n\x0010 0.5 0.5 1 0\n11 0.5 0 0.5 0\n12 1 0.5 0.5 0\n13 0.5 1 0.5 0\n14 0 0.5 0.5 0\n'
|S15

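Note that there is an error in my function when loading the null characters into the array, which prevents the final result. I know that a "cheap and dirty" solution would be to just strip off the trailing null characters. I also know I could use Pandas to accomplish the main goal, but I would like to understand why this behavior occurs.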

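The \x00 is padded onto the end of each string so that each string has length 15. This sort of makes sense, because the short array has dtype |S10 and each string is exactly 10 characters long. The long array contains 14 strings and has dtype |S15, and extra \x00 are appended so that each item in the array is 15 characters long.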

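I am confused because the number of elements in the list of strings (3 vs 14) has no correlation to the length of each string, so I don't understand why the dtype changes to |S15 when more list elements are added.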

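Update: I did some more research on how to efficiently read data from a text file into a numpy array. I need a fast way to do this because I am reading files of about 10 million lines. numpy.loadtxt() and numpy.genfromtxt() are candidates, but they are slow because they are implemented in Python and do basically the same thing as manually looping over file.readlines(), stripping, and splitting the line strings. I noticed in my own testing that using numpy.loadtxt() was about twice as slow as the aforementioned manual method, which has also been noted elsewhere.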

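I found that using pandas.read_csv().to_numpy(), I was able to get a ~10x speedup over looping through file.readlines(). See this answer here. I hope this helps anyone with the same application in the future.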
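As a rough sketch of the pandas route mentioned above, the following reads a small in-memory sample rather than a real file. The sample text, the variable names, and the choice of sep=r"\s+" with header=None are assumptions made for illustration (whitespace-separated columns, no header row); they are not taken from the linked answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real text file: three lines from the question's data.
sample = io.StringIO("1 0 0 0 0\n9 0.5 0.5 0 0\n14 0 0.5 0.5 0\n")

# Split columns on runs of whitespace; the data has no header row.
frame = pd.read_csv(sample, sep=r"\s+", header=None)

# Convert the DataFrame to a plain 2-D numpy array.
data = frame.to_numpy()
print(data.shape)
print(data.dtype)
```

For a real file, the StringIO object would be replaced by the file path.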

Answer 1

I am trying to use Numpy to vectorize an operation to parse a text file containing lines of numbers and convert the data into a numpy array.

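Vectorization has nothing to do with reading in your data. Doing, for example, tmp.split() is still calling an ordinary Python function on an ordinary Python string object, creating many Python string objects as a result, and running in the main Python bytecode interpreter loop. No amount of Numpy code will change that.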

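That said, there is no meaningful performance gain to be had here anyway. Any halfway reasonable approach to reading and interpreting (i.e. parsing) the file will be lightning fast compared to fetching the contents from a hard drive, and still much faster than reading from an SSD.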

My approach is to read the lines in using file.readlines(), then convert that list of line strings into a numpy array as follows - file.readlines() part omitted for testing.

Don't do that. The whole process is far more complicated than necessary. Keep reading.

tmp = lines_array.tobytes().decode('ascii')

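This just gives you the raw contents of the file (plus any null padding, explained below), which you could get directly by using .read() instead of .readlines().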
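A minimal sketch of that point, using io.StringIO as a stand-in for an open file (the sample text and variable names are assumptions for illustration): joining the result of .readlines() reproduces what .read() returns.

```python
import io

# Hypothetical stand-in for the file contents: the first three lines of the data.
text = "1 0 0 0 0\n2 1 0 0 0\n3 1 1 0 0\n"

lines = io.StringIO(text).readlines()  # list of line strings, newlines kept
whole = io.StringIO(text).read()       # one string holding the whole contents

# Joining the lines gives back exactly what read() returned.
rejoined = "".join(lines)
print(rejoined == whole)
```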

from reading around I gather that "|S10" means that each element in the array is a 10 character string for which the endianness doesn't matter

Not quite; the elements are arrays (in the C sense) of 10 bytes each. They are not "strings"; they are raw data that may be interpreted as text.

The string '1 0 0 0 0\n', when encoded to bytes using the default encoding, uses 10 bytes. So do all the other strings in short_list. Therefore, "array of 10 bytes" is a suitable dtype.

calling with the long list inserts several null characters \x00 at the end of each string which makes it harder to parse.

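It does not insert "null characters"; it inserts null bytes (with a numeric value of 0). It does this because it needs 15 bytes to store the encoded representation of '14 0 0.5 0.5 0\n', and every element must have the same size.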
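To illustrate, here is a small sketch (the two-element arrays and variable names are made up for the example) showing that the element width follows the longest string in the array, not the number of elements:

```python
import numpy as np

# Both strings are 10 bytes long, so each element is 10 bytes wide.
short = np.array(['1 0 0 0 0\n', '2 1 0 0 0\n']).astype('S')

# One string needs 15 bytes, so every element becomes 15 bytes wide.
mixed = np.array(['1 0 0 0 0\n', '14 0 0.5 0.5 0\n']).astype('S')

print(short.dtype, short.itemsize)
print(mixed.dtype, mixed.itemsize)

raw = mixed.tobytes()
print(len(raw))   # number of elements times the element width
print(raw[:15])   # the first element, including its null-byte padding
```

Adding a single long line to short_list would therefore be enough to change the dtype, regardless of how many short lines the list contains.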

Remember that the symbol 0 in your text is translated into a single byte whose numeric value is not zero. Its numeric value is 48.
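The difference can be checked directly in plain Python; the variable names below are only for illustration:

```python
digit = '0'.encode('ascii')  # the text symbol "0" as a single byte
null = b'\x00'               # a null byte, as used for padding

print(digit[0])       # 48
print(null[0])        # 0
print(digit == null)  # False
```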

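Again: all of these encoding and re-encoding steps are useless - you can just use the raw data in the file via .read() - and all that .readlines() is helping you with is determining the number of lines in the file.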
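A minimal sketch of working from the raw text, assuming the file ends with a newline and every line has the same number of columns. io.StringIO stands in for the real file and the sample lines are a subset of the question's data. Note that it uses float64 rather than int32, since values such as 0.5 cannot be parsed as integers.

```python
import io

import numpy as np

# Hypothetical stand-in for open(path).read()
text = ("1 0 0 0 0\n"
        "9 0.5 0.5 0 0\n"
        "14 0 0.5 0.5 0\n")
raw = io.StringIO(text).read()

# The line count comes from the newlines; readlines() is not needed for it.
n_lines = raw.count("\n")

# split() with no arguments splits on any whitespace, including newlines.
data = np.array(raw.split(), dtype=np.float64).reshape(n_lines, -1)
print(data.shape)
```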

But you don't want or need to do any of that.

The logic you want is built directly into Numpy. You should have found this out for yourself by using a search engine.

You can simply have Numpy load the file for you, and you should: numpy.loadtxt('myfile.txt').
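As a quick sketch, numpy.loadtxt also accepts a file-like object, so the call can be tried without a file on disk. The io.StringIO sample below is an assumption for illustration and stands in for 'myfile.txt'.

```python
import io

import numpy as np

# Hypothetical stand-in for 'myfile.txt', using three lines of the question's data.
sample = io.StringIO("1 0 0 0 0\n9 0.5 0.5 0 0\n14 0 0.5 0.5 0\n")

# Whitespace-separated columns are the default; the result is a 2-D float array.
data = np.loadtxt(sample)
print(data.shape)
print(data.dtype)
```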


Tags: python, arrays, numpy, dtype