Using numpy's genfromtxt to load a triangular matrix with python

Question

I have a text file containing an upper 'triangular' matrix, with the lower values omitted (here is an example):

3 5 3 5 1 8 1 6 5 8
5 8 1 1 6 2 9 6 4
2 0 5 2 1 0 0 3
2 2 5 1 0 1 0
1 3 6 3 6 1
4 2 4 3 7
4 0 0 1
0 1 8
2 1
1

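Since the file in question is about 10000 lines long, I was wondering whether there is a 'smart' way to generate a numpy matrix from it, for example with the genfromtxt function. Using it directly, however, throws an error along the lines of Line #12431 (got 6 columns instead of 12437), and filling_values does not help because there is no way to designate placeholders for the values that were omitted.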

Right now I have to resort to manually opening and closing the file:

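import numpy as np

def load_updiag(filename, size):
    output = np.zeros((size, size))
    f = open(filename)  # opened by hand, closed again below
    line_count = 0
    for line in f:
        data = line.split()
        # Row i of the file holds columns i..size-1 of the matrix.
        output[line_count, line_count:size] = data
        line_count += 1
    f.close()
    return output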

I feel this may not scale well to large files. Is there a way to properly use genfromtxt (or any other optimized function from the numpy library) on such matrices?

Answer 1

You can read the raw data from the file into a string and then use np.fromstring to get a 1-d array of the upper triangular part of the matrix:

with open('data.txt') as data_file:
    data = data_file.read()

arr = np.fromstring(data, sep=' ')
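If only the flat array is at hand, the matrix size can be recovered from its length, because an n x n upper triangle that includes the diagonal has n(n+1)/2 entries. A minimal sketch of that check follows; the helper name triangular_size is made up for illustration and is not part of numpy:

```python
import math

def triangular_size(num_entries):
    """Return n such that n*(n+1)/2 == num_entries (hypothetical helper)."""
    # Positive root of n^2 + n - 2*m = 0, computed with exact integer math.
    n = (math.isqrt(8 * num_entries + 1) - 1) // 2
    if n * (n + 1) // 2 != num_entries:
        raise ValueError("length is not a triangular number")
    return n

print(triangular_size(55))  # 10 for the sample data in the question
```

A ValueError here would suggest that the file is truncated or contains extra tokens.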

Alternatively, you can define a generator that reads the file one line at a time and use np.fromiter to build a 1-d array from that generator:

def iter_data(path):
    with open(path) as data_file:
        for line in data_file:
            yield from line.split()

arr = np.fromiter(iter_data('data.txt'), int)

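If you know the size of the matrix (which you can determine from the first line of the file), you can pass the count keyword argument to np.fromiter so that the function pre-allocates exactly the right amount of memory, which will be faster. That is what these functions do: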

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        n = len(fileobj.readline().split())

    count = n * (n + 1) // 2

    with open(path) as fileobj:
        return np.fromiter(iter_data(fileobj), int, count=count)

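This "wastes" a little work, since it opens the file twice just to read the first line and get the number of entries. An "improvement" would be to keep the first line and chain it with the iterator over the rest of the file, as in this code: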

from itertools import chain

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        first = fileobj.readline().split()
        n = len(first)
        count = n * (n + 1) // 2
        data = chain(first, iter_data(fileobj))
        return np.fromiter(data, int, count=count)
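To try the single-pass reader on the sample data from the question, something like the following could be used. It is a self-contained sketch: the temporary file, the SAMPLE constant and the explicit int() conversion of each token are additions for illustration, not part of the code above.

```python
import os
import tempfile
from itertools import chain

import numpy as np

SAMPLE = """\
3 5 3 5 1 8 1 6 5 8
5 8 1 1 6 2 9 6 4
2 0 5 2 1 0 0 3
2 2 5 1 0 1 0
1 3 6 3 6 1
4 2 4 3 7
4 0 0 1
0 1 8
2 1
1
"""

def read_triangular_array(path):
    with open(path) as fileobj:
        first = fileobj.readline().split()
        n = len(first)
        count = n * (n + 1) // 2
        rest = (tok for line in fileobj for tok in line.split())
        # Convert tokens explicitly so the result does not depend on how
        # NumPy treats strings handed to fromiter.
        values = map(int, chain(first, rest))
        return np.fromiter(values, dtype=int, count=count)

# Write the sample to a temporary file, read it back, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)
try:
    arr = read_triangular_array(tmp.name)
finally:
    os.remove(tmp.name)

print(arr.shape)  # (55,)
```

With count given, np.fromiter raises an error if the file supplies fewer values than expected, which doubles as a cheap sanity check on the input.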

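All of these approaches yield the same values (shown here as floats, the default dtype of np.fromstring; the np.fromiter variants called with int return integers):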

>>> arr
array([ 3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.,  5.,  8.,  1.,
        1.,  6.,  2.,  9.,  6.,  4.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,
        3.,  2.,  2.,  5.,  1.,  0.,  1.,  0.,  1.,  3.,  6.,  3.,  6.,
        1.,  4.,  2.,  4.,  3.,  7.,  4.,  0.,  0.,  1.,  0.,  1.,  8.,
        2.,  1.,  1.])

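This compact representation might be all you need, but if you want the full square matrix you can allocate a zeroed matrix of the right size and copy arr into it using np.triu_indices_from, or you can use scipy.spatial.distance.squareform. Note that squareform treats arr as the strictly upper triangle of a symmetric matrix with a zero diagonal, so the 55 values become an 11x11 matrix rather than the 10x10 matrix with a diagonal that load_updiag builds: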

>>> from scipy.spatial.distance import squareform
>>> squareform(arr)
array([[ 0.,  3.,  5.,  3.,  5.,  1.,  8.,  1.,  6.,  5.,  8.],
       [ 3.,  0.,  5.,  8.,  1.,  1.,  6.,  2.,  9.,  6.,  4.],
       [ 5.,  5.,  0.,  2.,  0.,  5.,  2.,  1.,  0.,  0.,  3.],
       [ 3.,  8.,  2.,  0.,  2.,  2.,  5.,  1.,  0.,  1.,  0.],
       [ 5.,  1.,  0.,  2.,  0.,  1.,  3.,  6.,  3.,  6.,  1.],
       [ 1.,  1.,  5.,  2.,  1.,  0.,  4.,  2.,  4.,  3.,  7.],
       [ 8.,  6.,  2.,  5.,  3.,  4.,  0.,  4.,  0.,  0.,  1.],
       [ 1.,  2.,  1.,  1.,  6.,  2.,  4.,  0.,  0.,  1.,  8.],
       [ 6.,  9.,  0.,  0.,  3.,  4.,  0.,  0.,  0.,  2.,  1.],
       [ 5.,  6.,  0.,  1.,  6.,  3.,  0.,  1.,  2.,  0.,  1.],
       [ 8.,  4.,  3.,  0.,  1.,  7.,  1.,  8.,  1.,  1.,  0.]])
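
The first option mentioned above, filling a zero matrix through the upper-triangle indices, might look like the following sketch. It uses a tiny made-up array instead of the question's data, and np.triu_indices(n), which yields the same index pairs as np.triu_indices_from applied to an n x n array; the variable names are illustrative.

```python
import numpy as np

# Small stand-in for the flat array: the upper triangle (diagonal included)
# of a 3x3 matrix, listed row by row.
arr = np.array([1, 2, 3, 4, 5, 6])
n = 3  # assumed known, e.g. from the length of the file's first line

full = np.zeros((n, n), dtype=arr.dtype)
# np.triu_indices returns the upper-triangle positions in row-major order,
# which matches the order in which the file lists the entries.
full[np.triu_indices(n)] = arr
print(full)
# [[1 2 3]
#  [0 4 5]
#  [0 0 6]]

# If a symmetric matrix is wanted, mirror the strict upper triangle.
symmetric = full + np.triu(full, k=1).T
```

Unlike squareform, this keeps the diagonal from the file, so the result has the same n x n shape that the question's load_updiag produces.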
