How to import data with identical column names using np.genfromtxt?

Question

I have data in a file data.dat in the following form:

column_1    col col col col col
1   2   3   1   2   3
4   3   2   3   2   4
1   4   3   1   4   3
5   6   4   5   6   4

I am trying to import it with np.genfromtxt so that all the data from the columns named col is stored in the variable y. I tried this code:

import numpy as np
data = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True).transpose()
y = data['col']

But it gives me the following error:

ValueError: two fields with the same name

How can I solve this in Python?

Answer 1

When you use names=True, np.genfromtxt returns a structured array. Note that the columns labeled col in data.dat are disambiguated into column names of the form col_n:

In [114]: arr = np.genfromtxt('data.dat', comments='#', delimiter='\t', dtype=None, names=True)

In [115]: arr
Out[115]: 
array([(1, 2, 3, 1, 2, 3), (4, 3, 2, 3, 2, 4), (1, 4, 3, 1, 4, 3),
       (5, 6, 4, 5, 6, 4)], 
      dtype=[('column_1', '<i8'), ('col', '<i8'), ('col_1', '<i8'), ('col_2', '<i8'), ('col_3', '<i8'), ('col_4', '<i8')])
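The disambiguated names can also be inspected programmatically through arr.dtype.names. The following is a minimal sketch of what selecting the col data from a structured array involves; it assumes a tab-delimited file, uses io.StringIO as an in-memory stand-in for data.dat, and assumes that no column in the file was originally called col_1, col_2 and so on (the names text, col_fields and y are illustrative only):

```python
import io
import numpy as np

# In-memory stand-in for data.dat (assumption: the real file is tab-delimited).
text = (
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)

arr = np.genfromtxt(io.StringIO(text), comments='#', delimiter='\t',
                    dtype=None, names=True)
print(arr.dtype.names)

# Collect the field "col" and its disambiguated siblings "col_1", "col_2", ...
# Note that "column_1" does not start with "col_", so it is excluded.
col_fields = [n for n in arr.dtype.names if n == 'col' or n.startswith('col_')]

# Each field is a 1-D array; stack them side by side into a 2-D array.
y = np.column_stack([arr[n] for n in col_fields])
print(y.shape)
```

This works, but it relies on guessing the renamed fields from their prefix, which is fragile if the file has other columns with similar names.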

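So once you use names=True, it becomes awkward to select all of the data associated with the column name col. In addition, a structured array is one-dimensional, so you cannot slice several columns positionally (for example with arr[:, 1:]). It is therefore more convenient to load the data into an array with a homogeneous dtype (which is what you get without names=True):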

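# Open in text mode so the header line is a str and can be split on '\t'
# (in Python 3, a file opened with 'rb' yields bytes, and split('\t') fails).
with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)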

Then you can find the numeric indices of the columns named col:

idx = [i for i, col in enumerate(header) if col=='col']

and select all of the data:

y = arr[:, idx]

For example,

import numpy as np

with open('data.dat', 'r') as f:
    header = f.readline().strip().split('\t')
    arr = np.genfromtxt(f, comments='#', delimiter='\t', dtype=None)
    idx = [i for i, col in enumerate(header) if col=='col']
    y = arr[:, idx]
    print(y)

yields

[[2 3 1 2 3]
 [3 2 3 2 4]
 [4 3 1 4 3]
 [6 4 5 6 4]]

If you want y to be one-dimensional, you can use ravel():

print(y.ravel())

yields

[2 3 1 2 3 3 2 3 2 4 4 3 1 4 3 6 4 5 6 4]
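To try the approach without creating a file, the steps above can be wrapped in a small function and fed an in-memory buffer. This is only a sketch: load_columns_named and SAMPLE are hypothetical names, the data is assumed to be tab-delimited, and the file is assumed to hold at least two rows and two columns so that np.genfromtxt returns a 2-D array.

```python
import io
import numpy as np

# In-memory stand-in for data.dat (assumption: the real file is tab-delimited).
SAMPLE = (
    "column_1\tcol\tcol\tcol\tcol\tcol\n"
    "1\t2\t3\t1\t2\t3\n"
    "4\t3\t2\t3\t2\t4\n"
    "1\t4\t3\t1\t4\t3\n"
    "5\t6\t4\t5\t6\t4\n"
)


def load_columns_named(source, name, delimiter='\t'):
    """Hypothetical helper: return all columns whose header equals `name`.

    `source` is a text-mode file object positioned at the header line.
    """
    # Read the header ourselves so duplicate names are kept as-is.
    header = source.readline().rstrip('\r\n').split(delimiter)
    # The remaining lines are plain numbers, so a homogeneous 2-D array results
    # (assumes at least two data rows and two columns).
    arr = np.genfromtxt(source, comments='#', delimiter=delimiter, dtype=None)
    idx = [i for i, col in enumerate(header) if col == name]
    return arr[:, idx]


y = load_columns_named(io.StringIO(SAMPLE), 'col')
print(y)
```

With a real file, the same function could be called as load_columns_named(f, 'col') inside a with open('data.dat', 'r') as f: block.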
