Loading a dataset in Python (numpy) when a variable number of spaces delimits the columns

I have a large dataset of numeric data. In some of its rows the columns are separated by a variable number of spaces, for example:

4 5 6
7  8    9
2 3 4

When I use this line:

dataset=numpy.loadtxt("dataset.txt", delimiter=" ")

I get this error:

ValueError: Wrong number of columns at line 2

How can I change the code so that a run of several spaces is treated as a single delimiter?

The default value of delimiter is "any whitespace". If you leave the delimiter argument out of the loadtxt call, it handles multiple spaces.

>>> from io import StringIO
>>> dataset = StringIO('''\
... 4 5 6
... 7 8     9
... 2 3 4''')
>>> import numpy
>>> dataset_as_numpy = numpy.loadtxt(dataset)
>>> dataset_as_numpy
array([[ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [ 2.,  3.,  4.]])
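To see why delimiter=" " trips over the second row, it may help to compare how plain Python string splitting behaves. This is only an illustration of the idea, not numpy's actual parsing code: splitting on one literal space produces an empty field for every extra space, while splitting on any whitespace collapses the run.

```python
line = "7  8    9"  # the problematic row from the question

# Splitting on exactly one space leaves empty strings between consecutive spaces.
fields_single = line.split(" ")

# Splitting with no argument treats any run of whitespace as one separator.
fields_any = line.split()

print(len(fields_single))  # more than 3 fields, hence a column-count mismatch
print(fields_any)          # the three values we actually want
```

With the explicit single-space delimiter the row appears to have more columns than its neighbours, which is consistent with the "Wrong number of columns" error above.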

Use the numpy.genfromtxt function:

>>> import numpy as np
>>> dataset = np.genfromtxt("dataset.txt")
>>> dataset
array([[   4.,    5.,    6.],
       [   7.,    8.,   19.],
       [   2.,    3.,    4.],
       [   1.,    3.,  204.]])

This is from the numpy documentation:

By default, genfromtxt assumes delimiter=None, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
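A small self-contained check of that behaviour, as a sketch: the in-memory text below is made up for illustration (it mixes single spaces, runs of spaces, and tabs) and stands in for a real file such as dataset.txt.

```python
from io import StringIO

import numpy as np

# Hypothetical sample data: columns separated by a mix of spaces and tabs.
text = "4 5 6\n7  8\t\t9\n2\t3    4\n"

# No delimiter argument, so genfromtxt splits on any run of whitespace.
data = np.genfromtxt(StringIO(text))

print(data.shape)
print(data)
```

Every row should parse into the same three columns even though the separators differ from line to line.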

Hope this helps!