String formatting issue (parentheses vs. underscores)

Question

I have a text file containing all of my data:

data = 'B:/tempfiles/bla.dat'

I have listed the column headers from the text file together with their types:

col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]

Then I create a dictionary of options:

kwargs = dict(delimiter=',',\
              deletechars=' ',\
              dtype=col_headers,\
              skip_header=4,\
              skip_footer=0,\
              filling_values='NaN',\
              missing_values={'\"NAN\"'}\
              )

and import the data into the variable datafile:

datafile = scipy.genfromtxt(data, **kwargs)

Then I assign the data with:

VW1 = datafile['VW_3_Avg']
Lv1 = datafile['Lvl_Max(1)']

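This works for the first field (which contains underscores) but not for the second (which contains parentheses). I get an error, not only for this entry but for every entry that contains parentheses: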

ValueError: field named Lvl_Max(1) not found

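When I change the parentheses to underscores in the text file, it works fine. But I can't tell why it won't let me use parentheses, and I can't change the text file format because the file is generated externally. Of course I could replace the parentheses with underscores in a script, but I think handling this correctly shouldn't be a big deal. Where and why am I missing the correct formatting in this case?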

Answer 1

When you run into problems with genfromtxt, the first thing you should do is print the shape and dtype of the result.

Why do you have to use () in col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]?

Is it because the file has those names in its header?

If you supply your own dtype and use skip_header, what is in the file's header doesn't matter. What matters are the field names in the dtype, not the ones in the file.
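To illustrate the point, here is a minimal sketch with a hypothetical in-memory sample (the header names colA/colB and the field names x/y are made up for the example). The header line is skipped, so the field names come from the dtype only, and printing shape and dtype.names shows exactly what can be indexed:

```python
import io
import numpy as np

# Hypothetical two-column sample; the header names deliberately differ
# from the dtype field names.
sample = io.StringIO("colA,colB\n1.0,2.0\n3.0,4.0\n")

arr = np.genfromtxt(
    sample,
    delimiter=",",
    skip_header=1,                       # the header line is skipped, never parsed
    dtype=[("x", "<f8"), ("y", "<f8")],  # field names come from here
)

print(arr.shape)        # one record per data line
print(arr.dtype.names)  # the names that are actually available for indexing
```

Checking dtype.names first tells you which keys will work before you try to index the array.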

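We could dig into the dtype documentation to find out which characters are allowed. Field names that are valid Python variable names are certainly fine. I wouldn't be surprised if () were forbidden or caused problems, though I haven't tested that.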

Actually, 'Lvl_Max(1)' is acceptable as a dtype field name:

In [235]: col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]
In [236]: A=np.zeros((3,),dtype=col_headers)
In [237]: A
Out[237]: 
array([(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max(1)', '<f8')])
In [238]: A['Lvl_Max(1)']
Out[238]: array([ 0.,  0.,  0.])

You should have shown us datafile.shape and datafile.dtype from the start. 90% of these genfromtxt problems stem from a misunderstanding of what the function returns.

Let's try a simple file read with this dtype:

In [239]: txt=b"""1 2
   .....: 3 4
   .....: 5 6
   .....: """
In [240]: np.genfromtxt(txt.splitlines(),dtype=col_headers)
Out[240]: 
array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max1', '<f8')])

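Look at the dtype. genfromtxt has stripped the parentheses, turning 'Lvl_Max(1)' into 'Lvl_Max1'. It looks like genfromtxt 'sanitizes' field names, no doubt because names in text files can contain all sorts of odd characters.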

From the genfromtxt documentation:

Numpy arrays with a structured dtype can also be viewed as recarray, where a field can be accessed as if it were an attribute. For that reason, we may need to make sure that the field name doesn’t contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter.

genfromtxt accepts a deletechars parameter that lets you control which characters are removed from field names. But it is applied inconsistently.

In [282]: np.genfromtxt(txt.splitlines(),names=np.dtype(col_headers).names,deletechars=set(b' '),dtype=None)
Out[282]: 
array([(1, 2), (3, 4), (5, 6)], 
      dtype=[('VW_3_Avg', '<i4'), ('Lvl_Max(1)', '<i4')])

In [283]: np.genfromtxt(txt.splitlines(),names=np.dtype(col_headers).names,deletechars=set(b' '))
Out[283]: 
array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max1', '<f8')])

Only the call with dtype=None honors deletechars and keeps the parentheses.

The default set of deleted characters is large:

defaultdeletechars = set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")

The problem is that deletechars is passed to the validator:

validate_names = NameValidator(...
                               deletechars=deletechars,...)

which is used to clean the names taken from the header and the names parameter. But the names (and dtype) are then passed through

dtype = easy_dtype(dtype, defaultfmt=defaultfmt, names=names)

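which is called without the deletechars parameter, so the names in an explicit dtype are still sanitized with the defaults. This issue was addressed about a year ago in https://github.com/numpy/numpy/pull/4649, so it may be fixed in the newest version(s).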
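If the installed version still strips the parentheses, one possible workaround is to restore the intended names after loading. This is only a sketch: the in-memory sample is a hypothetical stand-in for the real file, and it assumes the columns come back in the same order as col_headers. It uses rename_fields from numpy.lib.recfunctions and builds the mapping from whatever names genfromtxt actually returned, so it also works when nothing was stripped:

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

col_headers = [("VW_3_Avg", "<f8"), ("Lvl_Max(1)", "<f8")]

# Hypothetical stand-in for the externally generated data file.
sample = io.StringIO("1,2\n3,4\n5,6\n")

loaded = np.genfromtxt(sample, delimiter=",", dtype=col_headers)

# Map each name genfromtxt produced (possibly sanitized) back to the
# name we asked for; this relies on the field order being unchanged.
wanted = [name for name, _ in col_headers]
mapping = dict(zip(loaded.dtype.names, wanted))
restored = rfn.rename_fields(loaded, mapping)

Lv1 = restored["Lvl_Max(1)"]  # indexing with parentheses now works
```

The text file itself is left untouched; only the field names of the loaded array change.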

Answer 2

The behavior is documented. The NameValidator class in lib/_iotools.py processes the names passed to genfromtxt:

class NameValidator(object):
    """
    Object to validate a list of strings to use as field names.
    The strings are stripped of any non alphanumeric character, and spaces
    are replaced by '_'. During instantiation, the user can define a list
    of names to exclude, as well as a list of invalid characters. Names in
    the exclusion list are appended a '_' character.
    Once an instance has been created, it can be called with a list of
    names, and a list of valid names will be created.  The `__call__`
    method accepts an optional keyword "default" that sets the default name
    in case of ambiguity. By default this is 'f', so that names will
    default to `f0`, `f1`, etc.

The relevant line in your case is "The strings are stripped of any non alphanumeric character".

You can see the behavior by calling NameValidator.validate on a list of names that contain other non-alphanumeric characters:

In [17]: from numpy.lib._iotools import NameValidator

In [18]: l = ["foo(1)","bar!!!","foo bar??"]

In [19]: NameValidator().validate(l)
Out[19]: ('foo1', 'bar', 'foo_bar')

The same happens with genfromtxt:

In [24]: datafile = np.genfromtxt("foo.txt", dtype=[('foo!! bar??', '<f8'), ('foo bar bar$', '<f8')], delimiter=",",defaultfmt="%")

In [25]: datafile.dtype
Out[25]: dtype([('foo_bar', '<f8'), ('foo_bar_bar', '<f8')])
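To check in advance how your own column names will be sanitized, you can run them through the validator directly. The sketch below uses the names from the question and compares the default validator with one that is told to delete only spaces. Note that numpy.lib._iotools is a private module, so treat this as a diagnostic aid rather than a stable API:

```python
from numpy.lib._iotools import NameValidator

names = ["VW_3_Avg", "Lvl_Max(1)"]

# Default validator: every character in the default delete set is removed,
# which includes "(" and ")".
default_names = NameValidator().validate(names)

# Validator restricted to deleting spaces: the parentheses survive.
kept_names = NameValidator(deletechars=" ").validate(names)

print(default_names)
print(kept_names)
```

Whether genfromtxt applies a custom deletechars to an explicit dtype depends on the NumPy version, so it is still worth checking datafile.dtype.names on the loaded array.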
