How to correct the spaces between the columns while reading a text file?

I want to read data from a text file and write it to HDF5 format. But somehow, in the middle of the data file, the spaces between the columns disappear. A small part of the file looks like this:

Generated by trjconv : P/L=1/400 t=   0.00000
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
Generated by trjconv : P/L=1/400 t=   1000.00
11214
    1P1     aP1    1  80.48  35.36   4.25
    2P1     aP1    2  37.45   3.92   3.96
    3P2     aP2    3  18.53  -9.69   4.68
    4P2     aP2    4  55.39  74.34   4.60
    5P3     aP3    5  22.11  68.71   3.85
.
.
 9994LI     aLI 9994  24.60  41.14   5.32
 9995LI     aLI 9995  88.47  43.02   5.72
 9996LI     aLI 9996  18.98  40.60   5.56
 9997LI     aLI 9997  35.63  46.43   5.68
 9998LI     aLI 9998  33.81  52.15   5.41
 9999LI     aLI 9999  38.72  57.18   5.32
10000LI     aLI10000  29.36  47.12   5.55
10001LI     aLI10001  82.55  44.80   5.50
10002LI     aLI10002  42.52  51.00   5.19
10003LI     aLI10003  28.61  40.21   5.70
10004LI     aLI10004  38.16  42.85   5.33
..
..
..

The data is a collection of frames (the frame at t=1000 is shown above), and there are a million frames. As you can see toward the end of each frame, columns 2 and 3 run into each other. I want to restore the space between them while reading the data. Another problem is the repeated headers (Generated by...). How can I read and write them to the HDF5 file, since h5 files don't support strings? Is there a way to add them manually? Here is the code:

import h5py
import numpy as np

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
    ds= hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,)) 

    # Next read line 1 of .gro file
    f = open('xaa', 'r')
    data = f.readlines()
    ds.attrs["Source"]=data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0 
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False 
        else:    
            if row0+rows > ds.shape[0] :
                ds.resize((row0+rows,))
            ds[row0:row0+rows] = arr
            skip += rows
            row0 += rows

I can skip the first header, but how do I deal with the headers that follow? I can provide the line numbers of the headers if anyone needs them. The touching columns throw a ValueError:

 ValueError: Some errors were detected !
    Line #7 (got 5 columns instead of 6)
    Line #8 (got 5 columns instead of 6)
    Line #9 (got 5 columns instead of 6)

Answer updated on 2021-09-09
As requested in the comments, I added 2 new methods that use f.readline(). One slices each line with indices and the other uses the struct package to unpack the fields. struct should be faster, but I did not find a significant performance difference with my test file (75 time steps).

In addition, I modified the code to use a while True: loop and break at the end of the file. This avoids having to enter the number of time steps.

This is the answer I wrote for the problem you ran into with my answer to your previous question. This answer uses readlines() to read the data into a list. (That could be an issue with your large file; if so, the solution can be modified to read line by line with readline().) It slices the data out of each line with indices aligned to the field widths. Warning: reading 50e6 lines may take a while. Note: HDF5 supports strings (but h5py does not support NumPy Unicode strings).
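
As a side note (not used in the methods below): np.genfromtxt can also read fixed-width fields directly if you pass the field widths as the delimiter, which avoids the "got 5 columns instead of 6" error when columns run together. A minimal sketch, with the widths assumed from the 7/8/5/7/7/7 layout used below:

import numpy as np

gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

# delimiter given as a list of widths splits on fixed columns instead of
# whitespace; autostrip removes the padding from the string fields
# (widths and row counts assumed from the sample data above)
arr = np.genfromtxt('xaa.txt', skip_header=2, max_rows=20,
                    delimiter=[7, 8, 5, 7, 7, 7], dtype=gro_dt, autostrip=True)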

Method 1: Use f.readlines() and process the list.
Get the values by slicing each line with indices:

import h5py
import numpy as np

csv_file = 'xaa.txt' # data from link in question

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])
 
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# The values above are used as indices to slice line 
# into the following fields in the loop on data[]: 
# [:7], [7:15], [15:20], [20:27], [27:34], [34:]

# Open the file for reading and
# create an empty .h5 file with the dtype above   
with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:

    data = f.readlines()
    skip = 0
    step = 0
    while True:
        # Read text header line for THIS time step
        if skip == len(data):
            print("End Of File")
            break
        else:
            header = data[skip]
            print(header)
            skip += 1
        
        # get number of data rows
        no_rows = int(data[skip]) 
        skip += 1
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row, line in enumerate(data[skip:skip+no_rows]):
            arr[row]['col1'] = line[:c1].strip()            
            arr[row]['col2'] = line[c1:c2].strip()            
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])
            
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        # increment by rows plus footer line that follows
        skip += 1 + no_rows
        # advance the dataset counter so the next time step gets a new name
        step += 1
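
To check the result, each time step can be read back as a structured array. A minimal sketch (dataset and attribute names as created above):

import h5py

with h5py.File('xaa.h5', 'r') as hdf:
    for name in sorted(hdf.keys()):    # dataset_0000, dataset_0001, ...
        ds = hdf[name]
        print(name, 'time =', ds.attrs['Time'], 'rows =', ds.shape[0])
        print(ds[:5]['col4'])          # first 5 values of col4 in this time step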

Method 2: Use f.readline() to read line by line.
Get the values by slicing each line with indices:

import h5py
import numpy as np

csv_file = 'xaa.txt'

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# Open the file for reading and
# create an empty .h5 file with the dtype above   
with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:

    step = 0
    while True:
        # Read text header line for THIS time step
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline()) 
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            line = f.readline()
            arr[row]['col1'] = line[:c1].strip()            
            arr[row]['col2'] = line[c1:c2].strip()            
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])
            
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1

Method 3: Use f.readline() to read line by line.
Unpack the values from each line with the struct package:

import struct
import numpy as np
import h5py

csv_file = 'xaa.txt'

fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:
         
    step = 0
    while True:         
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            fields = parse( f.readline().encode('utf-8') )
            arr[row]['col1'] = fields[0].strip()            
            arr[row]['col2'] = fields[1].strip()            
            arr[row]['col3'] = int(fields[2])
            arr[row]['col4'] = float(fields[3])
            arr[row]['col5'] = float(fields[4])
            arr[row]['col6'] = float(fields[5])
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds= hdf.create_dataset(f'dataset_{step:04}', data=arr) 
            #create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1
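
Whichever method you use, the output file holds one dataset per time step. To pull a quantity across all time steps later (for example col4 of the first row), a sketch along these lines should work (assuming the file, dataset, and attribute names created above):

import h5py
import numpy as np

with h5py.File('xaa.h5', 'r') as hdf:
    names = sorted(hdf.keys())                      # dataset_0000, dataset_0001, ...
    times = np.array([float(hdf[n].attrs['Time']) for n in names])
    traj = np.array([hdf[n][0]['col4'] for n in names])   # col4 of row 0 per step
print(times[:5])
print(traj[:5])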