How to correct the spaces between the columns while reading a text file?
I want to read data from a text file and write it out in HDF5 format. But somehow, in the middle of the data file, the spaces between the columns disappear. A small part of the file looks like this:
Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
Generated by trjconv : P/L=1/400 t= 1000.00
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
.
.
9994LI aLI 9994 24.60 41.14 5.32
9995LI aLI 9995 88.47 43.02 5.72
9996LI aLI 9996 18.98 40.60 5.56
9997LI aLI 9997 35.63 46.43 5.68
9998LI aLI 9998 33.81 52.15 5.41
9999LI aLI 9999 38.72 57.18 5.32
10000LI aLI10000 29.36 47.12 5.55
10001LI aLI10001 82.55 44.80 5.50
10002LI aLI10002 42.52 51.00 5.19
10003LI aLI10003 28.61 40.21 5.70
10004LI aLI10004 38.16 42.85 5.33
..
..
..
The data is a collection of frames like the one at t=1000, and there are a million frames. As you can see toward the end of a frame, column 2 and column 3 run into each other. I want to put a space between them while reading the data. The other problem is the repeated headers (Generated by ...). How can I read and write those to the HDF5 file, since h5 files don't support strings? Is there a way to add them manually? Here is the code:
import h5py
import numpy as np

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

# Next, create an empty .h5 file with the dtype
with h5py.File('xaa.h5', 'w') as hdf:
    ds = hdf.create_dataset('dataset1', dtype=gro_dt, shape=(20,), maxshape=(None,))

    # Next read line 1 of .gro file
    f = open('xaa', 'r')
    data = f.readlines()
    ds.attrs["Source"] = data[0]
    f.close()

    # loop to read rows from 2 until end
    skip, incr, row0 = 2, 20, 0
    read_gro = True
    while read_gro:
        arr = np.genfromtxt('xaa', skip_header=skip, max_rows=incr, dtype=gro_dt)
        rows = arr.shape[0]
        if rows == 0:
            read_gro = False
        else:
            if row0 + rows > ds.shape[0]:
                ds.resize((row0 + rows,))
            ds[row0:row0+rows] = arr
            skip += rows
            row0 += rows
I can skip the first header, but how do I handle the headers that follow? I can provide the line numbers of the headers if anyone needs them. The merged columns throw a ValueError:
ValueError: Some errors were detected !
Line #7 (got 5 columns instead of 6)
Line #8 (got 5 columns instead of 6)
Line #9 (got 5 columns instead of 6)
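(As an aside: the "got 5 columns instead of 6" errors happen because genfromtxt splits on whitespace, so the touching columns count as one field. np.genfromtxt can also split on fixed field widths when delimiter is given as a sequence of column widths; a minimal sketch, assuming the widths 7, 8, 5, 7, 7, 7 used in the answer below:)

import numpy as np

# Minimal sketch: split the fixed-width columns by position instead of by
# whitespace, so touching columns no longer merge into one field.
# The field widths (7, 8, 5, 7, 7, 7) are an assumption taken from the
# slicing indices used in the answer below.
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

arr = np.genfromtxt('xaa', skip_header=2, max_rows=20,
                    delimiter=[7, 8, 5, 7, 7, 7],
                    dtype=gro_dt, autostrip=True)
print(arr[:3])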
Answer updated on 2021-09-09:
As requested in the comments, I added 2 new methods that use f.readline(). One slices each line with indices, the other uses the struct package to unpack the fields. struct should be faster, but I did not see a significant performance difference with the test file (75 time steps).
In addition, I modified the code to use a while True: loop and to break at the end of the file. This avoids having to enter the number of time steps.
This is the answer I wrote for the problem you ran into with the answer to your previous question. (Reference: .) This answer uses readlines() to read the data into a list. (That could be a problem for your large file. If so, the solution can be modified to read line by line with readline().) It slices each line's data with indices aligned to the field widths. Warning: reading 50e6 lines may take a while. Note: HDF5 does support strings (but h5py does not support NumPy Unicode strings).
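To make the index mapping concrete, here is a minimal sketch that slices one data line at those positions. The line below assumes the original fixed-width spacing (consecutive spaces are collapsed in the sample shown in the question); the positions 7, 15, 20, 27, 34 are the ones used in the methods:

# Minimal sketch: slice one assumed fixed-width line at the column boundaries
line = "10000LI     aLI10000  29.36  47.12   5.55"
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
print(line[:c1].strip())    # '10000LI'
print(line[c1:c2].strip())  # 'aLI'
print(int(line[c2:c3]))     # 10000
print(float(line[c3:c4]))   # 29.36
print(float(line[c4:c5]))   # 47.12
print(float(line[c5:]))     # 5.55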
Method 1: Use f.readlines() and process the list.
Get the values by slicing each line with indices:
import h5py
import numpy as np

csv_file = 'xaa.txt'  # data from link in question

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34
# The values above are used as indices to slice line
# into the following fields in the loop on data[]:
# [:7], [7:15], [15:20], [20:27], [27:34], [34:]

# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:

    data = f.readlines()
    skip = 0
    step = 0
    while True:
        # Read text header line for THIS time step
        if skip == len(data):
            print("End Of File")
            break
        else:
            header = data[skip]
            print(header)
            skip += 1

        # get number of data rows
        no_rows = int(data[skip])
        skip += 1

        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row, line in enumerate(data[skip:skip+no_rows]):
            arr[row]['col1'] = line[:c1].strip()
            arr[row]['col2'] = line[c1:c2].strip()
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])

        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds = hdf.create_dataset(f'dataset_{step:04}', data=arr)
            # create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]

        # increment by rows plus footer line that follows,
        # and move to the next time step
        skip += 1 + no_rows
        step += 1
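To verify the result, a minimal sketch that reads the file written above back with standard h5py calls and prints each time step's shape and attributes (the dataset names follow the dataset_{step:04} pattern used here):

import h5py

# Minimal sketch: re-open the HDF5 file written above and inspect it
with h5py.File('xaa.h5', 'r') as hdf:
    for name in sorted(hdf.keys()):        # dataset_0000, dataset_0001, ...
        ds = hdf[name]
        print(name, ds.shape, ds.dtype)
        print('  Time =', ds.attrs['Time'], ' P/L =', ds.attrs['P/L'])
        print('  first row:', ds[0])       # one record of the compound dtype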
Method 2: Use f.readline() to read line by line.
Get the values by slicing each line with indices:
import h5py
import numpy as np

csv_file = 'xaa.txt'

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

## gro_fmt=[0:7], [7:15], [15:20], [20:27], [27:34], [34:41]
c1, c2, c3, c4, c5 = 7, 15, 20, 27, 34

# Open the file for reading and
# create an empty .h5 file with the dtype above
with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:

    step = 0
    while True:
        # Read text header line for THIS time step
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())

        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            line = f.readline()
            arr[row]['col1'] = line[:c1].strip()
            arr[row]['col2'] = line[c1:c2].strip()
            arr[row]['col3'] = int(line[c2:c3])
            arr[row]['col4'] = float(line[c3:c4])
            arr[row]['col5'] = float(line[c4:c5])
            arr[row]['col6'] = float(line[c5:])

        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds = hdf.create_dataset(f'dataset_{step:04}', data=arr)
            # create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]

        # skip the footer line and move to the next time step
        footer = f.readline()
        step += 1
Method 3: Use f.readline() to read line by line.
Unpack the values from each line with the struct package:
import struct
import numpy as np
import h5py

csv_file = 'xaa.txt'

# fixed-width field layout: 7, 8, 5, 7, 7, 7 characters
fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

# define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])

with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:

    step = 0
    while True:
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())

        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            fields = parse(f.readline().encode('utf-8'))
            arr[row]['col1'] = fields[0].strip()
            arr[row]['col2'] = fields[1].strip()
            arr[row]['col3'] = int(fields[2])
            arr[row]['col4'] = float(fields[3])
            arr[row]['col5'] = float(fields[4])
            arr[row]['col6'] = float(fields[5])

        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds = hdf.create_dataset(f'dataset_{step:04}', data=arr)
            # create attributes for this dataset / time step
            hdr_tokens = header.split()
            ds.attrs['raw_header'] = header
            ds.attrs['Generated by'] = hdr_tokens[2]
            ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]

        # skip the footer line and move to the next time step
        footer = f.readline()
        step += 1
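Finally, on the question's worry that "h5 files don't support strings": besides the per-dataset attributes written above, the raw headers could also be collected into their own variable-length string dataset. A minimal sketch, assuming a reasonably recent h5py (string_dtype() exists since h5py 2.9) and using the two sample headers from the question:

import h5py

# Minimal sketch: store the raw header of every time step in one
# variable-length string dataset (headers would be collected in the loop;
# the two literals below are just the sample headers from the question).
headers = ['Generated by trjconv : P/L=1/400 t= 0.00000',
           'Generated by trjconv : P/L=1/400 t= 1000.00']

with h5py.File('xaa.h5', 'a') as hdf:
    hdf.create_dataset('headers', data=headers, dtype=h5py.string_dtype())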