How to read complex data in Python?
I am trying to read data that is not well structured. It looks like this:
Generated by trjconv : P/L=1/400 t= 0.00000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
6P3 aP3 6 -4.13 24.04 3.73
7P4 aP4 7 40.16 6.39 4.73
8P4 aP4 8 -5.40 35.73 4.85
9P5 aP5 9 36.67 22.45 4.08
10P5 aP5 10 -3.68 -10.66 4.18
Generated by trjconv : P/L=1/400 t= 1000.000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
6P3 aP3 6 -4.13 24.04 3.73
7P4 aP4 7 40.16 6.39 4.73
8P4 aP4 8 -5.40 35.73 4.85
9P5 aP5 9 36.67 22.45 4.08
10P5 aP5 10 -3.68 -10.66 4.18
Generated by trjconv : P/L=1/400 t= 2000.000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
6P3 aP3 6 -4.13 24.04 3.73
7P4 aP4 7 40.16 6.39 4.73
8P4 aP4 8 -5.40 35.73 4.85
9P5 aP5 9 36.67 22.45 4.08
10P5 aP5 10 -3.68 -10.66 4.18
Generated by trjconv : P/L=1/400 t= 3000.000
11214
1P1 aP1 1 80.48 35.36 4.25
2P1 aP1 2 37.45 3.92 3.96
3P2 aP2 3 18.53 -9.69 4.68
4P2 aP2 4 55.39 74.34 4.60
5P3 aP3 5 22.11 68.71 3.85
6P3 aP3 6 -4.13 24.04 3.73
7P4 aP4 7 40.16 6.39 4.73
8P4 aP4 8 -5.40 35.73 4.85
9P5 aP5 9 36.67 22.45 4.08
10P5 aP5 10 -3.68 -10.66 4.18
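It consists of different frames, each with an updated time. What I show here is only a sample; the whole file is about 50 GB, so it is best read line by line or in chunks. However, I don't know how to handle the headers of each frame. Is there any way to get rid of these headers? For now I have used the following approach: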
import numpy as np
# Define a np.dtype for the gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S4'), ('col2', 'S4'), ('col3', int),
                   ('col4', float), ('col5', float), ('col6', float)])
file = np.genfromtxt('sample.gro', skip_header=2, dtype=gro_dt)
But it throws the following error when it reaches the next header:
ValueError: Some errors were detected !
Line #13 (got 7 columns instead of 6)
Line #14 (got 1 columns instead of 6)
Line #25 (got 7 columns instead of 6)
Line #26 (got 1 columns instead of 6)
Line #37 (got 7 columns instead of 6)
Line #38 (got 1 columns instead of 6)
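The reported line numbers match the second and later frames: `skip_header=2` only skips the first two lines of the file, so every later header and atom-count line is parsed as data. A quick check, using `str.split()` to mimic the default whitespace splitting of `genfromtxt` on lines copied from the sample above, appears to reproduce the column counts in the error:

```python
# Lines copied from the sample above: a frame header, an atom-count line, a data row.
header = "Generated by trjconv : P/L=1/400 t= 1000.000"
count_line = "11214"
data_row = "1P1 aP1 1 80.48 35.36 4.25"

# Count whitespace-separated fields on each kind of line.
column_counts = {
    "header": len(header.split()),
    "count_line": len(count_line.split()),
    "data_row": len(data_row.split()),
}
print(column_counts)  # {'header': 7, 'count_line': 1, 'data_row': 6}
```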
Write an adapter that strips the periodic headers.
def adapt(f):
    for line in f:
        if line.startswith("Generated"):
            print(line, end='')
            # Consume the following line as well.
            # If your data is well behaved, you can
            # assume the following line exists and should be
            # skipped, instead of using the try statement.
            try:
                print(next(f), end='')
            except StopIteration:
                pass
            continue
        yield line

with open('sample.gro') as f:
    file = np.genfromtxt(adapt(f), dtype=gro_dt)
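Because the full file is about 50 GB, passing the whole adapted stream to `genfromtxt` in one call will probably not fit in memory. One possible variation, sketched below, strips the headers silently and hands out the remaining lines in fixed-size batches. The names `strip_headers`, `iter_chunks`, and `chunk_size` are hypothetical, and the in-memory sample stands in for the real file.

```python
import io
from itertools import islice

def strip_headers(lines):
    """Yield data lines only, dropping each header and the line after it."""
    it = iter(lines)
    for line in it:
        if line.startswith("Generated"):
            next(it, None)  # also drop the atom-count line, if there is one
            continue
        yield line

def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size lines until the input is exhausted."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Two tiny frames standing in for the real file (assumption: same layout).
sample = io.StringIO(
    "Generated by trjconv : P/L=1/400 t= 0.00000\n"
    "11214\n"
    "    1P1    aP1    1  80.48  35.36   4.25\n"
    "    2P1    aP1    2  37.45   3.92   3.96\n"
    "Generated by trjconv : P/L=1/400 t= 1000.000\n"
    "11214\n"
    "    1P1    aP1    1  80.48  35.36   4.25\n"
    "    2P1    aP1    2  37.45   3.92   3.96\n"
)

chunks = list(iter_chunks(strip_headers(sample), chunk_size=3))
for chunk in chunks:
    print(len(chunk), "data lines")
```

Each chunk is a plain list of strings, which `np.genfromtxt(chunk, dtype=gro_dt)` accepts in place of a file name, so only one batch needs to be held in memory at a time.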
Since genfromtxt accepts a generator, maybe a converter function like this? (It keeps the t= value from each header as the first column.)
def converter(inf):
    current_t = None
    for line in inf:
        if "trjconv" in line:
            # Header line: remember the time value that follows "t=".
            current_t = line.partition("t=")[-1].strip()
        elif line.startswith(" "):
            # Data line (starts with a space): prepend the time as a new first column.
            yield current_t + line

gro_dt = np.dtype(
    [
        ("t", "float"),
        ("col1", "S4"),
        ("col2", "S4"),
        ("col3", int),
        ("col4", float),
        ("col5", float),
        ("col6", float),
    ]
)

with open("sample.gro") as fp:
    file = np.genfromtxt(converter(fp), dtype=gro_dt)

print(file)
The output begins:
[( 0., b'1P1', b'aP1', 1, 80.48, 35.36, 4.25)
( 0., b'2P1', b'aP1', 2, 37.45, 3.92, 3.96)
( 0., b'3P2', b'aP2', 3, 18.53, -9.69, 4.68)
( 0., b'4P2', b'aP2', 4, 55.39, 74.34, 4.6 )
( 0., b'5P3', b'aP3', 5, 22.11, 68.71, 3.85)
( 0., b'6P3', b'aP3', 6, -4.13, 24.04, 3.73)
( 0., b'7P4', b'aP4', 7, 40.16, 6.39, 4.73)
( 0., b'8P4', b'aP4', 8, -5.4 , 35.73, 4.85)
( 0., b'9P5', b'aP5', 9, 36.67, 22.45, 4.08)
( 0., b'10P5', b'aP5', 10, -3.68, -10.66, 4.18)
(1000., b'1P1', b'aP1', 1, 80.48, 35.36, 4.25)
(1000., b'2P1', b'aP1', 2, 37.45, 3.92, 3.96)
(1000., b'3P2', b'aP2', 3, 18.53, -9.69, 4.68)
(1000., b'4P2', b'aP2', 4, 55.39, 74.34, 4.6 )
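With `t` stored as a field, a single frame can be pulled out of the structured array with a boolean mask. A minimal sketch, assuming the same field names as `gro_dt` above and a few records typed in by hand from the output rather than read from the file:

```python
import numpy as np

gro_dt = np.dtype(
    [
        ("t", float),
        ("col1", "S4"),
        ("col2", "S4"),
        ("col3", int),
        ("col4", float),
        ("col5", float),
        ("col6", float),
    ]
)

# A handful of records copied from the output above (not the full data set).
records = np.array(
    [
        (0.0, b"1P1", b"aP1", 1, 80.48, 35.36, 4.25),
        (0.0, b"2P1", b"aP1", 2, 37.45, 3.92, 3.96),
        (1000.0, b"1P1", b"aP1", 1, 80.48, 35.36, 4.25),
        (1000.0, b"2P1", b"aP1", 2, 37.45, 3.92, 3.96),
    ],
    dtype=gro_dt,
)

# Select only the rows that belong to the frame at t = 1000.
frame = records[records["t"] == 1000.0]

# Stack the three numeric columns into a plain (n_rows, 3) float array.
xyz = np.column_stack([frame["col4"], frame["col5"], frame["col6"]])
print(frame["col1"])
print(xyz.shape)
```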
Assuming you want to collect the frame data (I'm not sure you can do that for 50 GB...), the code below does just that.
def _is_interesting_line(line_str: str) -> bool:
    # Data lines start with whitespace; header lines do not.
    return bool(line_str) and line_str[0].isspace()

data = []
with open('data.txt') as f:
    for line in f:
        if _is_interesting_line(line):
            data.append(line.strip())
        else:
            print(line.strip())
print('result:')
print(data)
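For a 50 GB file, collecting every line into one list is unlikely to fit in memory. A possible alternative, sketched here under the assumption that every frame starts with a `Generated ...` header followed by exactly one atom-count line, is to yield one frame at a time so that only the current frame is held. `iter_frames` and `SAMPLE` are illustrative names, not part of the answers above.

```python
import io

def iter_frames(lines):
    """Yield (t, rows) per frame; each row is a list of whitespace-split fields."""
    t, rows, skip_count_line = None, [], False
    for line in lines:
        if line.startswith("Generated"):
            if t is not None:
                yield t, rows  # the previous frame is complete
            t = float(line.partition("t=")[-1])
            rows = []
            skip_count_line = True
        elif skip_count_line:
            skip_count_line = False  # ignore the atom-count line
        elif line.strip():
            rows.append(line.split())
    if t is not None:
        yield t, rows  # last frame in the input

# Two tiny frames standing in for the real file (assumption: same layout).
SAMPLE = (
    "Generated by trjconv : P/L=1/400 t= 0.00000\n"
    "11214\n"
    "    1P1    aP1    1  80.48  35.36   4.25\n"
    "    2P1    aP1    2  37.45   3.92   3.96\n"
    "Generated by trjconv : P/L=1/400 t= 1000.000\n"
    "11214\n"
    "    1P1    aP1    1  80.48  35.36   4.25\n"
    "    2P1    aP1    2  37.45   3.92   3.96\n"
)

frames = list(iter_frames(io.StringIO(SAMPLE)))
for t, rows in frames:
    print(t, len(rows))
```

On the real file, the generator would be consumed in a `for` loop over `open(...)` instead of being turned into a list, so each frame can be processed and discarded before the next one is read.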