Need to extract tabular data from a text file in python3
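I have output from a quantum chemistry program, and I want to extract tabular data from it to feed into a Python port of a FORTRAN program I wrote about 25 years ago.
Some of the output files are quite long, up to 6000 lines, and cannot be handled with a spreadsheet.
A typical table looks like this: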
CARTESIAN COORDINATES
1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.013339310
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
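I am not asking anyone to write the Python code for me, just for some guidance through the maze of available code.
I suggest you look into np.genfromtxt.
The following snippet reads the sample data from the question, stored in a file named data.txt.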
import numpy as np
data = np.genfromtxt('data.txt', skip_header=2, dtype=[('id', 'i8'),('label','S1'),('x','f8'),('y','f8'),('z','f8')])
print(data)
Output:
[(1, b'C', 0.01198727, -0.00384219, 0.00657878)
(2, b'H', 1.09715291, -0.00395616, 0.01333931)
(3, b'H', -0.34961231, 1.01931673, 0.00190307)
(4, b'H', -0.34427615, -0.51746302, -0.88049529)
(5, b'H', -0.35531564, -0.5132665 , 0.8915679 )]
Regular expressions are made for extracting things from data. If your tables are always well-defined, you can extract them with, for example: https://regex101.com/r/QUT2o3/2
import re
regex = r"(\d+ +\w+(?: +-?\d+\.\d+){3}.+?(?:\n|\Z){2})+"
test_str = (" CARTESIAN COORDINATES\n\n"
" 1 C 0.011987266 -0.003842185 0.006578784\n"
" 2 H 1.097152909 -0.003956163 0.013339310\n"
" 3 H -0.349612312 1.019316731 0.001903075\n"
" 4 H -0.344276148 -0.517463019 -0.880495291\n"
" 5 H -0.355315644 -0.513266496 0.891567896\n\n\n\n"
" CARTESIAN COORDINATES\n\n"
" 1 C 0.011987266 -0.003842185 0.006578784\n"
" 2 H 1.097152909 -0.003956163 0.013339310\n"
" 3 H -0.349612312 1.019316731 0.001903075\n"
" 4 H -0.344276148 -0.517463019 -0.880495291\n"
" 5 H -0.355315644 -0.513266496 0.891567896\n\n\n"
" CARTESIAN COORDINATES\n\n"
" 1 C 0.011987266 -0.003842185 0.006578784\n"
" 2 H 1.097152909 -0.003956163 0.013339310\n"
" 3 H -0.349612312 1.019316731 0.001903075\n"
" 4 H -0.344276148 -0.517463019 -0.880495291\n"
" 5 H -0.355315644 -0.513266496 0.891567896")
Apply the regex:
matches = re.findall(regex, test_str, re.MULTILINE | re.DOTALL)
for m in matches:
    print('\n'.join(x.strip() for x in m.splitlines()))
Output:
1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.013339310
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.013339310
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.013339310
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
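The regex above returns each table as a block of text. If the goal is numbers rather than text, one option is a per-row pattern with capture groups. The following is only a sketch: the names ROW_PATTERN and parse_rows are made up for illustration, and it assumes every row has an integer index, an element symbol, and three decimal numbers, as in the sample from the question.

```python
import re

# One table row: integer index, element symbol, three signed decimals.
# [ \t] is used instead of \s so that a match never spans two lines.
ROW_PATTERN = re.compile(
    r"^[ \t]*(\d+)[ \t]+([A-Za-z]+)"
    r"[ \t]+(-?\d+\.\d+)[ \t]+(-?\d+\.\d+)[ \t]+(-?\d+\.\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_rows(text):
    """Return a list of (index, symbol, x, y, z) tuples found in text."""
    return [
        (int(idx), symbol, float(x), float(y), float(z))
        for idx, symbol, x, y, z in ROW_PATTERN.findall(text)
    ]


sample = (
    " CARTESIAN COORDINATES\n\n"
    " 1 C 0.011987266 -0.003842185 0.006578784\n"
    " 2 H 1.097152909 -0.003956163 0.013339310\n"
    " 3 H -0.349612312 1.019316731 0.001903075\n"
    " 4 H -0.344276148 -0.517463019 -0.880495291\n"
    " 5 H -0.355315644 -0.513266496 0.891567896\n"
)

rows = parse_rows(sample)
for row in rows:
    print(row)
```

Note that this pattern picks up every line in the text that looks like a row, so on a full output file it would need to be applied only to the part of the text that belongs to the table of interest.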
I would use readlines and split.
cc = 'CARTESIAN_COORDINATES.txt'
with open(cc) as data:
    lines = data.readlines()[2:]  # skip first two lines
for line in lines:
    ls = line.split()
    a, b, c, d, e = int(ls[0]), ls[1], float(ls[2]), float(ls[3]), float(ls[4])
    print(a, b, c, d, e)
Output:
1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.01333931
3 H -0.349612312 1.019316731 0.001903075
4 H -0.344276148 -0.517463019 -0.880495291
5 H -0.355315644 -0.513266496 0.891567896
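The same split-based idea can be stretched to a long output file where the table sits between unrelated lines and may appear more than once. The sketch below is one possible way, not a tested parser for any particular program: parse_row and read_tables are hypothetical helper names, and the header text and five-column row layout are assumptions taken from the sample in the question.

```python
def parse_row(line):
    """Return (index, symbol, x, y, z), or None if the line is not a table row."""
    fields = line.split()
    if len(fields) != 5:
        return None
    try:
        return (int(fields[0]), fields[1],
                float(fields[2]), float(fields[3]), float(fields[4]))
    except ValueError:
        return None


def read_tables(lines, header="CARTESIAN COORDINATES"):
    """Collect the rows that follow each occurrence of the header line."""
    tables = []
    current = None  # the table being filled, or None when outside a table
    for line in lines:
        if line.strip() == header:
            current = []
            tables.append(current)
            continue
        if current is None:
            continue
        row = parse_row(line)
        if row is not None:
            current.append(row)
        elif current:
            # First non-row line after at least one row ends the table.
            current = None
    return tables


sample = """some unrelated output
CARTESIAN COORDINATES

1 C 0.011987266 -0.003842185 0.006578784
2 H 1.097152909 -0.003956163 0.013339310

more unrelated output
CARTESIAN COORDINATES

1 C 0.011987266 -0.003842185 0.006578784
"""

tables = read_tables(sample.splitlines())
print(len(tables), [len(t) for t in tables])
```

For a real file, read_tables can be given the open file object directly (with open(path) as f: tables = read_tables(f)), which reads line by line, so a 6000-line file is not a problem.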