Extract and process information between two strings that are repeated multiple times in a file
I have a file with the following structure:
LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 122.771603 - DENSITY 2.704 g/cm^3
A B C ALPHA BETA GAMMA
6.32540491 6.32540491 6.32540491 46.774144 46.774144 46.774144
*******************************************************************************
ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
5 T 8 O -4.912600492192E-01 -8.739950780750E-03 2.500000000000E-01
6 F 8 O 2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
7 F 8 O -8.739950780750E-03 2.500000000000E-01 -4.912600492193E-01
8 F 8 O 4.912600492193E-01 8.739950780750E-03 -2.500000000000E-01
9 F 8 O -2.500000000000E-01 4.912600492193E-01 8.739950780750E-03
10 F 8 O 8.739950780750E-03 -2.500000000000E-01 4.912600492193E-01
TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
*******************************************************************************
CRYSTALLOGRAPHIC CELL (VOLUME= 368.31480902)
A B C ALPHA BETA GAMMA
5.02162261 5.02162261 16.86554607 90.000000 90.000000 120.000000
COORDINATES IN THE CRYSTALLOGRAPHIC CELL
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA 0.000000000000E+00 0.000000000000E+00 -5.000000000000E-01
3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
5 T 8 O -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
6 F 8 O 3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
7 F 8 O 7.459338255258E-02 4.079267158859E-01 -8.333333333333E-02
8 F 8 O 4.079267158859E-01 3.333333333333E-01 8.333333333333E-02
9 F 8 O -3.333333333333E-01 7.459338255258E-02 8.333333333333E-02
10 F 8 O -7.459338255258E-02 -4.079267158859E-01 8.333333333333E-02
T = ATOM BELONGING TO THE ASYMMETRIC UNIT
more lines
more lines
more lines
FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
(NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
*******************************************************************************
LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3
A B C ALPHA BETA GAMMA
6.28373604 6.28373604 6.28373604 46.646397 46.646397 46.646397
*******************************************************************************
ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
5 T 8 O -4.924094276183E-01 -7.590572381674E-03 2.500000000000E-01
6 F 8 O 2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
7 F 8 O -7.590572381674E-03 2.500000000000E-01 -4.924094276183E-01
8 F 8 O 4.924094276183E-01 7.590572381674E-03 -2.500000000000E-01
9 F 8 O -2.500000000000E-01 4.924094276183E-01 7.590572381674E-03
10 F 8 O 7.590572381674E-03 -2.500000000000E-01 4.924094276183E-01
TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
*******************************************************************************
CRYSTALLOGRAPHIC CELL (VOLUME= 359.47009054)
A B C ALPHA BETA GAMMA
4.97568007 4.97568007 16.76591397 90.000000 90.000000 120.000000
COORDINATES IN THE CRYSTALLOGRAPHIC CELL
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
6 F 8 O 3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
7 F 8 O 7.574276095166E-02 4.090760942850E-01 -8.333333333333E-02
8 F 8 O 4.090760942850E-01 3.333333333333E-01 8.333333333333E-02
9 F 8 O -3.333333333333E-01 7.574276095166E-02 8.333333333333E-02
10 F 8 O -7.574276095166E-02 -4.090760942850E-01 8.333333333333E-02
T = ATOM BELONGING TO THE ASYMMETRIC UNIT
INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
more lines
more lines
more lines
FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
(NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
*******************************************************************************
LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3
A B C ALPHA BETA GAMMA
6.32229536 6.32229536 6.32229536 46.436583 46.436583 46.436583
*******************************************************************************
ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA 5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
5 T 8 O -4.927088991116E-01 -7.291100888437E-03 2.500000000000E-01
6 F 8 O 2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
7 F 8 O -7.291100888437E-03 2.500000000000E-01 -4.927088991116E-01
8 F 8 O 4.927088991116E-01 7.291100888437E-03 -2.500000000000E-01
9 F 8 O -2.500000000000E-01 4.927088991116E-01 7.291100888437E-03
10 F 8 O 7.291100888437E-03 -2.500000000000E-01 4.927088991116E-01
TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
*******************************************************************************
CRYSTALLOGRAPHIC CELL (VOLUME= 363.43040599)
A B C ALPHA BETA GAMMA
4.98494429 4.98494429 16.88768068 90.000000 90.000000 120.000000
COORDINATES IN THE CRYSTALLOGRAPHIC CELL
ATOM X/A Y/B Z/C
*******************************************************************************
1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
2 F 20 CA -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
6 F 8 O 3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
7 F 8 O 7.604223244490E-02 4.093755657782E-01 -8.333333333333E-02
8 F 8 O 4.093755657782E-01 3.333333333333E-01 8.333333333333E-02
9 F 8 O -3.333333333333E-01 7.604223244490E-02 8.333333333333E-02
10 F 8 O -7.604223244490E-02 -4.093755657782E-01 8.333333333333E-02
T = ATOM BELONGING TO THE ASYMMETRIC UNIT
INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
more lines
more lines
more lines
I want to extract the CRYSTALLOGRAPHIC CELL information, but only from the blocks that follow FINAL OPTIMIZED GEOMETRY.
The following three patterns:
initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'
make it possible to locate the information.
First, I define a flag, passed_mid_point = False. The following part of the program then extracts the VOLUME reported under each FINAL OPTIMIZED GEOMETRY:
VOLUMES = []
with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()
            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)
print 'VOLUMES = ', VOLUMES
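This works correctly, since VOLUMES = ['119.823364', '121.143469']. Note that the initial 122.771603 (see the original file) is not collected in the list, as intended.
The problem appears when extracting the A and C parameters (P0 and P2 in my program) of the CRYSTALLOGRAPHIC CELL under FINAL OPTIMIZED GEOMETRY, together with the coordinates: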
        if re.match(middle_pattern, line):
            passed_mid_point = True
            print line
            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5]
            print p0
            print p2
            P0.append(p0)
            P2.append(p2)
            print file.next()
            print file.next()
            print file.next()
            print file.next()
        if re.match(end_pattern, line):
            passed_mid_point = False
        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
            # print 'terms[1] =', terms[1]
            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)
                x = terms[4]
                print 'x =', x
                Xs.append(x)
                y = terms[5]
                print 'y = ', y
                Ys.append(y)
                z = terms[6]
                print 'z = ', z
                Zs.append(z)
print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS
The result is:
P0 = ['5.02162261', '4.97568007', '4.98494429']
This is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY block (see the file).
The coordinates are wrong as well:
Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8', '20', '6', '8']
This would be the desired result:
VOLUMES = ['119.823364', '121.143469']
P0 = ['4.97568007', '4.98494429']
P2 = ['16.76591397', '16.88768068']
Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']
I would be very grateful for any help.
The whole code:
import sys
import re
import os

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()
            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)
        if re.match(middle_pattern, line):
            passed_mid_point = True
            print line
            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5]
            print p0
            print p2
            P0.append(p0)
            P2.append(p2)
            print file.next()
            print file.next()
            print file.next()
            print file.next()
        if re.match(end_pattern, line):
            passed_mid_point = False
        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
            # print 'terms[1] =', terms[1]
            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)
                x = terms[4]
                print 'x =', x
                Xs.append(x)
                y = terms[5]
                print 'y = ', y
                Ys.append(y)
                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS
I wrote a simplified version of your script, which seems to work. I hope it can serve as a starting point for your final script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL' in line:
            if not final_optimized_geometry:
                continue
            volume = line.split()[7]
            VOLUMES.append(volume)
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            line = gout.readline()
            p0, p2 = line.split()[0:3:2]
            P0.append(p0)
            P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            gout.readline()
            while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                line = gout.readline()
                atomdata = line.split()
                if not atomdata or atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)
            final_optimized_geometry = False

print(VOLUMES)
print(P0)
print(P2)
print(ATOMIC_NUMBERS)
print(Xs)
print(Ys)
print(Zs)
This produces the following output:
['119.823364', '121.143469']
['4.97568007', '4.98494429']
['16.76591397', '16.88768068']
['20', '6', '8', '20', '6', '8']
['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
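In fact, it is a very simple finite state machine with only two states. Caveat: it will not work if there is more than one crystallographic cell inside a single final optimized geometry. In that case, it will only capture the information of the first cell.
The code also makes other assumptions about the file, which of course may need to be verified.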
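To make the two-state idea explicit, here is a minimal Python 3 sketch. It is not the script above: the function name `final_volumes` and the shortened `SAMPLE` text are made up for illustration, and the state is reset right after the volume is read (a simplification, since the script above resets it after the coordinates).

```python
import io


def final_volumes(lines):
    """Return the primitive-cell volumes that follow a FINAL OPTIMIZED GEOMETRY line."""
    in_final = False              # state: False = skipping, True = collecting
    volumes = []
    for line in lines:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            in_final = True       # switch to the collecting state
        elif in_final and 'PRIMITIVE CELL - CENTRING CODE' in line:
            volumes.append(line.split()[7])   # token right after 'VOLUME='
            in_final = False      # back to skipping until the next final geometry
    return volumes


# Hypothetical, heavily shortened stand-in for g.out.
SAMPLE = """\
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 122.771603 - DENSITY 2.704 g/cm^3
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3
"""

volumes = final_volumes(io.StringIO(SAMPLE))
print(volumes)
```

The first volume is ignored because no FINAL OPTIMIZED GEOMETRY line has been seen yet, which is exactly the gating that the flag provides.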
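I avoided using regular expressions.
This code only runs on Python 3 (tested with Python 3.6.2). Python 2.7 chokes on the use of readline() inside a file iteration block (which kind of makes sense, but it is nice to see that Python 3 is fine with it). We are using readline() as a small hack to skip lines in the input file that we know must be skipped, without going through the whole loop again (which would require more flag variables).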
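As a possible alternative to `readline()`, the built-in `next()` can be called on the same iterator that the `for` loop is consuming, which avoids mixing iteration with read methods. The sketch below is written for Python 3 and only illustrates the skipping technique; `read_cell_parameters` and the in-memory `SAMPLE` are hypothetical names, and it assumes the numeric line sits exactly two lines below the cell header.

```python
import io


def read_cell_parameters(lines):
    """Collect the numeric line found two lines below each cell header."""
    lines = iter(lines)      # the for loop and next() must share one iterator
    found = []
    for line in lines:
        if 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            next(lines)                         # skip the 'A B C ALPHA BETA GAMMA' header
            found.append(next(lines).split())   # keep the values as a list of strings
    return found


# Hypothetical, shortened stand-in for g.out.
SAMPLE = """\
 CRYSTALLOGRAPHIC CELL (VOLUME= 359.47009054)
 A B C ALPHA BETA GAMMA
 4.97568007 4.97568007 16.76591397 90.000000 90.000000 120.000000
"""

params = read_cell_parameters(io.StringIO(SAMPLE))
print(params[0][0], params[0][2])
```

Note that `next(lines)` raises StopIteration if the input ends right after a header; passing a default, as in `next(lines, '')`, is one way to guard against a truncated file.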
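By the way, if your only task is parsing text files, it might be interesting to look at a specialized language such as Lex. Perl was also designed for this kind of job, more so than Python.
Hope this helps!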
Thanks to all the help from @Bart Van Loon, a simpler version of the code is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

filename = 'g.out'

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open(filename) as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL - CENTRING CODE' in line:
            if final_optimized_geometry:
                volume = line.split()
                print volume
                print volume[7]
                volume = line.split()[7]
                VOLUMES.append(volume)
        elif ' CRYSTALLOGRAPHIC CELL (V' in line:
            if final_optimized_geometry:
                print 'gout.next() =', gout.next()
                done = gout.next()
                print 'done =', done
                p0 = done.split()[0]
                p2 = done.split()[2]
                # p0, p2 = done.split()[0:3:2]
                P0.append(p0)
                P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if final_optimized_geometry:
                gout.next()
                gout.next()
                while True:
                    line = gout.next()
                    atomdata = line.split()
                    if not atomdata:
                        break
                    if atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS
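Where:
1) Because the line that follows the last atom (the 10th atom in this case) is a blank line,
    if not atomdata:
        break
always stops when atomdata is empty. In other words, the loop always stops at a blank line, i.e. when the list of atoms ends. This makes it possible to avoid the while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line: statement.
A similar statement would be:
    if atomdata:
        continue
However, for some reason I do not understand, this fails to treat the non-empty lines as the only ones to be analysed. Why?
2) This part of the code:
    if atomdata[1] != 'T':
        continue
    atomicnumber = atomdata[2]
    x, y, z = atomdata[4:7]
    ATOMIC_NUMBERS.append(atomicnumber)
    Xs.append(x)
    Ys.append(y)
    Zs.append(z)
can also be expressed as:
    if atomdata[1] == 'T':
        atomicnumber = atomdata[2]
        x, y, z = atomdata[4:7]
        ATOMIC_NUMBERS.append(atomicnumber)
        Xs.append(x)
        Ys.append(y)
        Zs.append(z)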
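On the question in point 1, a likely explanation is that `continue` jumps straight to the next iteration. With `if atomdata: continue`, every non-empty line is skipped before the parsing statements are reached, and the first blank line falls through to `atomdata[1]`, which fails on an empty list. The Python 3 sketch below contrasts the two variants; `collect_t_atoms`, its `stop_on_blank` switch and the `ROWS` list are hypothetical names used only for illustration, and it assumes (as stated above) that a blank line ends the atom list.

```python
def collect_t_atoms(rows, stop_on_blank=True):
    """Return atomic numbers of 'T' rows, mimicking the two loop variants."""
    numbers = []
    for row in rows:
        atomdata = row.split()
        if stop_on_blank:
            if not atomdata:
                break        # blank line: the atom list is over, leave the loop
        else:
            if atomdata:
                continue     # non-empty line: skipped, nothing below runs for it
        if atomdata[1] != 'T':
            continue
        numbers.append(atomdata[2])
    return numbers


ROWS = [
    '1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00',
    '2 F 20 CA 0.000000000000E+00 0.000000000000E+00 -5.000000000000E-01',
    '3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02',
    '',
    'more lines',
]

print(collect_t_atoms(ROWS))    # 'break' variant

try:
    collect_t_atoms(ROWS, stop_on_blank=False)   # 'continue' variant
except IndexError as exc:
    print('continue variant failed:', exc)
```

The `break` variant collects the two 'T' rows and stops at the blank line, while the `continue` variant never parses an atom row and raises IndexError on the blank one.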