Extract and process information between two strings that are repeated multiple times in a file

I have a file with the following structure:

 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   122.771603 - DENSITY  2.704 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32540491     6.32540491     6.32540491    46.774144  46.774144  46.774144
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.912600492192E-01 -8.739950780750E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
      7 F   8 O    -8.739950780750E-03  2.500000000000E-01 -4.912600492193E-01
      8 F   8 O     4.912600492193E-01  8.739950780750E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.912600492193E-01  8.739950780750E-03
     10 F   8 O     8.739950780750E-03 -2.500000000000E-01  4.912600492193E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)
         A              B              C           ALPHA      BETA       GAMMA
     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    0.000000000000E+00  0.000000000000E+00 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
      7 F   8 O     7.459338255258E-02  4.079267158859E-01 -8.333333333333E-02
      8 F   8 O     4.079267158859E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.459338255258E-02  8.333333333333E-02
     10 F   8 O    -7.459338255258E-02 -4.079267158859E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT


more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.28373604     6.28373604     6.28373604    46.646397  46.646397  46.646397
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.924094276183E-01 -7.590572381674E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
      7 F   8 O    -7.590572381674E-03  2.500000000000E-01 -4.924094276183E-01
      8 F   8 O     4.924094276183E-01  7.590572381674E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.924094276183E-01  7.590572381674E-03
     10 F   8 O     7.590572381674E-03 -2.500000000000E-01  4.924094276183E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
      7 F   8 O     7.574276095166E-02  4.090760942850E-01 -8.333333333333E-02
      8 F   8 O     4.090760942850E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.574276095166E-02  8.333333333333E-02
     10 F   8 O    -7.574276095166E-02 -4.090760942850E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   121.143469 - DENSITY  2.740 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32229536     6.32229536     6.32229536    46.436583  46.436583  46.436583
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.927088991116E-01 -7.291100888437E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
      7 F   8 O    -7.291100888437E-03  2.500000000000E-01 -4.927088991116E-01
      8 F   8 O     4.927088991116E-01  7.291100888437E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.927088991116E-01  7.291100888437E-03
     10 F   8 O     7.291100888437E-03 -2.500000000000E-01  4.927088991116E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        363.43040599)
         A              B              C           ALPHA      BETA       GAMMA
     4.98494429     4.98494429    16.88768068    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
      7 F   8 O     7.604223244490E-02  4.093755657782E-01 -8.333333333333E-02
      8 F   8 O     4.093755657782E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.604223244490E-02  8.333333333333E-02
     10 F   8 O    -7.604223244490E-02 -4.093755657782E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

I want to extract the CRYSTALLOGRAPHIC CELL information, but only the one that belongs to a FINAL OPTIMIZED GEOMETRY block.

The following three patterns:

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

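make it possible to locate the information.

First, I define a flag, passed_mid_point = False.

Then the following part of the program extracts the VOLUME of each FINAL OPTIMIZED GEOMETRY block (read from the PRIMITIVE CELL line):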

VOLUMES = []
with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)
print 'VOLUMES = ', VOLUMES

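This is correct, since VOLUMES = ['119.823364', '121.143469']. Note that the initial 122.771603 (see the file above) is not collected in the list, as expected.

The problem appears when extracting A and C (P0 and P2 in my program), the CRYSTALLOGRAPHIC CELL parameters of the FINAL OPTIMIZED GEOMETRY, together with the coordinates: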

        if re.match(middle_pattern, line):
            passed_mid_point = True

            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

        if re.match(end_pattern, line):
            passed_mid_point = False

        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
#           print 'terms[1] =', terms[1]

            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)

                x = terms[4]
                print 'x =', x
                Xs.append(x)

                y = terms[5]
                print 'y = ', y
                Ys.append(y)

                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

The result is:

P0 =  ['5.02162261', '4.97568007', '4.98494429']

This is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY block (see the file).

The coordinates are wrong as well:

Xs =  ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys =  ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs =  ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS =  ['20', '6', '8', '20', '6', '8', '20', '6', '8']

This would be the desired result:

VOLUMES =  ['119.823364', '121.143469']
P0 = ['4.97568007', '4.98494429']
P2 = ['16.76591397', '16.88768068']
Xs =  ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys =  ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs =  ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS =  ['20', '6', '8', '20', '6', '8']

I would be grateful if you could help me.

The whole code:

import sys
import re
import os

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)

        if re.match(middle_pattern, line):
            passed_mid_point = True

            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

        if re.match(end_pattern, line):
            passed_mid_point = False

        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
#           print 'terms[1] =', terms[1]

            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)

                x = terms[4]
                print 'x =', x
                Xs.append(x)

                y = terms[5]
                print 'y = ', y
                Ys.append(y)

                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

I wrote a simplified version of your script that seems to work. I hope it can serve as a starting point for your final script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL' in line:
            if not final_optimized_geometry:
                continue
            volume = line.split()[7]
            VOLUMES.append(volume)
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            line = gout.readline()
            p0, p2 = line.split()[0:3:2]

            P0.append(p0)
            P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            gout.readline()
            while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                line = gout.readline()
                atomdata = line.split()
                if not atomdata or atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)
            final_optimized_geometry = False


print(VOLUMES)
print(P0)
print(P2)
print(ATOMIC_NUMBERS)
print(Xs)
print(Ys)
print(Zs)

This produces the following output:

['119.823364', '121.143469']
['4.97568007', '4.98494429']
['16.76591397', '16.88768068']
['20', '6', '8', '20', '6', '8']
['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']

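In fact, it is a very simple finite state machine with only two states. Caveat: it will not work if there are several crystallographic cells within one final optimized geometry. In that case, it only captures the information of the first cell.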
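The two-state idea can be sketched in isolation as follows. This is only a minimal illustration, assuming the section markers shown in the question; the function name `parse_final_cells` and the shortened `SAMPLE` text are made up for the example, and the flag is reset right after the cell row because this sketch does not read the coordinates.

```python
import io

# Shortened, made-up sample that mimics the layout of the file in the question.
SAMPLE = u"""\
 CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)
         A              B              C           ALPHA      BETA       GAMMA
     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000
"""


def parse_final_cells(lines):
    """Return (A, C) pairs for cells that follow a FINAL OPTIMIZED GEOMETRY line."""
    lines = iter(lines)
    in_final = False  # state: False = outside, True = inside a final geometry
    cells = []
    for line in lines:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            in_final = True
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line and in_final:
            next(lines)                   # skip the "A B C ALPHA ..." title line
            fields = next(lines).split()  # the numeric row
            cells.append((fields[0], fields[2]))
            in_final = False              # back to "outside" until the next marker
    return cells


print(parse_final_cells(io.StringIO(SAMPLE)))
# prints [('4.97568007', '16.76591397')]
```

The first cell (5.02162261) is ignored because the flag is still False when its header line is seen, which is exactly the behaviour the question asks for.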

The code also makes other assumptions about the file, which may of course need to be verified.
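One such assumption is that the volume is always the eighth whitespace-separated token of the PRIMITIVE CELL line. A possible way to make that less fragile is to look the value up by its `VOLUME=` label instead of by position. The helper name `parse_volume` is hypothetical, and the sketch assumes the label and the number are always separated by whitespace, as in the sample file.

```python
def parse_volume(line):
    """Return the token that follows 'VOLUME=' on a PRIMITIVE CELL line."""
    tokens = line.split()
    if 'VOLUME=' not in tokens:
        # Fail loudly instead of silently picking up a wrong column.
        raise ValueError('no VOLUME= field in line: %r' % line)
    return tokens[tokens.index('VOLUME=') + 1]


LINE = ' PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3'
print(parse_volume(LINE))
# prints 119.823364
```

Note that the CRYSTALLOGRAPHIC CELL line writes the label as `(VOLUME=`, so this helper deliberately rejects it.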

I avoided regular expressions.

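This code only runs on Python 3 (tested with Python 3.6.2). Python 2.7 chokes on the use of readline() inside a file-iteration block (which makes some sense, but it is nice to see that Python 3 accepts it). We use readline() as a small hack to skip lines in the input file that we know must be skipped, without going through the whole loop again (which would require more flag variables).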
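If the script has to run on both versions, one option is the built-in `next()`, which advances the same iterator that the `for` loop is using and is available in Python 2.6+ as well as Python 3. A minimal sketch, with a made-up `SAMPLE` and a hypothetical helper `read_cell_row`:

```python
import io

# Made-up three-line sample in the layout used by the question's file.
SAMPLE = u"""\
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000
"""


def read_cell_row(handle):
    """Return the numeric row that follows the CRYSTALLOGRAPHIC CELL header."""
    for line in handle:
        if 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            next(handle)                 # skip the column-title line
            return next(handle).split()  # split the numeric row into fields
    return None                          # header never found


row = read_cell_row(io.StringIO(SAMPLE))
print(row)
```

Keep in mind that `next()` raises `StopIteration` if the file ends before the expected line, so a truncated file would need to be handled separately.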

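By the way, if your only task is parsing text files, it might be interesting to look at dedicated languages such as Lex. Also, Perl was designed for exactly this kind of job, more so than Python.

Hope this helps!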

Thanks to @Bart Van Loon for all the help. A simpler version of the code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

filename = 'g.out'

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open(filename) as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL - CENTRING CODE' in line:
            if final_optimized_geometry:
                volume = line.split()
                print volume
                print volume[7]
                volume = line.split()[7]
                VOLUMES.append(volume)

        elif ' CRYSTALLOGRAPHIC CELL (V' in line:
            if final_optimized_geometry:
                print 'gout.next() =', gout.next()
                done = gout.next()
                print 'done =', done
                p0 = done.split()[0]
                p2 = done.split()[2]

#               p0, p2 = done.split()[0:3:2]

                P0.append(p0)
                P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if final_optimized_geometry:
                gout.next()
                gout.next()
                while True:
                    line = gout.next()
                    atomdata = line.split()
                    if not atomdata:
                        break
                    if atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False



print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

where:

1) Since the line after the last atom (the 10th atom in this example) is a blank line,

                    if not atomdata:
                        break

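always stops as soon as atomdata is empty. In other words, the loop always stops at a blank line, that is, when the atom list ends. This makes it possible to avoid the while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line: statement.

I thought a similar statement would be: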

                    if  atomdata:   
                        continue

However, for a reason I do not understand, this does not make the non-empty lines the only ones that get parsed. Why?

2) This part of the code:

                if atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)

can also be written as:

              if atomdata[1] == 'T':
                  atomicnumber = atomdata[2]
                  x, y, z = atomdata[4:7]
                  ATOMIC_NUMBERS.append(atomicnumber)
                  Xs.append(x)
                  Ys.append(y)
                  Zs.append(z)
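On the question in point 1), a likely explanation is that `continue` does not mean "carry on with this line": it jumps straight to the next loop iteration. With `if atomdata: continue`, every non-empty line is therefore skipped before it is parsed, and the first blank line reaches `atomdata[1]` with an empty list, which raises `IndexError`. The sketch below contrasts the two forms on a made-up list of lines; `collect_with_break` and `collect_with_continue` are hypothetical names.

```python
# Made-up rows: two atoms, the blank line that ends the list, and the footer.
ROWS = [
    '      1 T  20 CA    0.0E+00  0.0E+00  0.0E+00',
    '      2 F  20 CA    0.0E+00  0.0E+00 -5.0E-01',
    '',
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT',
]


def collect_with_break(rows):
    """Stop at the first blank line; parse every line before it."""
    found = []
    for row in rows:
        atomdata = row.split()
        if not atomdata:
            break                    # blank line: the atom list is over
        if atomdata[1] == 'T':
            found.append(atomdata[2])
    return found


def collect_with_continue(rows):
    """Skip every non-blank line, so only blank lines reach the parsing code."""
    found = []
    for row in rows:
        atomdata = row.split()
        if atomdata:
            continue                 # non-empty line: skipped, never parsed
        try:
            atomdata[1]              # empty list: this lookup fails
        except IndexError:
            found.append('IndexError on blank line')
    return found


print(collect_with_break(ROWS))      # prints ['20']
print(collect_with_continue(ROWS))   # prints ['IndexError on blank line']
```

So `if not atomdata: break` and `if atomdata: continue` are not mirror images of each other: the first ends the loop at the blank line, the second throws away exactly the lines that should be analysed.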