Slice lines and save parameters into different files

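I have a g.out file (pasted below).

This file contains several FINAL OPTIMIZED GEOMETRY blocks that I want to extract.

For a given FINAL OPTIMIZED GEOMETRY, these highlighted values are the ones I want to extract:

In the program below I managed to extract the first three: VOLUME, A, and B

My code: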

import os
import sys
import re

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'


VOLUMES = []
P0 = []
P2 = []
atomic_number = []
coord_x = []
coord_y = []
coord_z = []

with open('g.out') as file:
    for line in file:
        if re.match(initial_pattern, line):
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)

        if re.match(middle_pattern, line):
            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

            first_coord_line = file.next()
            print first_coord_line

        if re.match(end_pattern, line):
            end_pattern = line
            print end_pattern
            all_coordinates =  [first_coord_line:end_pattern]
            for line in all_coordinates:
              del('F ')             # delete those that contain 'F '
              aux2 =  line.split()
              coords = []


sys.exit()
#Template = 
"""
some stuff
other stuff
p0      p2
3
A    B        C         D
E    F        G         H
I    J        K         L
other stuff
some other stuff
"""

I cannot extract the COORDINATES, because I cannot find a way to slice the lines from first_coord_line to end_pattern, as in this pseudocode:

if re.match(end_pattern, line):
    end_pattern = line
    print end_pattern
    all_coordinates =  [first_coord_line:end_pattern]
    for line in all_coordinates:
      del('F ')             # delete those that contain 'F '
      aux2 =  line.split()  # split lines
      atomic_number = aux2[2]
      coord_x = aux2[4]
      coord_y = aux2[5]
      coord_z = aux2[6]

Is there a way to implement this pseudocode?

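In my code, VOLUMES, P0, P2, atomic_number, coord_x, coord_y and coord_z are initialized as lists because, before the for loop ends, I would like to save the following information into separate files, each named "VOLUME.inp":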

#Template = 
"""
some stuff
other stuff
p0      p2
3
A    B        C         D
E    F        G         H
I    J        K         L
other stuff
some other stuff
"""

where p0 and p2 are the values extracted in my code (the second and third highlighted values in the screenshot), and A-L are atomic_number, coord_x, coord_y, coord_z.

Is there a way to do this?

The g.out file:

more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.28373604     6.28373604     6.28373604    46.646397  46.646397  46.646397
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.924094276183E-01 -7.590572381674E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
      7 F   8 O    -7.590572381674E-03  2.500000000000E-01 -4.924094276183E-01
      8 F   8 O     4.924094276183E-01  7.590572381674E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.924094276183E-01  7.590572381674E-03
     10 F   8 O     7.590572381674E-03 -2.500000000000E-01  4.924094276183E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
      7 F   8 O     7.574276095166E-02  4.090760942850E-01 -8.333333333333E-02
      8 F   8 O     4.090760942850E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.574276095166E-02  8.333333333333E-02
     10 F   8 O    -7.574276095166E-02 -4.090760942850E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   121.143469 - DENSITY  2.740 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32229536     6.32229536     6.32229536    46.436583  46.436583  46.436583
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.927088991116E-01 -7.291100888437E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
      7 F   8 O    -7.291100888437E-03  2.500000000000E-01 -4.927088991116E-01
      8 F   8 O     4.927088991116E-01  7.291100888437E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.927088991116E-01  7.291100888437E-03
     10 F   8 O     7.291100888437E-03 -2.500000000000E-01  4.927088991116E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        363.43040599)
         A              B              C           ALPHA      BETA       GAMMA
     4.98494429     4.98494429    16.88768068    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
      7 F   8 O     7.604223244490E-02  4.093755657782E-01 -8.333333333333E-02
      8 F   8 O     4.093755657782E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.604223244490E-02  8.333333333333E-02
     10 F   8 O    -7.604223244490E-02 -4.093755657782E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

Updated code:

Based on @nos's flag approach, the following code is able to extract the information. VOLUMES is a list with 2 elements. The resulting lists are:

VOLUMES =  ['119.823364', '121.143469']
P0 =  ['4.97568007', '4.98494429']
P2 =  ['16.76591397', '16.88768068']
Xs =  ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys =  ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs =  ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS =  ['20', '6', '8', '20', '6', '8']

The second part of this post is to write this information (P0, P2, ATOMIC_NUMBERS, Xs, Ys, Zs) into two VOLUME.inp files. In other words, something like:

V_119.823364.inp file:

some stuff
other stuff
4.97568007   16.76591397
3
20 0.000000000000E+00    0.000000000000E+00   0.000000000000E+00
6  3.333333333333E-01   -3.333333333333E-01  -8.333333333333E-02
8 -4.090760942850E-01   -3.333333333333E-01  -8.333333333333E-02
other stuff

V_121.143469.inp file:

some stuff
other stuff
4.98494429   16.88768068
3
20 0.000000000000E+00    0.000000000000E+00   0.000000000000E+00
6  3.333333333333E-01   -3.333333333333E-01  -8.333333333333E-02
8 -4.093755657782E-01   -3.333333333333E-01  -8.333333333333E-02
other stuff

Following @nos's suggestion of atoms_per_frame and atoms_all_frames, I tried the code below. I am having difficulty writing to the files element by element, i.e.:

import os
import sys
import re
import glob

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)

        if re.match(middle_pattern, line):
            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

        if re.match(middle_pattern, line):
            passed_mid_point = True
            print 'line = ', line

        if re.match(end_pattern, line):
            passed_mid_point = False

        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms

            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)

                x = terms[4]
                print 'x =', x
                Xs.append(x)

                y = terms[5]
                print 'y = ', y
                Ys.append(y)

                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

# create the empty list of lists:
atoms_all_frames = [[] for _ in xrange(len(VOLUMES))]
print atoms_all_frames

for index_vol in range(len(VOLUMES)):
  for index in range(len(ATOMIC_NUMBERS)):
    atoms_per_frame = [ATOMIC_NUMBERS[index], Xs[index], Ys[index], Zs[index]]
    atoms_all_frames[index_vol].append(atoms_per_frame)

# "atoms_all_frames" would be an appropriate list for looping
print atoms_all_frames

# Remove any existing V*.inp files, to clean first: 
for f in glob.glob("V*.inp"):
  os.remove(f)

# create the files:
for V in VOLUMES:
  filename = "V_{}.d12".format(V)
  print filename

  # open them:
  with open(filename,"a") as f:

   # the following is a pseudo-code, because I cannot manage to 
   # find the way to write element-wise each string to the files:
   for p0, p2, atoms_all_frames:

      f.write("""some stuff
other stuff
%s      %s
%s
%s    %s        %s         %s
%s    %s        %s         %s
%s    %s        %s         %s
other stuff
some other stuff\n""" % p0 % p2 %N_atom_irreducible_unit %atoms_all_frames)

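There are many ways to do this. The important thing is to tell whether you have already passed mid_pattern, because the same coordinate pattern appears both before and after it, and you only need the lines after it.

For example, you can

  1. Set a flag so that we know mid_pattern has been matched
  2. Reset the flag when end_pattern is matched
  3. Branch on the flag to parse the coordinate lines in between
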
    passed_mid_point = False
    ...
    if re.match(middle_pattern, line):
        passed_mid_point = True
        # do what you need
        ...
    if re.match(end_pattern, line):
        passed_mid_point = False # so you can process a new frame
        # do what you need after end pattern is matched
        ...
    elif passed_mid_point:
        # parse the coordinates
        terms = line.split()
        if len(terms) > 1 and terms[1] == 'T':  # separator and blank lines have fewer than 2 terms
            x = float(terms[4])
            y = float(terms[5])
            z = float(terms[6])
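
The flag approach above can be sketched as a small self-contained script. This is only an illustration in Python 3 (the question's code is Python 2). The function name t_coordinates and the shortened SAMPLE text are made up for the example, and the length check on terms is an added guard so that separator lines do not raise an IndexError.

```python
import re

middle_pattern = r'^ CRYSTALLOGRAPHIC CELL '
end_pattern = r'^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

# Shortened sample in the layout of g.out (values copied from the question).
SAMPLE = """\
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
"""


def t_coordinates(lines):
    """Return (x, y, z) floats for 'T' atoms listed after middle_pattern."""
    coords = []
    passed_mid_point = False
    for line in lines:
        if re.match(middle_pattern, line):
            passed_mid_point = True
        if re.match(end_pattern, line):
            passed_mid_point = False  # ready for the next frame
        elif passed_mid_point:
            terms = line.split()
            # Atom lines have exactly 7 whitespace-separated terms.
            if len(terms) == 7 and terms[1] == 'T':
                coords.append(tuple(float(t) for t in terms[4:7]))
    return coords


coords = t_coordinates(SAMPLE.splitlines())
print(coords)
```

Only the two T atoms listed after the CRYSTALLOGRAPHIC CELL line are returned. The primitive-cell coordinates before it are skipped because the flag is still False at that point.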
    

Alternatively, you can combine the flag with a pattern match, like this:

    passed_mid_point = False
    coord_pattern = r'\s+\d+ T '  # leading whitespace, atom index, then the T flag
    ...
    if re.match(middle_pattern, line):
        passed_mid_point = True
        # do what you need
        ...
    if re.match(end_pattern, line):
        passed_mid_point = False # so you can process a new frame
        # do what you need after end pattern is matched
        ...
    if passed_mid_point and re.match(coord_pattern, line):
        # parse the coordinates
        terms = line.split()
        if terms and terms[1] == 'T':
            x = float(terms[4])
            y = float(terms[5])
            z = float(terms[6])

The coordinate matching can also be done entirely with a regular expression:

sci_num = r'-?\d+\.\d*E[+\-]\d+'
coord_pattern = r'\s+\d+\sT\s+\d+\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)' % (sci_num, sci_num, sci_num)
coord_re = re.compile(coord_pattern)
m = coord_re.match(line)  # match() returns a match object, or None if the line does not fit
if m:
    x = float(m.group(1))
    y = float(m.group(2))
    z = float(m.group(3))
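
As a quick check of the regular-expression route, here is a minimal Python 3 sketch run against lines copied from the question's g.out. The helper name parse_t_line is hypothetical, and the extra capture group for the atomic number is an addition that is not part of the pattern above.

```python
import re

sci_num = r'-?\d+\.\d*E[+\-]\d+'
# Groups: atomic number, then x, y, z in scientific notation.
coord_re = re.compile(
    r'\s+\d+\sT\s+(\d+)\s+[A-Z]+\s+(%s)\s+(%s)\s+(%s)' % (sci_num, sci_num, sci_num)
)


def parse_t_line(line):
    """Return (atomic_number, x, y, z) for a 'T' atom line, else None."""
    m = coord_re.match(line)
    if m is None:
        return None
    number, x, y, z = m.groups()
    return (int(number), float(x), float(y), float(z))


t_line = '      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02'
f_line = '      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02'
header = '     ATOM                 X/A                 Y/B                 Z/C'

print(parse_t_line(t_line))   # a tuple for the T atom
print(parse_t_line(f_line))   # None: F atoms are rejected by the pattern
print(parse_t_line(header))   # None: not an atom line
```

Because the pattern itself rejects F atoms and header lines, no separate terms[1] == 'T' test is needed with this variant. The mid_pattern flag is still needed to skip the primitive-cell block.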

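To record the data, it is best to keep track of which frame each set of atomic coordinates belongs to. For example, you can create an atom_frames list at the beginning and keep appending lists of atomic coordinates to it, where each list corresponds to one frame. Overall it looks like this: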

atom_frames = []
for i in range(50): # here I assume 50 frames
    current_frame = []
    for a in atoms_in_this_frame:
        current_frame.append(a)  # a could be (x, y, z) of an atom
    atom_frames.append(current_frame)

Here I simply loop over the number of frames. In your case, you can create current_frame = [] when you hit mid_pattern, and do atom_frames.append(current_frame) when you hit end_pattern. Hope it makes sense.
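
Putting the per-frame idea together with the file-writing part of the question, one possible sketch in Python 3 is shown below. Everything named here (parse_frames, render_inp, FRAME, SAMPLE, TEMPLATE) is hypothetical. The sample is an abbreviated stand-in for g.out with only two T atoms per frame, the shortened initial_pattern prefix is an assumption, and the files are written to a temporary directory only so that the demo does not touch the working directory.

```python
import os
import re
import tempfile

initial_pattern = r'^ FINAL OPTIMIZED GEOMETRY '
volume_re = re.compile(r'^ PRIMITIVE CELL .*VOLUME=\s*(\S+)')
middle_pattern = r'^ CRYSTALLOGRAPHIC CELL '
end_pattern = r'^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

# Abbreviated, illustrative stand-in for one block of g.out.
FRAME = """\
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   {vol} - DENSITY  2.770 g/cm^3
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     {a}     {a}    {c}    90.000000  90.000000 120.000000
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
"""
SAMPLE = (FRAME.format(vol='119.823364', a='4.97568007', c='16.76591397')
          + FRAME.format(vol='121.143469', a='4.98494429', c='16.88768068'))

# Placeholder lines taken from the question's template.
TEMPLATE = """some stuff
other stuff
{p0}      {p2}
{n_atoms}
{atom_lines}
other stuff
some other stuff
"""


def parse_frames(lines):
    """Return one dict (volume, p0, p2, atoms) per FINAL OPTIMIZED GEOMETRY block."""
    frames, frame, passed_mid_point = [], None, False
    lines = iter(lines)
    for line in lines:
        if re.match(initial_pattern, line):
            frame, passed_mid_point = {'atoms': []}, False
        elif frame is None:
            continue  # outside a FINAL OPTIMIZED GEOMETRY block
        elif 'volume' not in frame and volume_re.match(line):
            frame['volume'] = volume_re.match(line).group(1)
        elif re.match(middle_pattern, line):
            passed_mid_point = True
            next(lines)                   # skip the "A B C ALPHA ..." header
            params = next(lines).split()  # lattice parameter values
            frame['p0'], frame['p2'] = params[0], params[2]
        elif re.match(end_pattern, line):
            frames.append(frame)          # frame complete
            frame, passed_mid_point = None, False
        elif passed_mid_point:
            terms = line.split()
            if len(terms) == 7 and terms[1] == 'T':
                frame['atoms'].append((terms[2],) + tuple(terms[4:7]))
    return frames


def render_inp(frame):
    """Fill the template with the values of a single frame."""
    atom_lines = '\n'.join(
        '{:<3s}{:>20s}{:>20s}{:>20s}'.format(*atom) for atom in frame['atoms'])
    return TEMPLATE.format(p0=frame['p0'], p2=frame['p2'],
                           n_atoms=len(frame['atoms']), atom_lines=atom_lines)


frames = parse_frames(SAMPLE.splitlines())
out_dir = tempfile.mkdtemp()  # scratch directory for the demo
for frame in frames:
    path = os.path.join(out_dir, 'V_{}.inp'.format(frame['volume']))
    with open(path, 'w') as f:
        f.write(render_inp(frame))
print(sorted(os.listdir(out_dir)))
```

Keeping each frame's values together in one dict avoids indexing several flat lists (P0, P2, Xs, ...) in parallel, which is where the element-wise writing in the question gets stuck. With the real file, the lines could come from open('g.out') instead of SAMPLE.splitlines().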