How to read in a file with multiple coordinates and store them in separate arrays?

I have a file containing about 3000 repeating blocks like the following (the first two are shown):

    21
Profile.   1 HEAT OF FORMATION =   -79.392 KCAL =   -332.175 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21425       -1.35263        1.32579
    C        1.43878        0.44129        1.32579
    O        2.25748       -0.52202        1.23773
    C        0.12570       -0.10907        1.38542
    H       -0.47394        0.10034        2.26424
    C       -2.02530       -1.28825       -2.05204
    C       -0.80697       -0.63466       -2.22403
    H       -0.41632       -0.42983       -3.21532
    H        0.84731        0.28355       -1.21782
    C       -0.09866       -0.24043       -1.09182
    C       -1.83256       -1.15779        0.32994
    C       -0.59706       -0.50055        0.19091
    H       -3.51151       -2.06378       -0.69513
    C       -2.55421       -1.55647       -0.78456
    O       -2.78665       -1.71220       -3.09841
    H       -2.37922       -1.48635       -3.96745
    H        2.21062        3.22762        2.75985
    C        1.91952        1.85374        1.37731
    O        2.22890        2.54529        0.44919
    O        1.92486        2.27899        2.65936
    21
Profile.   2 HEAT OF FORMATION =   -79.390 KCAL =   -332.168 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21674       -1.35282        1.32529
    C        1.43862        0.44132        1.32582
    O        2.25745       -0.52214        1.23772
    C        0.12565       -0.10889        1.38540
    H       -0.47402        0.10051        2.26417
    C       -2.02530       -1.28825       -2.05204
    C       -0.80697       -0.63465       -2.22403
    H       -0.41632       -0.42983       -3.21531
    H        0.84730        0.28355       -1.21782
    C       -0.09865       -0.24043       -1.09182
    C       -1.83256       -1.15780        0.32995
    C       -0.59702       -0.50058        0.19094
    H       -3.51151       -2.06378       -0.69513
    C       -2.55421       -1.55647       -0.78456
    O       -2.78666       -1.71220       -3.09841
    H       -2.37922       -1.48635       -3.96745
    H        2.21061        3.22763        2.75985
    C        1.91953        1.85373        1.37732
    O        2.22890        2.54528        0.44919
    O        1.92486        2.27898        2.65936

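where each set of coordinates is preceded by the number of atoms (21 in this case) and a heat of formation line.

I would like to know how to read each set of coordinates into its own array, so that I can eventually manipulate certain elements of those arrays.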

As I suggested in a comment:

Read all the lines:

In [782]: import numpy as np
In [783]: with open('stack40730696.txt','rb') as f:
     ...:     lines = f.readlines()

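I could read line by line, block by block, and so on, but working with a list of lines is simplest.

Now read the first block. It looks like '21' is the number of data rows: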

In [784]: i=0
In [785]: n=int(lines[i])
In [786]: n
Out[786]: 21
In [787]: i+=1
In [788]: block=lines[i:i+1+n]   # grab the lines of a block, with header
In [789]: block[0]
Out[789]: b'Profile.   1 HEAT OF FORMATION =   -79.392 KCAL =   -332.175 KJ\n'
In [790]: block[-1]
Out[790]: b'    O        1.92486        2.27899        2.65936\n'

Now load the block into an array with genfromtxt and check the result. I printed the whole array while testing, but here I show only a few details.

In [791]: data1=np.genfromtxt(block, skip_header=1,dtype=None)
In [792]: data1.shape
Out[792]: (21,)
In [793]: data1.dtype
Out[793]: dtype([('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])

Advance to the next block and repeat. For the whole file I would of course put this in a loop and collect the data arrays in a list.

In [794]: i=i+1+n
In [795]: n=int(lines[i])
In [796]: i+=1
In [797]: block=lines[i:i+1+n]
In [798]: data2=np.genfromtxt(block, skip_header=1,dtype=None)
In [799]: data2.shape
Out[799]: (21,)
In [800]: data2.dtype
Out[800]: dtype([('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
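The advance-and-repeat steps can be wrapped in a loop. The following is a minimal sketch, not a tested reader for the full file: `read_blocks` and `sample` are hypothetical names, the sample is a shortened toy version of the data with 2 atoms per block instead of 21, and the lines are assumed to be text strings (file opened in text mode rather than 'rb').

```python
import numpy as np

def read_blocks(lines):
    """Return a list of (header, coords) pairs, one per block.

    Each block is assumed to be: a count line, a header line,
    then `count` atom lines of the form `symbol x y z`.
    """
    blocks = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():           # tolerate blank lines between blocks
            i += 1
            continue
        n = int(lines[i])                  # number of atom lines in this block
        block = lines[i + 1 : i + 2 + n]   # header line plus n atom lines
        coords = np.genfromtxt(block, skip_header=1, usecols=[1, 2, 3])
        blocks.append((block[0].strip(), np.atleast_2d(coords)))
        i += 2 + n                         # jump to the next count line
    return blocks

# Toy input: two blocks of two atoms each (shortened from the question's data).
sample = """    2
Profile.   1 HEAT OF FORMATION =   -79.392 KCAL =   -332.175 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21425       -1.35263        1.32579
    2
Profile.   2 HEAT OF FORMATION =   -79.390 KCAL =   -332.168 KJ
    H       -2.22728       -1.35263        1.32579
    H        1.21674       -1.35282        1.32529
""".splitlines(keepends=True)

blocks = read_blocks(sample)
# If every block has the same atom count, the list stacks into one 3-D array.
all_coords = np.stack([c for _, c in blocks])
```

Keeping the blocks in a list works even if the atom count varies; stacking into a single `(n_blocks, n_atoms, 3)` array is only possible when all blocks have the same size.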

A few records:

In [802]: data2[:3]
Out[802]: 
array([(b'H', -2.22728, -1.35263, 1.32579),
       (b'H', 1.21674, -1.35282, 1.32529),
       (b'C', 1.43862, 0.44132, 1.32582)], 
      dtype=[('f0', 'S1'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])

This is a structured array with one string field and three float fields. It could be split into a string array and a (21,3) float array:

In [803]: dataf=np.genfromtxt(block, skip_header=1,usecols=[1,2,3])
In [804]: dataf.shape
Out[804]: (21, 3)
In [805]: dataf.dtype
Out[805]: dtype('float64')
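The same split can also be done from the structured array itself, by field name. This is a sketch under assumptions: the three-atom `block` is a shortened stand-in for a real 21-atom block, and `encoding=None` is passed so the symbol field comes back as text rather than bytes. Having the symbols in their own array makes it easy to pick out, say, only the carbon coordinates.

```python
import numpy as np

# Shortened stand-in for one block: header line plus three atom lines.
block = [
    "Profile.   2 HEAT OF FORMATION =   -79.390 KCAL =   -332.168 KJ\n",
    "    H       -2.22728       -1.35263        1.32579\n",
    "    H        1.21674       -1.35282        1.32529\n",
    "    C        1.43862        0.44132        1.32582\n",
]

# dtype=None lets genfromtxt infer one field per column (f0..f3).
data = np.genfromtxt(block, skip_header=1, dtype=None, encoding=None)

symbols = data['f0']                                             # element symbols
coords = np.column_stack([data['f1'], data['f2'], data['f3']])  # (n_atoms, 3) floats

# Boolean mask on the symbols selects rows of the coordinate array.
carbon_coords = coords[symbols == 'C']
```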