Collapse multiple lines when reading flat file in python

Question

I want to parse a flat file that looks like the following in Python:

  Element ID     Element Type     Result       Jacobian Sign    

============== ================= ========= =====================
      1            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      2            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      3            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      4            Parabolic      Warning          1.000000

I tried the mechanism used in this answer, as follows:

import pandas as pd

def parse_file(file):
    col_spec = [(0, 15), (16, 33), (34, 43), (44, 65)]
    return pd.read_fwf(file, colspecs=col_spec)

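However, it reads the first line as one record, and then reads the second line as another record that is empty apart from the word 'Hexahedron' in the Element Type column.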

>>> data = parse_file("example.txt")
>>> data.head()
       Element ID      Element Type    Result         Jacobian Sign
0             NaN               NaN       NaN                   NaN
1  ==============  ================  ========  ====================
2               1         Parabolic   Warning              1.000000
3             NaN        Hexahedron       NaN                   NaN <= Extra record
4               2         Parabolic   Warning              1.000000

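As the output shows, the first two data lines are captured as two records (rows 2 and 3). I would like the parser to capture those two lines as a single record, so that the phrase 'Parabolic Hexahedron' is captured as the element type. How can I do this?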

Answer 1

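Some post-processing should do the trick. Below is some code that uses the shift method. Also note that there is no need to open the file yourself; just pass the filename to pd.read_fwf.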

import pandas as pd

col_spec = [(0, 15), (15, 32), (32, 42), (43, 65)]
df = pd.read_fwf("example.txt", colspecs=col_spec, comment="=")

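# Combine rows: append the next row's Element Type (the continuation line)
# to each row that has an Element ID; other rows keep their own value.
df["combined"] = (df["Element Type"] + df["Element Type"].shift(-1)).where(
    df["Element ID"].notnull(), df["Element Type"]
)
# Remove the extra rows, i.e. the continuation lines without an Element ID.
df = df[df["Element ID"].notnull()]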

This should give a DataFrame like the following:

  Element ID Element Type   Result Jacobian Sign             combined
2          1    Parabolic  Warning      1.000000  ParabolicHexahedron
4          2    Parabolic  Warning      1.000000  ParabolicHexahedron
6          3    Parabolic  Warning      1.000000  ParabolicHexahedron
8          4    Parabolic  Warning      1.000000  ParabolicHexahedron
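
The combined column above joins the two fragments without a space, and shift(-1) assumes every record has exactly one continuation line. As a possible variation (a minimal sketch, not part of the original answer), the continuation lines can be grouped with the record they follow and joined with a space. The make_line helper, the in-memory sample, and the collapse_records name are assumptions made for illustration; the column boundaries are the ones from the answer.

```python
import io

import pandas as pd

# Column boundaries taken from the answer above.
COL_SPEC = [(0, 15), (15, 32), (32, 42), (43, 65)]


def make_line(*fields):
    """Hypothetical helper: centre each field in the widths of the sample file."""
    widths = (14, 17, 9, 21)
    return " ".join(f"{field:^{width}}" for field, width in zip(fields, widths))


# Small in-memory stand-in for example.txt; the last record has no second line.
sample = "\n".join([
    make_line("Element ID", "Element Type", "Result", "Jacobian Sign"),
    make_line("=" * 14, "=" * 17, "=" * 9, "=" * 21),
    make_line("1", "Parabolic", "Warning", "1.000000"),
    make_line("", "Hexahedron", "", ""),
    make_line("2", "Parabolic", "Warning", "1.000000"),
    make_line("", "Hexahedron", "", ""),
    make_line("3", "Parabolic", "Warning", "1.000000"),
])


def collapse_records(source):
    # Read everything as text so the "====" separator row cannot upset dtypes.
    df = pd.read_fwf(source, colspecs=COL_SPEC, dtype=str)
    # Drop the separator row and any all-blank rows.
    df = df[~df["Element ID"].fillna("").str.startswith("=")].dropna(how="all")
    # Number the records: a new record starts wherever an Element ID is present.
    record_no = df["Element ID"].notna().cumsum()
    # Join the Element Type fragments of each record with a single space.
    joined = df["Element Type"].groupby(record_no).agg(
        lambda s: " ".join(s.dropna())
    )
    # Keep only the first line of each record and attach the joined type.
    records = df[df["Element ID"].notna()].copy()
    records["Element Type"] = record_no.loc[records.index].map(joined)
    return records.reset_index(drop=True)


result = collapse_records(io.StringIO(sample))
print(result)
```

With this sketch the first two records get 'Parabolic Hexahedron' as their Element Type, while the last record, which has no continuation line, keeps 'Parabolic'. All columns are read as strings here, so numeric columns such as Jacobian Sign would still need converting, for example with pd.to_numeric.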
