Python: How to split an irregularly delimited file?

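I've run into an interesting problem caused by some newly formatted, poorly structured data files. They are comma-delimited text files containing multiple sets of data, each with its own unique header. Originally I used NumPy's genfromtxt to read a single instance of data with a single header. Now that there are multiple instances, genfromtxt can't handle it. What is the best way to split the file and feed each individual instance to genfromtxt? Here is a sample of the file. The data from the first instance runs straight into the header of the second instance. This repeats roughly 20 times per file. I haven't found a common delimiter that would let me separate them.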

       0.8 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.5 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0   72.380  -7.761 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.3 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
       0.0 9999.0 999.0 999.0 999.0 9999.0 9999.0 999.0 999.0 999.0   72.381  -7.760 999.0 999.0 99999.0  9.0  9.0  9.0  9.0  9.0  9.0
      -1.0  906.7  20.0  18.9  92.8  -10.1   -3.7  10.7  70.0 999.0   72.380  -7.761 999.0 999.0   953.8  1.0  1.0  1.0  1.0  1.0  9.0
    Data Type:                         AVAPS SOUNDING DATA, Channel 2/Descending
    Project ID:                        DYNAMO
    Release Site Type/Site ID:         NOAA P3/N43RF 20111116I1
    Release Location (lon,lat,alt):    072 12.04'E, 08 11.50'S, 72.201, -8.192, 966.4
    UTC Release Time (y,m,d,h,m,s):    2011, 11, 16, 04:22:07
    Reference Launch Data Source/Time: IWGADTS Format (IWG1)/04:22:07
    Sonde Id:                          110355308
    System Operator/Comments:          TMR/none, Good Drop
    Post Processing Comments:          Aspen Version 3.1; Created on 01 Feb 2012 23:18 UTC; Configuration research-dropsonde
    /
    /
    Nominal Release Time (y,m,d,h,m,s):2011, 11, 16, 04:22:07
     Time  Press  Temp  Dewpt  RH    Ucmp   Vcmp   spd   dir   Wcmp     Lon     Lat   Ele   Azi    Alt    Qp   Qt   Qrh  Qu   Qv   QdZ
      sec    mb     C     C     %     m/s    m/s   m/s   deg   m/s      deg     deg   deg   deg     m    code code code code code code
    ------ ------ ----- ----- ----- ------ ------ ----- ----- ----- -------- ------- ----- ----- ------- ---- ---- ---- ---- ---- ----
      89.8 1011.6  27.3  23.9  81.0 9999.0 9999.0 999.0 999.0 999.0 9999.000 999.000 999.0 999.0     0.0  1.0  1.0  1.0  9.0  9.0  9.0

You could adapt code like the following. (Python 3 warning: if you want to run this code under Python 2.7+, replace range() with xrange() for efficiency.)

def readSacredAttribute(holyInput):
    # Split "Name (a,b,c): v1, v2" on ':' first, then split every value part on ','.
    raw = [ x.strip() for x in holyInput.readline().rstrip('\n').split(':') ]
    newRaw = []
    for part in raw[ 1 : ]:
        newRaw.extend(x.strip() for x in part.split(','))

    raw[ 1 : ] = newRaw

    # If the name carries a "(y,m,d,...)" list, pair those names with the values by position.
    # Repeated names (such as the two 'm' fields) overwrite each other.
    parameters = {}
    if '(' in raw[0]:
        base = raw[0].index('(') + 1
        to = raw[0].index(')')
        names = [ x.strip() for x in raw[0][ base : to ].split(',') ]
        for name, value in zip(names, raw[ 1 : ]):
            parameters[name] = value

    return (raw, parameters)

def splitThisStupidMess(holyInput):
    # Numeric rows that precede the header block (five in the sample above; adjust as needed).
    holyHeader = []
    for i in range(5):
        holyHeader.append([ float(x) for x in holyInput.readline().split() ])

    # Nine "Name: value" lines, from "Data Type" through "Post Processing Comments".
    sacredAttributes = { x[0][0] : (x[0][1], x[1]) for x in [ readSacredAttribute(holyInput) for i in range(9) ] }

    # Ignore the two '/' lines
    for i in range(2):
        holyInput.readline()

    nominalTime = readSacredAttribute(holyInput)
    sacredAttributes[nominalTime[0][0]] = (nominalTime[0][1], nominalTime[1])

    divineNames = holyInput.readline().split()
    divineUnits = holyInput.readline().split()
    holyInput.readline()    # Skip the dashed separator line
    divineValues = [ float(x) for x in holyInput.readline().split() ]

    divineFooter = { divineNames[i] : (divineUnits[i], divineValues[i]) for i in range(len(divineNames)) }

    return (holyHeader, sacredAttributes, divineFooter)
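
The function above reads one fixed-size block. The question also asks how to split a whole file into its instances, so here is a minimal sketch of one way to do that. It rests on two assumptions taken only from the sample shown: every header begins with a "Data Type:" line and ends with a dashed separator line. The names split_instances, parse_rows, HEADER_MARKER and SEPARATOR_PREFIX are hypothetical, and the sample text is an abbreviated version of the excerpt in the question.

```python
import io

HEADER_MARKER = "Data Type:"   # assumption: each instance's header starts with this line
SEPARATOR_PREFIX = "------"    # assumption: a dashed line closes each header


def split_instances(lines):
    """Group lines into (header_lines, data_lines) pairs, one pair per instance."""
    instances = []
    header, data = [], []
    in_header = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(HEADER_MARKER):
            # A new header begins: store the previous instance, if any.
            if header or data:
                instances.append((header, data))
            header, data = [stripped], []
            in_header = True
        elif in_header:
            header.append(stripped)
            if stripped.startswith(SEPARATOR_PREFIX):
                in_header = False  # numeric rows follow the dashed line
        else:
            data.append(stripped)
    if header or data:
        instances.append((header, data))
    return instances


def parse_rows(data_lines):
    """Convert whitespace-separated numeric rows into lists of floats."""
    return [[float(value) for value in row.split()] for row in data_lines]


# Abbreviated sample: the tail of one instance followed by the start of the next.
sample = """\
  0.0 9999.0 999.0
 -1.0  906.7  20.0
Data Type:   AVAPS SOUNDING DATA
Project ID:  DYNAMO
 Time  Press  Temp
------ ------ -----
 89.8 1011.6  27.3
"""

for header, data in split_instances(io.StringIO(sample)):
    print(len(header), "header lines;", parse_rows(data))
```

Rows that appear before the first header are returned with an empty header list, which matches the excerpt in the question, where the file view starts in the middle of an instance. Instead of parse_rows, each data list could probably be handed to numpy.genfromtxt, which accepts a list of strings as input.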