How to parse and tabulate a file with repeated values

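I'm trying to tabulate a text file that has the format shown below. It consists of blocks of data that occur multiple times. The first 5 fields usually appear only once per block of information, and in the output I would like them filled in on every row of that block (the green values).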

SOME TEXT

SOME TEXT

SOME TEXT
GSHSH = 0 OK:SUCCESS

                ABC = 1
                TDE = 0
            TNLH = WL_CS
            TKKJW = ZZR
            MBTYIE = PRM
            MHGT = 165
            MRLL = CTM
            TTDDX = 0
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE

            MHGT = 211
            MRLL = CTM
            TTDDX = 0
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE

            MHGT = 32
            MRLL = CTM
            TTDDX = 0
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE


SOME TEXT

SOME TEXT

SOME TEXT
GSHSH = 23 OK:SUCCESS

                ABC = 1
                TDE = 0
            TNLH = WL_PS
            KKJW = ZZZN
            MBTYIE = PRM
            MHGT = 9254
            MRLL = PRM
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE


SOME TEXT

SOME TEXT

SOME TEXT
GSHSH = 0 OK:SUCCESS

                ABC = 1
                TDE = 1
            TNLH = RTC_RMN
            TKKJW = ZZR
            BTYIE = RTC
            MHGT = 1150
            MRLL = PRM
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE

            MHGT = 41
            MRLL = CTM
            TTDDX = 0
            ZDTR = FALSE
            UEEM = FALSE
            KQTY = FALSE

SOME TEXT

SOME TEXT

SOME TEXT
GSHSH = 1 OK:SUCCESS

The output I'm looking for is like this:

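My current code is shown below. I'm able to read the data and store the values in a defaultdict. After that, I try to convert it to a pandas DataFrame, but I get an error. I'm also stuck on how to arrange the values so that they are printed in the correct columns. Thanks for any help.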

import re
from collections import defaultdict
from tabulate import tabulate
import pandas as pd

file = 'file.txt'
f=open(file,"r").read().splitlines()

lst=[]
for line in f:
    if re.match(r'[ \t]', line):
        lst.append(line.replace(' ', '').split('='))

print(lst)

d = defaultdict(list)
for k, v in lst:
    d[k].append(v)

>>> d
defaultdict(<class 'list'>, {'ABC': ['1', '1', '1'], 'TDE': ['0', '0', '1'], 'TNLH': ['WL_CS', 
'WL_PS', 'RTC_RMN'], 'TKKJW': ['ZZR', 'ZZR'], 'MBTYIE': ['PRM', 'PRM'], 'MHGT': ['165', '211', 
'32', '9254', '1150', '41'], 'MRLL': ['CTM', 'CTM', 'CTM', 'PRM', 'PRM', 'CTM'], 'TTDDX': 
['0', '0', '0', '0'], 'ZDTR': ['FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE'], 'UEEM': 
['FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE'], 'KQTY': ['FALSE', 'FALSE', 'FALSE', 
'FALSE', 'FALSE', 'FALSE'], 'KKJW': ['ZZZN'], 'BTYIE': ['RTC']}) 

df = pd.DataFrame.from_dict(d)

>> ValueError: arrays must all be same length
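
The error is consistent with what the defaultdict printout shows: fields that appear once per block (such as ABC) collect fewer values than fields that appear in every sub-block (such as MHGT), so the lists cannot be used as equal-length columns. A minimal sketch of that check, using a made-up `pairs` list in place of the parsed file:

```python
from collections import defaultdict

# Hypothetical stand-in for the parsed (key, value) pairs of a single block
pairs = [("ABC", "1"), ("MHGT", "165"), ("MHGT", "211"), ("MHGT", "32")]

d = defaultdict(list)
for key, value in pairs:
    d[key].append(value)

# Count how many values each field collected
lengths = {key: len(values) for key, values in d.items()}
print(lengths)  # {'ABC': 1, 'MHGT': 3}
```

Because the block boundaries are lost when everything is appended to one list per key, the values need to be grouped per block before they can be lined up in rows.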

Try:

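import re
import pandas as pd

file = 'file.txt'
with open(file) as fh:
    f = fh.read().splitlines()

data = []   # records of the block currently being read
dfs = []    # one DataFrame per block
group = 1   # sub-block counter, incremented on every blank line
# Note: a block is only stored when the next SUCCESS line is reached,
# so the file has to end with a SUCCESS line (as the sample does).
for line in f:
    if line.endswith('SUCCESS'):
        print(f'============== {line} ================')
        if data:
            # One row per sub-block (group), one column per field name
            df = pd.DataFrame(data)
            df = pd.pivot(data=df, columns=['col'], values=['val'], index=['group']).reset_index(drop=True)
            df.columns = df.columns.droplevel()
            # Fill the fields that appear once per block down to the later rows
            df = df.ffill()
            dfs.append(df)

        data = []
        group = 1

    else:
        # Indented "KEY = VALUE" lines carry the data
        if re.match(r'[ \t]', line) and '=' in line:
            split_data = line.replace(' ', '').split('=')
            data.append({'group': group, 'col': split_data[0], 'val': split_data[1]})

        if not line.strip():
            group += 1


cols_order = ['ABC', 'TDE', 'TNLH', 'TKKJW', 'MBTYIE', 'MHGT', 'MRLL', 'TTDDX', 'ZDTR', 'UEEM', 'KQTY']
fina_df = pd.concat(dfs, ignore_index=True)
# Some blocks spell these fields KKJW / BTYIE; merge them into the main columns
fina_df['TKKJW'] = fina_df['TKKJW'].fillna(fina_df['KKJW'])
fina_df['MBTYIE'] = fina_df['MBTYIE'].fillna(fina_df['BTYIE'])
fina_df = fina_df[cols_order]
print(fina_df)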

Output:

col ABC TDE     TNLH TKKJW MBTYIE  MHGT MRLL TTDDX   ZDTR   UEEM   KQTY
0     1   0    WL_CS   ZZR    PRM   165  CTM     0  FALSE  FALSE  FALSE
1     1   0    WL_CS   ZZR    PRM   211  CTM     0  FALSE  FALSE  FALSE
2     1   0    WL_CS   ZZR    PRM    32  CTM     0  FALSE  FALSE  FALSE
3     1   0    WL_PS  ZZZN    PRM  9254  PRM   NaN  FALSE  FALSE  FALSE
4     1   1  RTC_RMN   ZZR    RTC  1150  PRM   NaN  FALSE  FALSE  FALSE
5     1   1  RTC_RMN   ZZR    RTC    41  CTM     0  FALSE  FALSE  FALSE
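
The same fill-down idea can also be sketched without the pivot step, by building one dict per sub-block and carrying earlier values forward within a block. This is only an illustration under a few assumptions: `parse_blocks` and `ALIASES` are hypothetical names, the aliases mirror the KKJW/BTYIE handling in the answer above, and the file layout is assumed to match the sample (a SUCCESS line starts a block, blank lines separate sub-blocks).

```python
import re

# Assumption: these short names are variants of the full field names,
# mirroring the fillna() calls in the answer above.
ALIASES = {"KKJW": "TKKJW", "BTYIE": "MBTYIE"}


def parse_blocks(lines):
    """Return one dict per sub-block, carrying earlier values forward within a block."""
    rows = []
    carried = {}  # last seen value of every field in the current block
    current = {}  # fields of the sub-block being read

    def flush():
        # Close the sub-block being read, if it holds any fields
        if current:
            carried.update(current)
            rows.append(dict(carried))
            current.clear()

    for line in lines:
        if line.rstrip().endswith("SUCCESS"):
            flush()
            carried.clear()  # a new block starts: forget the previous block's fields
        elif not line.strip():
            flush()  # a blank line ends the current sub-block
        elif re.match(r"[ \t]", line) and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            current[ALIASES.get(key, key)] = value
    flush()  # the file may end without a trailing blank line
    return rows
```

The resulting list can be passed to `pd.DataFrame(rows, columns=cols_order)` to get the same column order as above; fields that never occur in a block, such as TTDDX in the second block, simply come out as NaN.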