Get specific data from txt file to pandas dataframe

I have data like this in a txt file:

Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
      1 State
      1 ESTAB
      1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
      1 State
      1 ESTAB
      10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB

I want a pandas DataFrame that looks like this:

timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1

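That is, the string on the first line of each paragraph should go into the first column, timestamp. Each of the other columns should be filled with the number that precedes the matching string, or 0 if that string does not appear in the paragraph. A new row starts after each paragraph.

How can I do this with pandas?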

Try:

import pandas as pd

#read text file to a DataFrame (no commas in the file, so each line lands in column 0)
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)

#Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")

#Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)","",regex=True)

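#Drop the blank separator lines; rows with no matched name are the timestamp lines
df = df.dropna(how="all").fillna("timestamp")
#Number the paragraphs: the counter increases at every timestamp line
df["Index"] = df["Column"].eq("timestamp").cumsum()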

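#Pivot the data to match expected output structure
#(keyword arguments are required by DataFrame.pivot in pandas 2.0 and later)
output = df.pivot(index="Index", columns="Column", values=0)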

#Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()

>>> output
Column                     timestamp  CLOSE-WAIT  ESTAB  FIN-WAIT  State
0       Wed Mar 23 16:59:25 GMT 2022           0      1         0      1
1       Wed Mar 23 16:59:26 GMT 2022           1      1         0      1
2       Wed Mar 23 16:59:27 GMT 2022           0      1        10      1
3       Wed Mar 23 16:59:28 GMT 2022           1    102         0      1
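
The pattern above hard-codes the four state names. If other names can show up in the file, something like the following might be more flexible. It is only a sketch, and it assumes that every count line has the form "<digits> <name>" with no spaces in the name, that any other non-blank line is a timestamp, and that timestamps are unique. The variable names (text, parts, counts, out) are made up for illustration, and the sample text is copied from the question rather than read from data.txt.

```python
import pandas as pd

# Sample text copied from the question; in practice it would be read from data.txt.
text = """Wed Mar 23 16:59:25 GMT 2022
      1 State
      1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
      1 State
      1 ESTAB
      1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
      1 State
      1 ESTAB
      10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
      1 State
      1 CLOSE-WAIT
      102 ESTAB
"""

lines = pd.Series(text.splitlines())

# Assumption: every count line is "<digits> <name>" with no spaces in the name.
parts = lines.str.extract(r"^\s*(?P<count>\d+)\s+(?P<name>\S+)\s*$")

# Any non-blank line that is not a count line is treated as a timestamp.
is_timestamp = parts["name"].isna() & lines.str.strip().ne("")

# Carry each timestamp down onto the count lines that follow it.
parts["timestamp"] = lines.where(is_timestamp).ffill()

# Keep only the count lines and turn the counts into integers.
counts = parts.dropna(subset=["name"]).astype({"count": int})

# Assumption: timestamps are unique; pivot raises an error on duplicates.
out = (
    counts.pivot(index="timestamp", columns="name", values="count")
    .fillna(0)
    .astype(int)
    .reset_index()
)
out.columns.name = None
print(out)
```

Note that pivot sorts the rows by the timestamp string and the columns alphabetically, so the column order differs from the one in the question; reorder with out[[...]] if that matters.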

First, you can process the txt file into a list of lists. Each inner list holds the lines of one chunk, and the outer list holds the different chunks:

import pandas as pd

with open('data.txt', 'r') as f:
    res = f.read()

records = [list(map(str.strip, line.strip().split('\n'))) for line in res.split('\n\n')]
print(records)

[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
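
Splitting on exactly '\n\n' can produce an empty chunk if the file ends with extra blank lines, and it misses separator lines that contain only spaces. A possible variation, offered as a sketch rather than a drop-in replacement, is to strip the text first and split with a regular expression. The short text below is a made-up sample in the same layout as the question's file.

```python
import re

# Hypothetical sample: a whitespace-only separator line and trailing blank lines.
text = (
    "Wed Mar 23 16:59:25 GMT 2022\n"
    "      1 State\n"
    "      1 ESTAB\n"
    "   \n"
    "Wed Mar 23 16:59:26 GMT 2022\n"
    "      1 State\n"
    "\n\n"
)

# Strip leading/trailing whitespace, then split wherever a blank
# (or whitespace-only) line separates two chunks.
chunks = re.split(r"\n\s*\n", text.strip())

# Strip each line inside a chunk, as in the list comprehension above.
records = [[line.strip() for line in chunk.splitlines()] for chunk in chunks]
print(records)
```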

Then you can convert the list of lists into a list of dictionaries by defining each key and value manually:

l = []
for record in records:
    d = {}
    d['timestamp'] = record[0]
    for r in record[1:]:
        # Each remaining line looks like "<count> <name>"
        value, key = r.split(maxsplit=1)
        d[key] = value

    l.append(d)
print(l)

[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]

Finally, you can feed this list of dictionaries into a DataFrame and fill the NaN cells:

df = pd.DataFrame(l).fillna(0)
print(df)

                      timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0  Wed Mar 23 16:59:25 GMT 2022     1     1          0        0
1  Wed Mar 23 16:59:26 GMT 2022     1     1          1        0
2  Wed Mar 23 16:59:27 GMT 2022     1     1          0       10
3  Wed Mar 23 16:59:28 GMT 2022     1   102          1        0
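
One thing to be aware of: the counts above are still strings ('1', '102'), and fillna(0) adds integer zeros next to them, so the count columns end up with object dtype. If numeric columns are needed, a conversion along these lines should work. This is a sketch; the list l is copied from the printed output earlier in this answer, and count_cols is just an illustrative name.

```python
import pandas as pd

# List of dictionaries as printed earlier in this answer.
l = [
    {'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'},
    {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'},
    {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'},
    {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'},
]

df = pd.DataFrame(l)

# Every column except timestamp holds counts.
count_cols = df.columns.drop('timestamp')

# Parse the strings as numbers, treat missing states as 0, store as integers.
df[count_cols] = df[count_cols].apply(pd.to_numeric).fillna(0).astype(int)
print(df.dtypes)
```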