Get specific data from txt file to pandas dataframe
I have data like this in a txt file:
Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT

Wed Mar 23 16:59:27 GMT 2022
1 State
1 ESTAB
10 FIN-WAIT

Wed Mar 23 16:59:28 GMT 2022
1 State
1 CLOSE-WAIT
102 ESTAB
I would like to get a pandas dataframe that looks like this:
timestamp | State | ESTAB | FIN-WAIT | CLOSE-WAIT
Wed Mar 23 16:59:25 GMT 2022 | 1 | 1 | 0 | 0
Wed Mar 23 16:59:26 GMT 2022 | 1 | 1 | 0 | 1
Wed Mar 23 16:59:27 GMT 2022 | 1 | 1 | 10 | 0
Wed Mar 23 16:59:28 GMT 2022 | 1 | 102 | 0 | 1
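That means the string in the first line of every paragraph should be used for the first column, timestamp. The other columns should be filled with the number that precedes the matching string. The next row starts after a paragraph.
How can I do this with pandas?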
Try:
import pandas as pd

# Read the text file into a single-column DataFrame (blank lines become NaN rows)
df = pd.read_csv("data.txt", header=None, skip_blank_lines=False)
# Extract possible column names
df["Column"] = df[0].str.extract("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)")
# Remove the column names from the data
df[0] = df[0].str.replace("(State|ESTAB|FIN-WAIT|CLOSE-WAIT)", "", regex=True)
# Drop the blank rows; rows without a matched name are the timestamp lines
df = df.dropna(how="all").fillna("timestamp")
# Number the blocks: each timestamp line starts a new one
df["Index"] = df["Column"].eq("timestamp").cumsum()
# Pivot the data to match expected output structure
# (keyword arguments, since pandas 2.0 no longer accepts them positionally)
output = df.pivot(index="Index", columns="Column", values=0)
# Re-format columns as needed
output = output.set_index("timestamp").astype(float).fillna(0).astype(int).reset_index()
>>> output
Column timestamp CLOSE-WAIT ESTAB FIN-WAIT State
0 Wed Mar 23 16:59:25 GMT 2022 0 1 0 1
1 Wed Mar 23 16:59:26 GMT 2022 1 1 0 1
2 Wed Mar 23 16:59:27 GMT 2022 0 1 10 1
3 Wed Mar 23 16:59:28 GMT 2022 1 102 0 1
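The `eq("timestamp").cumsum()` step is what assigns every line to its block. A minimal sketch of that idea on a hand-made Series (the labels below are illustrative and not read from the file):

```python
import pandas as pd

# Hypothetical labels: a "timestamp" marker followed by the rows of that block.
labels = pd.Series(["timestamp", "State", "ESTAB",
                    "timestamp", "State", "ESTAB", "CLOSE-WAIT"])

# eq() is True at every block start; cumsum() turns those flags into a
# running block number, so all rows of one block share the same id.
block_id = labels.eq("timestamp").cumsum()
print(block_id.tolist())
```

Because each block gets a distinct id, it can serve as the row index when pivoting.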
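First you can process the txt file into a list of lists. Each inner list holds the lines of one chunk; the outer list holds the different chunks:
import pandas as pd

with open('data.txt', 'r') as f:
    res = f.read()

# Split on blank lines to get the chunks, then split each chunk into stripped lines
records = [list(map(str.strip, line.strip().split('\n'))) for line in res.split('\n\n')]
print(records)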
[['Wed Mar 23 16:59:25 GMT 2022', '1 State', '1 ESTAB'], ['Wed Mar 23 16:59:26 GMT 2022', '1 State', '1 ESTAB', '1 CLOSE-WAIT'], ['Wed Mar 23 16:59:27 GMT 2022', '1 State', '1 ESTAB', '10 FIN-WAIT'], ['Wed Mar 23 16:59:28 GMT 2022', '1 State', '1 CLOSE-WAIT', '102 ESTAB']]
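Then you can turn the list of lists into a list of dictionaries by manually defining each key and value:
l = []
for record in records:
    d = {}
    # The first line of each chunk is the timestamp
    d['timestamp'] = record[0]
    # Every remaining line is "<count> <name>"
    for r in record[1:]:
        key = r.split(' ')[1]
        value = r.split(' ')[0]
        d[key] = value
    l.append(d)
print(l)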
[{'timestamp': 'Wed Mar 23 16:59:25 GMT 2022', 'State': '1', 'ESTAB': '1'}, {'timestamp': 'Wed Mar 23 16:59:26 GMT 2022', 'State': '1', 'ESTAB': '1', 'CLOSE-WAIT': '1'}, {'timestamp': 'Wed Mar 23 16:59:27 GMT 2022', 'State': '1', 'ESTAB': '1', 'FIN-WAIT': '10'}, {'timestamp': 'Wed Mar 23 16:59:28 GMT 2022', 'State': '1', 'CLOSE-WAIT': '1', 'ESTAB': '102'}]
Finally you can feed this list of dictionaries into a DataFrame and fill the NaN cells:
df = pd.DataFrame(l).fillna(0)
print(df)
timestamp State ESTAB CLOSE-WAIT FIN-WAIT
0 Wed Mar 23 16:59:25 GMT 2022 1 1 0 0
1 Wed Mar 23 16:59:26 GMT 2022 1 1 1 0
2 Wed Mar 23 16:59:27 GMT 2022 1 1 0 10
3 Wed Mar 23 16:59:28 GMT 2022 1 102 1 0
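Note that the counts in this frame are still the strings read from the file; only the filled-in zeros are numbers. Below is a minimal sketch of the same parsing that converts the counts to integers. The function name `parse_blocks` and the inline sample text are illustrative assumptions, not part of the original answer:

```python
import re
import pandas as pd

def parse_blocks(text):
    """Turn blank-line-separated blocks into one DataFrame row per block."""
    rows = []
    # Split on blank lines, tolerating stray whitespace on them.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        row = {"timestamp": lines[0]}
        for ln in lines[1:]:
            count, name = ln.split(maxsplit=1)  # "<count> <name>"
            row[name] = int(count)
        rows.append(row)
    # States missing from a block become NaN; fill with 0 and cast to int.
    df = pd.DataFrame(rows).fillna(0)
    count_cols = [c for c in df.columns if c != "timestamp"]
    return df.astype({c: int for c in count_cols})

# Hypothetical sample: the first two blocks of the data in the question.
sample = """Wed Mar 23 16:59:25 GMT 2022
1 State
1 ESTAB

Wed Mar 23 16:59:26 GMT 2022
1 State
1 ESTAB
1 CLOSE-WAIT
"""
df = parse_blocks(sample)
print(df)
```

With real data, the text would presumably come from `open('data.txt').read()` instead of the inline sample.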