Unable to separate data with ',' separator while importing text file
My data has a repeating pattern:
2021-11-17 10:59:10.880
SysState: 4, Events: 161403, EMS: 4, VDB: 2, TubeState: 0x02
BDR Mode: 1, BMS Ext: 2, BMS Int: 0, BdrStat: 00
CPU(%):16, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3330mV 3333mV 3332mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3325mV 3328mV 0 0
3325mV 3321mV 3328mV 3328mV 3327mV 0 0
3329mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53241, BPC:0, PLC:0
AMBI:421, CONN:278, FETS:282, BMSC:274, BPA1:259, BPA2:237, BPA3:255
2021-11-17 10:59:13.80
SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08
BDR Mode: 4, BMS Ext: 3, BMS Int: 1, BdrStat: 00
CPU(%):12, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3332mV 3331mV 3332mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3326mV 3328mV 0 0
3324mV 3321mV 3328mV 3328mV 3327mV 0 0
3329mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53277, BPC:23, PLC:0
AMBI:421, CONN:278, FETS:282, BMSC:276, BPA1:259, BPA2:237, BPA3:255
2021-11-17 10:59:15.280
SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08
BDR Mode: 4, BMS Ext: 3, BMS Int: 1, BdrStat: 00
CPU(%):11, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0
SOC:9973, SOH:100, LV?:0, HV?:0
3331mV 3332mV 3331mV 3332mV 3331mV 0 0
3331mV 3324mV 3325mV 3325mV 3328mV 0 0
3324mV 3322mV 3328mV 3328mV 3327mV 0 0
3331mV 0mV 0mV 0mV 0mV 0 0
BPV:53288, PLV:53259, BPC:47, PLC:47
AMBI:421, CONN:278, FETS:282, BMSC:276, BPA1:259, BPA2:237, BPA3:255
What I want to do is separate every value, so that each block from '2021-11-17 10:59:10.880' to 'BPA3:255' becomes one row:
Index | Another header | Another header | Another header |
---|---|---|---|
0 | 2021-11-17 10:59:10.880 | SysState: 4 | Events: 161403 |
1 | 2021-11-17 10:59:13.80 | SysState: 4 | Events: 161407 |
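and so on.
What I have done so far:
The file was a .txt file, so I converted it to csv first and then ran:
df = pd.read_csv('data.csv', sep=',')
But it gave me ParserError: Error tokenizing data. Does anyone know how to fix this? Using sep=';', or just changing the text file to csv, gives the following output: [image]
Is there a way to solve this by parsing the text file directly instead of converting it to csv?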
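Assuming you only need the date, SysState, and Events, a simple approach is to extract the information with a regular expression.
I am also assuming the file is not very large, so I load all of it into memory. If that is not the case, you will have to parse it line by line.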
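import re
import pandas as pd

# Read the whole log into memory (assumes the file is small)
with open('filename.csv') as f:
    lines = f.read()

# Capture the timestamp line, then SysState and Events from the line after it
regex = re.compile(r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)\nSysState: (\d+),\s+Events: (\d+)')
df = pd.DataFrame(regex.findall(lines), columns=['datetime', 'SysState', 'Events'])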
Note: I extract only the numbers from the fields, but if you really want 'SysState: 4' etc., it is easy to move the label inside the capture group.
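As a rough illustration of that note, the sketch below widens each capture group so that it spans the label as well as the number. The `sample` string is just two shortened records in the same layout as the log, and the names `labelled` and `df_labelled` are made up for this example.

```python
import re
import pandas as pd

# Two shortened records in the same layout as the log (illustration only)
sample = (
    "2021-11-17 10:59:10.880\n"
    "SysState: 4, Events: 161403, EMS: 4, VDB: 2, TubeState: 0x02\n"
    "2021-11-17 10:59:13.80\n"
    "SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08\n"
)

# Same idea as the pattern above, but groups 2 and 3 now include the label text
labelled = re.compile(
    r'(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+)\n(SysState: \d+),\s+(Events: \d+)'
)
df_labelled = pd.DataFrame(
    labelled.findall(sample), columns=['datetime', 'SysState', 'Events']
)
print(df_labelled)
```

Every cell is still a plain string here, so the labelled form is mostly useful for display; the numeric-only groups are easier to convert later.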
Output:
                  datetime  SysState  Events
0  2021-11-17 10:59:10.880         4  161403
1   2021-11-17 10:59:13.80         4  161407
2  2021-11-17 10:59:15.280         4  161407
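If the file is too large to read in one go, or if you want every field rather than just three, one possible line-by-line variant is sketched below. It assumes that a line consisting only of a timestamp starts a new record and that labelled fields always look like `key: value` separated by commas. The helper `iter_records` and the `row_N` column names for the unlabelled voltage lines are invented for this example; they do not come from the log or from pandas.

```python
import io
import re
import pandas as pd

TIMESTAMP = re.compile(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d+')
KEY_VALUE = re.compile(r'([^:,]+):\s*([^,]+)')

def iter_records(line_iter):
    """Yield one dict per record; a timestamp-only line starts a new record."""
    record = None
    for raw in line_iter:
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.fullmatch(line):
            if record is not None:
                yield record
            record = {'datetime': line}
        elif record is None:
            continue  # skip anything that precedes the first timestamp
        elif ':' in line:
            # Labelled line: split it into key/value pairs
            for key, value in KEY_VALUE.findall(line):
                record[key.strip()] = value.strip()
        else:
            # Unlabelled line such as "3330mV 3333mV ... 0 0": keep the raw
            # text under a made-up column name (row_1, row_2, ...)
            n_rows = sum(k.startswith('row_') for k in record)
            record[f'row_{n_rows + 1}'] = line
    if record is not None:
        yield record

# Shortened sample; with a real file, pass the open file object instead
sample = io.StringIO(
    "2021-11-17 10:59:10.880\n"
    "SysState: 4, Events: 161403, EMS: 4, VDB: 2, TubeState: 0x02\n"
    "CPU(%):16, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0\n"
    "3330mV 3333mV 3332mV 3332mV 3331mV 0 0\n"
    "BPV:53288, PLV:53241, BPC:0, PLC:0\n"
    "2021-11-17 10:59:13.80\n"
    "SysState: 4, Events: 161407, EMS: 4, VDB: 3, TubeState: 0x08\n"
    "CPU(%):12, CPUmax(%):47, task idx:3, CPUmaxIRQ(%):0\n"
    "3332mV 3331mV 3332mV 3332mV 3331mV 0 0\n"
    "BPV:53288, PLV:53277, BPC:23, PLC:0\n"
)
df_all = pd.DataFrame(list(iter_records(sample)))
print(df_all.columns.tolist())
```

Only one record is held in memory while reading, and the original .txt file can be passed to `iter_records` directly without converting it to csv. All values come back as strings, so any numeric conversion is left to a later step.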