Splitting a flat file using a timestamp as the start of a new row instead of \n
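I have a flat file that I am trying to load into a dataframe with read_csv, and I want each timestamp, rather than each newline, to mark the start of a new row. I don't want to split on \n because when an error appears in the flat file it spans several lines, yet the whole error still begins with a single timestamp. I tried: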
df = pd.read_csv('myfile.file',header=None,sep='\d{4}-\d{2}-\d{2}')
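This does a couple of things I don't want. First, it removes the yyyy-mm-dd part entirely and leaves only the time (I need to keep the date). Second, it doesn't actually seem to split only at the timestamp; it still splits wherever it finds a \n.
Example:
Normal line: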
2022-05-16 hh:mm:ss here's a normal line in the file \n
Error line:
2022-05-16 hh:mm:ss here's an error \nhere's error details \nit's a new line even though it's the same error \nbut it just has one timestamp
Current output for a normal line:
hh:mm:ss here's a normal line in the file
Desired output for a normal line:
2022-05-16 hh:mm:ss here's a normal line in the file
Current output for an error line:
hh:mm:ss here's an error
here's error details
it's a new line even though it's the same error
but it just has one timestamp
Desired output for an error line:
2022-05-16 hh:mm:ss here's an error here's error details it's a new line even though it's the same error but it just has one timestamp
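One approach is to read the file in as-is and then fix it up. This approach finds all the rows that start with a timestamp, groups the rows by that timestamp, and then concatenates all the strings in each group.
Data: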
import io
import pandas as pd

str_io = io.StringIO(
'''2022-05-16 01:mm:ss here's a normal line in the file
2022-05-16 02:mm:ss here's an error
here's error details
it's a new line even though it's the same error
but it just has one timestamp
2022-05-16 03:mm:ss here's a another normal line in the file
2022-05-16 04:mm:ss here's a third normal line in the file'''
)
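Read the data into a single-column dataframe. (You will have to set sep= to suit your use-case. As of pandas 1.4.2, sep='\n' is no longer accepted, but you can pass a character that does not occur in the file, such as a back-tick, to read whole lines.)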
df = pd.read_csv(str_io, header=None)
df
0 2022-05-16 01:mm:ss here's a normal line in th...
1 2022-05-16 02:mm:ss here's an error
2 here's error details
3 it's a new line even though it's the same error
4 but it just has one timestamp
5 2022-05-16 03:mm:ss here's a another normal li...
6 2022-05-16 04:mm:ss here's a third normal line...
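The sample above contains no commas, so the default separator happens to leave each line whole. As a minimal sketch of the back-tick suggestion, assuming the back-tick never occurs in the file and using made-up sample text (the name raw is hypothetical), whole lines can be read like this:

```python
import csv
import io

import pandas as pd

# Made-up sample; the second line contains a comma, which the default
# sep=',' would treat as a field separator.
raw = (
    "2022-05-16 01:00:00 here's a normal line\n"
    "2022-05-16 02:00:00 here's an error, with a comma\n"
    "here's error details\n"
)

# Assumption: the back-tick never appears in the file, so every line
# ends up in column 0. QUOTE_NONE stops double quotes in the log text
# from being interpreted as CSV quoting.
lines = pd.read_csv(
    io.StringIO(raw), header=None, sep="`", quoting=csv.QUOTE_NONE
)
```

Note that read_csv skips blank lines by default, so empty lines inside an error message would be dropped with this approach.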
Find the rows that start with a timestamp (you will want to come up with something better suited to your use-case than "starts with '2'"):
ts_rows = df[0].str.startswith('2')
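A stricter test is possible with a regular expression. The sketch below assumes real timestamps of the form YYYY-MM-DD HH:MM:SS (the sample data in this answer uses mm:ss placeholders, which would not match); the names lines, TS_PATTERN and ts_mask are illustrative only:

```python
import pandas as pd

lines = pd.Series([
    "2022-05-16 01:00:00 here's a normal line",
    "2022-05-16 02:00:00 here's an error",
    "2 retries failed",  # starts with '2' but is not a timestamp
    "but it just has one timestamp",
])

# Hypothetical pattern for 'YYYY-MM-DD HH:MM:SS'; adjust to the real format.
TS_PATTERN = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# str.match only matches at the start of each string.
ts_mask = lines.str.match(TS_PATTERN)
```

Here the third line would be misclassified by startswith('2') but is correctly treated as a continuation line by the pattern.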
Then accumulate the text of the error lines and append it to the preceding timestamp row:
df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
.apply(lambda x: x[0].str.cat(sep=' ')).reset_index(drop=True)
0 2022-05-16 01:mm:ss here's a normal line in th...
1 2022-05-16 02:mm:ss here's an error here's er...
2 2022-05-16 03:mm:ss here's a another normal li...
3 2022-05-16 04:mm:ss here's a third normal line...
This shows that the error lines have been accumulated:
df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
.apply(lambda x: x[0].str.cat(sep=' ')).iloc[1]
"2022-05-16 02:mm:ss here's an error here's error details it's a new line even though it's the same error but it just has one timestamp"
You can then split these rows into columns such as ts and message, or whatever your particular application requires.
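One possible way to do that split is Series.str.extract with named groups. This is only a sketch under the assumption that every merged row starts with a real 19-character YYYY-MM-DD HH:MM:SS timestamp followed by one space; merged and parts are hypothetical names:

```python
import pandas as pd

merged = pd.Series([
    "2022-05-16 01:00:00 here's a normal line in the file",
    "2022-05-16 02:00:00 here's an error here's error details",
])

# Named groups become the column names of the resulting dataframe.
parts = merged.str.extract(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)

# Convert the timestamp text into real datetime values.
parts["ts"] = pd.to_datetime(parts["ts"], format="%Y-%m-%d %H:%M:%S")
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed records easy to spot.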
How it works
First, the timestamps are extracted from the rows that contain them.
df[ts_rows][0].str.slice(0,19)
0 2022-05-16 01:mm:ss
1 2022-05-16 02:mm:ss
5 2022-05-16 03:mm:ss
6 2022-05-16 04:mm:ss
The following assigns them to a new column named ts:
df.assign(ts=df[ts_rows][0].str.slice(0,19))
It is equivalent to the following (but assign lets you do the assignment on the fly, without modifying df).
df['ts'] = df[ts_rows][0].str.slice(0,19)
So the dataframe now looks like:
0 ts
0 2022-05-16 01:mm:ss here's a normal line in th... 2022-05-16 01:mm:ss
1 2022-05-16 02:mm:ss here's an error 2022-05-16 02:mm:ss
2 here's error details NaN
3 it's a new line even though it's the same error NaN
4 but it just has one timestamp NaN
5 2022-05-16 03:mm:ss here's a another normal li... 2022-05-16 03:mm:ss
6 2022-05-16 04:mm:ss here's a third normal line... 2022-05-16 04:mm:ss
The next step is to forward-fill the timestamps so that each error line is associated with its timestamp:
df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill()
0 ts
0 2022-05-16 01:mm:ss here's a normal line in th... 2022-05-16 01:mm:ss
1 2022-05-16 02:mm:ss here's an error 2022-05-16 02:mm:ss
2 here's error details 2022-05-16 02:mm:ss
3 it's a new line even though it's the same error 2022-05-16 02:mm:ss
4 but it just has one timestamp 2022-05-16 02:mm:ss
5 2022-05-16 03:mm:ss here's a another normal li... 2022-05-16 03:mm:ss
6 2022-05-16 04:mm:ss here's a third normal line... 2022-05-16 04:mm:ss
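The two steps above (index-aligned assignment, then forward fill) can be seen in isolation on a toy frame. This is a minimal sketch with made-up data; demo, is_start, before and filled are hypothetical names:

```python
import pandas as pd

demo = pd.DataFrame({0: ["ts1 first", "detail a", "detail b", "ts2 second"]})
is_start = demo[0].str.startswith("ts")

# Only rows where is_start is True supply a value; assignment aligns on
# the index, so the remaining rows are left as NaN.
demo["ts"] = demo.loc[is_start, 0].str.slice(0, 3)
before = demo["ts"].isna().tolist()

# ffill copies the last seen key down into the continuation rows.
filled = demo.ffill()
```

After the fill, every continuation row carries the key of the record it belongs to, which is what makes the later groupby possible.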
Now that we know the timestamp for each error line, we can group the error strings and concatenate them using .str.cat().
But first, this is how .str.cat() works:
errs = pd.Series(['err1','err2','err3'])
errs
0 err1
1 err2
2 err3
errs.str.cat(sep=' ')
'err1 err2 err3'
So .groupby('ts').apply(lambda x: x[0].str.cat(sep=' ')) does this for all the rows that share a timestamp:
df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
.apply(lambda x: x[0].str.cat(sep=' '))
ts
2022-05-16 01:mm:ss 2022-05-16 01:mm:ss here's a normal line in th...
2022-05-16 02:mm:ss 2022-05-16 02:mm:ss here's an error here's er...
2022-05-16 03:mm:ss 2022-05-16 03:mm:ss here's a another normal li...
2022-05-16 04:mm:ss 2022-05-16 04:mm:ss here's a third normal line...
This leaves the timestamp as the index, which I don't think you need, so .reset_index(drop=True) gets rid of it:
df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
.apply(lambda x: x[0].str.cat(sep=' ')).reset_index(drop=True)
0 2022-05-16 01:mm:ss here's a normal line in th...
1 2022-05-16 02:mm:ss here's an error here's er...
2 2022-05-16 03:mm:ss here's a another normal li...
3 2022-05-16 04:mm:ss here's a third normal line...
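One caveat: grouping on the timestamp text merges two separate records that happen to share the same timestamp. A variation, not part of the approach above, is to group on a running record counter instead. The sketch below assumes real YYYY-MM-DD HH:MM:SS timestamps and uses hypothetical names (lines, is_start, record_id, records):

```python
import pandas as pd

lines = pd.Series([
    "2022-05-16 01:00:00 first error",
    "details of the first error",
    "2022-05-16 01:00:00 second error, same second",
    "details of the second error",
])

is_start = lines.str.match(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

# Every True starts a new record, so the running total of the mask
# gives each record its own id even when timestamps repeat.
record_id = is_start.cumsum()

# Join the lines of each record with a space, in their original order.
records = lines.groupby(record_id).agg(" ".join).reset_index(drop=True)
```

With this grouping the two errors stay separate even though they were logged in the same second.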