Splitting a flat file using a timestamp as the row delimiter instead of \n

Question

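I have a flat file that I'm trying to load into a DataFrame with read_csv, and I want each timestamp, rather than each newline, to mark the start of a new row. I don't want to split on \n because when an error appears in the flat file it spans multiple lines, yet the whole error still begins with a single timestamp. I tried: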

df = pd.read_csv('myfile.file',header=None,sep='\d{4}-\d{2}-\d{2}')

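This did two things I didn't want. First, it removed the yyyy-mm-dd part completely and left only the time portion (I need to keep it). Second, it doesn't seem to split only at the timestamp; it still splits wherever it finds a \n.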

Example of a normal line:

2022-05-16 hh:mm:ss here's a normal line in the file \n

Error line:

2022-05-16 hh:mm:ss here's an error \nhere's error details \nit's a new line even though it's the same error \nbut it just has one timestamp

Current output for the normal line:

hh:mm:ss here's a normal line in the file

Desired output for the normal line:

2022-05-16 hh:mm:ss here's a normal line in the file

Current output for the error line:

hh:mm:ss here's an error
here's error details
it's a new line even though it's the same error
but it just has one timestamp

Desired output for the error line:

2022-05-16 hh:mm:ss here's an error here's error details it's a new line even though it's the same error but it just has one timestamp

Answer 1

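One approach is to read the file in line by line and then fix it up. This approach finds all the rows that start with a timestamp, groups by those timestamps, and then concatenates all the strings in each group.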

Data:

import io
import pandas as pd

str_io = io.StringIO(
'''2022-05-16 01:mm:ss here's a normal line in the file
2022-05-16 02:mm:ss here's an error
 here's error details
 it's a new line even though it's the same error
 but it just has one timestamp
2022-05-16 03:mm:ss here's a another normal line in the file
2022-05-16 04:mm:ss here's a third normal line in the file'''
)

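(You will have to set sep= for your use case. As of pandas 1.4.2 you can no longer use sep='\n'. Instead, you can pass a character that does not occur anywhere in the file, such as a back-tick, so that each whole line is read into a single column.)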
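As a minimal sketch of the back-tick idea, the snippet below reads lines that contain commas into a single column. The sample text and the variable names `raw` and `lines` are made up for illustration, and it assumes that no back-tick ever appears in the real file.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the commas would split these lines under the default sep=','.
raw = (
    "2022-05-16 01:00:00 normal line, with a comma\n"
    "2022-05-16 02:00:00 an error\n"
    " error details, also with a comma\n"
)

# A separator that never occurs in the data keeps every line in column 0.
# quoting=csv.QUOTE_NONE treats double quotes as ordinary text.
lines = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep="`",
    quoting=csv.QUOTE_NONE,
)
print(lines.shape)
```

Note that read_csv skips blank lines by default, so empty lines inside an error block would be dropped.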

df = pd.read_csv(str_io, header=None)
df

0  2022-05-16 01:mm:ss here's a normal line in th...
1                2022-05-16 02:mm:ss here's an error
2                               here's error details
3    it's a new line even though it's the same error
4                      but it just has one timestamp
5  2022-05-16 03:mm:ss here's a another normal li...
6  2022-05-16 04:mm:ss here's a third normal line...

Find the rows that start with a timestamp (you will want to come up with something better suited to your use case than "starts with '2'"):

ts_rows = df[0].str.startswith('2')
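One possible stricter test, sketched here, is a regular expression anchored at the start of each line via Series.str.match. The pattern assumes real numeric timestamps of the form YYYY-MM-DD HH:MM:SS (the hh:mm:ss placeholders in the sample data would not match), and the sample lines are invented for illustration.

```python
import pandas as pd

# Hypothetical lines; the last one starts with '2' but is not a timestamp.
lines = pd.Series([
    "2022-05-16 01:00:00 here's a normal line",
    "2022-05-16 02:00:00 here's an error",
    " here's error details",
    "2 retries failed",
])

# str.match only matches at the beginning of each string.
ts_pattern = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
ts_rows = lines.str.match(ts_pattern)

# For comparison, the simpler test misclassifies the last line.
naive_rows = lines.str.startswith("2")
print(ts_rows.tolist())
print(naive_rows.tolist())
```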

Then accumulate the text of the error lines and append it to the preceding timestamp row:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' ')).reset_index(drop=True)

0    2022-05-16 01:mm:ss here's a normal line in th...
1    2022-05-16 02:mm:ss here's an error  here's er...
2    2022-05-16 03:mm:ss here's a another normal li...
3    2022-05-16 04:mm:ss here's a third normal line...

This shows that the error lines have been accumulated:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' '))[1]
    
"2022-05-16 02:mm:ss here's an error  here's error details  it's a new line even though it's the same error  but it just has one timestamp"

You can then split these rows into columns such as ts and message, or whatever your particular application needs.
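One way to do that split, as a rough sketch: Series.str.extract with named groups yields one column per group. The column names ts and message follow the sentence above; the `joined` series and its numeric timestamps are assumptions made for the example.

```python
import pandas as pd

# Hypothetical result of the grouping step, with real numeric timestamps.
joined = pd.Series([
    "2022-05-16 01:00:00 here's a normal line in the file",
    "2022-05-16 02:00:00 here's an error  here's error details",
])

# Each named group becomes a column of the resulting DataFrame.
parts = joined.str.extract(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)

# Optionally convert the timestamp text to a datetime column.
parts["ts"] = pd.to_datetime(parts["ts"])
print(parts)
```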

How it works

First, the timestamps are extracted from the rows that contain them.

df[ts_rows][0].str.slice(0,19)

0    2022-05-16 01:mm:ss
1    2022-05-16 02:mm:ss
5    2022-05-16 03:mm:ss
6    2022-05-16 04:mm:ss

The following assigns that to a new column called ts:

df.assign(ts=df[ts_rows][0].str.slice(0,19))

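It has the same effect as the following, except that assign returns a new DataFrame instead of modifying df in place, which lets you do the assignment on the fly inside a method chain.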

df['ts'] = df[ts_rows][0].str.slice(0,19)

So the data frame now looks like:

                                                   0                   ts
0  2022-05-16 01:mm:ss here's a normal line in th...  2022-05-16 01:mm:ss
1                2022-05-16 02:mm:ss here's an error  2022-05-16 02:mm:ss
2                               here's error details                  NaN
3    it's a new line even though it's the same error                  NaN
4                      but it just has one timestamp                  NaN
5  2022-05-16 03:mm:ss here's a another normal li...  2022-05-16 03:mm:ss
6  2022-05-16 04:mm:ss here's a third normal line...  2022-05-16 04:mm:ss

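The next step is to forward-fill the timestamps so that each error line is associated with the timestamp of the row that started it: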

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill()

                                                   0                   ts
0  2022-05-16 01:mm:ss here's a normal line in th...  2022-05-16 01:mm:ss
1                2022-05-16 02:mm:ss here's an error  2022-05-16 02:mm:ss
2                               here's error details  2022-05-16 02:mm:ss
3    it's a new line even though it's the same error  2022-05-16 02:mm:ss
4                      but it just has one timestamp  2022-05-16 02:mm:ss
5  2022-05-16 03:mm:ss here's a another normal li...  2022-05-16 03:mm:ss
6  2022-05-16 04:mm:ss here's a third normal line...  2022-05-16 04:mm:ss

Now that we know the timestamp for each error line, we can group the error strings and concatenate them using .str.cat().

But first, here is how .str.cat() works:

errs = pd.Series(['err1','err2','err3'])
errs

0    err1
1    err2
2    err3

errs.str.cat(sep=' ')

'err1 err2 err3'

So .groupby('ts').apply(lambda x: x[0].str.cat(sep=' ')) does this for all the rows that share a timestamp:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' '))
    
ts
2022-05-16 01:mm:ss    2022-05-16 01:mm:ss here's a normal line in th...
2022-05-16 02:mm:ss    2022-05-16 02:mm:ss here's an error  here's er...
2022-05-16 03:mm:ss    2022-05-16 03:mm:ss here's a another normal li...
2022-05-16 04:mm:ss    2022-05-16 04:mm:ss here's a third normal line...

This leaves the timestamp as the index, which I don't think you need, so .reset_index(drop=True) gets rid of it:

df.assign(ts=df[ts_rows][0].str.slice(0,19)).ffill().groupby('ts') \
    .apply(lambda x: x[0].str.cat(sep=' ')).reset_index(drop=True)
    
0    2022-05-16 01:mm:ss here's a normal line in th...
1    2022-05-16 02:mm:ss here's an error  here's er...
2    2022-05-16 03:mm:ss here's a another normal li...
3    2022-05-16 04:mm:ss here's a third normal line...
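One caveat with grouping on the timestamp text is that two separate records logged within the same second would be merged into one row. A possible variation, which is not part of the approach above, is to group on a running count of timestamp rows instead. The sketch below assumes numeric timestamps and a file without back-ticks; `raw`, `is_start`, `record_id` and `records` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical sample in which two records share the same timestamp.
raw = (
    "2022-05-16 01:00:00 first\n"
    "2022-05-16 01:00:00 second\n"
    " continuation\n"
    "2022-05-16 01:00:01 third\n"
)
df = pd.read_csv(io.StringIO(raw), header=None, sep="`")

# True on rows that begin a new record.
is_start = df[0].str.match(r"\d{4}-\d{2}-\d{2} ")

# The running total increases by one at every record start, so each
# continuation line inherits the id of the record it belongs to.
record_id = is_start.cumsum()

# Strip stray whitespace, then join the lines of each record.
records = (
    df[0].str.strip()
    .groupby(record_id)
    .agg(" ".join)
    .reset_index(drop=True)
)
print(records.tolist())
```

Because the group key is a counter rather than the timestamp itself, no ffill step is needed and duplicate timestamps stay separate.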

python

regex

pandas