Create DataFrame from raw input

Question

I have the following data:

$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021

In $0011:0524-08-2021, $ marks the start of the string, 001 is the ID, 1:05 is the time, and 24-08-2021 is the date. &0011:0624-08-2021 reads the same way, except that & marks the end of the string.
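The record layout described above can be sketched as a small parser. This is only an illustration under assumptions the question does not state outright: the ID is taken to be always three digits, the time H:MM or HH:MM, and the date DD-MM-YYYY. The names `RECORD_RE` and `parse_record` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: marker ($ or &), 3-digit ID, H:MM or HH:MM time, DD-MM-YYYY date.
RECORD_RE = re.compile(r"^([$&])(\d{3})(\d{1,2}:\d{2})(\d{2}-\d{2}-\d{4})$")


def parse_record(line):
    """Split one raw record into (marker, id, timestamp). Hypothetical helper."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised record: {line!r}")
    marker, id_, time_, date_ = match.groups()
    # Combine date and time so records can be compared chronologically.
    timestamp = datetime.strptime(f"{date_} {time_}", "%d-%m-%Y %H:%M")
    return marker, id_, timestamp


print(parse_record("$0011:0524-08-2021"))
```

Parsing the time and date into a single `datetime` makes the "increasing time" requirement a plain comparison rather than a string comparison.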

From the data above I want to create a data frame like this:

1. $0011:0524-08-2021   &0011:0724-08-2021
2. $0021:0624-08-2021   &0021:0924-08-2021
3. $0031:3124-08-2021   &0031:3224-08-2021
4. $0041:3924-08-2021   &0041:3924-08-2021
5. $0012:3124-08-2021   &0012:3324-08-2021

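Basically, I want to sort the entries into a data frame as shown above. There are a few conditions that must be met:

1.) Column1 should contain only $ entries, and Column2 should contain only & entries.

2.) Both columns should be in order of increasing time: the $ entries in Column1 ordered by increasing time, and likewise the & entries in Column2.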

Answer 1

If you receive the lines as shown in the example, you can try:

import pandas as pd


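def process_lines(lines):
    # Pending "$" (start) lines, keyed by their 3-character ID.
    buffer = {}
    for line in map(str.strip, lines):
        if not line:
            continue  # skip blank lines
        id_ = line[1:4]
        if line[0] == "$":
            buffer[id_] = line
        elif line[0] == "&" and id_ in buffer:
            # Pair this end line with the pending start line of the same ID.
            yield buffer.pop(id_), line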


txt = """$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021"""

df = pd.DataFrame(process_lines(txt.splitlines()), columns=["A", "B"])
print(df)

Prints:

                    A                   B
0  $0011:0524-08-2021  &0011:0724-08-2021
1  $0021:0624-08-2021  &0021:0924-08-2021
2  $0031:3124-08-2021  &0031:3224-08-2021
3  $0041:3924-08-2021  &0041:3924-08-2021
4  $0012:3124-08-2021  &0012:3324-08-2021
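The code above pairs each & line with the pending $ line of the same ID, so the rows come out in input order. If the input might not already be ordered by time, one possible variation is to sort each column by its timestamp, as condition 2 in the question asks. The sketch below does that instead of matching IDs, so it is a different technique from the one above. It assumes a 1-character marker, a 3-digit ID, and a 10-character date at the end of each line. `sort_key` and `split_and_sort` are hypothetical names.

```python
from datetime import datetime


def sort_key(line):
    # Assumed layout: marker (1 char), ID (3 chars), time, then a 10-char DD-MM-YYYY date.
    return datetime.strptime(line[-10:] + " " + line[4:-10], "%d-%m-%Y %H:%M")


def split_and_sort(lines):
    """Return (start, end) pairs with each column sorted by time. Hypothetical helper."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    starts = sorted((ln for ln in lines if ln.startswith("$")), key=sort_key)
    ends = sorted((ln for ln in lines if ln.startswith("&")), key=sort_key)
    # zip stops at the shorter column, so unmatched entries are dropped.
    return list(zip(starts, ends))


rows = split_and_sort([
    "&0011:0724-08-2021",
    "$0021:0624-08-2021",
    "$0011:0524-08-2021",
    "&0021:0924-08-2021",
])
for row in rows:
    print(row)
```

The resulting list of pairs can be passed to `pd.DataFrame(rows, columns=["A", "B"])` in the same way as the generator above. Because the columns are sorted independently, a row is not guaranteed to hold a $ and & entry with the same ID.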

python

preprocessor

dataframe