Python: creating columns and rows from a txt file without recognizable rows and columns

Question

Link to txt file example Additional File Link

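I want to use Python to read a txt file and select column and row information from it. The text in the file sits in a single column, with the data for each column listed further down the file, as in the example below. I added some extra information in (). I have several different txt files in a similar format, so I would like to be able to swap in each file and run the same code.

My goal is to get the data into a CSV or Excel file.

Here is a link to the txt file. About a third of the way down you will see, in (), where the column names are, and directly below that I have marked where the data for those columns is. There is more data further down the sheet, but I just need a good start. I have several similar txt files that the code needs to be able to open and process.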

Answer 1

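Doing ETL on an almost free-form data interface is always ugly and brittle. Approach:

treat the column definitions as metadata
find the metadata and get its start and end lines
clean up the metadata: there are column definitions that don't exist!
now we have the column names
treat the rest of the rows after the last metadata row as the actual data
reshape() them into an array of cols X the maximum possible number of rows in the remaining data
run post-transform validation logic to filter out records that look wrong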
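
The reshape step in the list above can be tried in isolation first. The following is only a minimal sketch on a made-up input: the `lines` list, its three column names, and the fixed `ncols` are illustrative assumptions, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input: one value per line. The first three lines play the
# role of the column names (the "metadata"); the rest is data.
lines = ["Egg #", "Day0", "Fate",
         "1 A1", "MII", "F",
         "2 A2", "MII", "ET",
         "3 A3"]                      # incomplete trailing record

ncols = 3
names, values = lines[:ncols], lines[ncols:]

# Keep only as many values as fill complete rows, then fold every run of
# ncols consecutive values into one row.
nrows = len(values) // ncols
table = np.array(values[:nrows * ncols], dtype=object).reshape(nrows, ncols)

toy = pd.DataFrame(table, columns=names)
print(toy)
```

The incomplete last record is dropped by the truncation, which is the same reason the full script below trims the rows before calling reshape().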

import pandas as pd

# read the whole file, then make one DataFrame row per line of text
with open("data1.txt") as f:
    data = f.read()
dfraw = pd.DataFrame(data.split("\n"))

# find first and last row that contain column meta information
colsi = dfraw.loc[dfraw[0].str.contains("Column")].index
assert len(colsi)==2, "failed to find column labels"

# first stab... based on finding meta information
cols = dfraw.loc[colsi[0]:colsi[1]]
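# exclude meta rows that are empty, labeled "Grade", or the opening "(Column Names)" marker
# (the closing "(End of Column names)" marker is kept and becomes the last column)
cols = cols[(cols[0]!="") & (cols[0]!="Grade") & (cols[0]!="(Column Names)--------------------")]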

# shift dfraw start row to start of data after column meta data
dfraw = dfraw.loc[colsi[1]+1:].reset_index(drop=True)
# keep only as many rows as fill complete records, so reshape() does not fail,
# then use numpy reshape() to fold each run of len(cols) rows into one record
nrows = len(dfraw) // len(cols)
df = pd.DataFrame(dfraw.loc[:nrows*len(cols)-1].values.reshape(nrows, len(cols)),
            columns=cols[0].values)

# filter to valid "Egg #" and Day0 is defined
df = df[(df["Egg #"].str.match("^[0-9]+[ ]+[A-Z][0-9]")) & (df["Day0"].str.strip()!="")]

print(df.to_string(index=False))

Output

                      Egg #  Day0 Maturity    IVF/ICSI                        Day1                   Day2 Grade                     Day 3                 Grade- Frag%                   Day4                   Day5                   Day6                   Day7                  Comment         Fate Freezing ID    (End of Column names)-------------------------
 1   A1 (egg #)------------  MII                  ICSI  2PN/2PB  (Day 1)----------  2BC   (Day 2 grade)--------  6BC   (Day 3)-----------  15   (Grade- Frag%)--------  -    (Day 4) --------  -    (Day 5) --------  -    (Day 6) --------  -    (Day 7) --------       (Comments) -------  F    (Fate)                       A102519-01EV (Freezing ID)-------------
                     2   A2   MII                 ICSI                     2PN/2PB                           5B                       12B                           10                      -                      -                      -                      -                                     F                                                  B102519-01EV
                     3   A3   MII                 ICSI                     2PN/2PB                           5B                        8B                           10                      -                      -                      -                      -                                    ET                                                              
                     4   A4   MII                 ICSI                     2PN/2PB                           5B                       10A                            0                      -                      -                      -                      -                                    ET                                                              
                     5   B1   MII                 ICSI                     2PN/2PB                          5BC                       8BC                           10                      -                      -                      -                      -                                     F                                                  A102519-01EV
                     6   B2    MI                                         MII/ICSI                      2PN/2PB                      SAME                            -                   SAME                      -                      -                      -                                     D
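
Since the question mentions several similarly formatted files and a CSV or Excel target, the script could be wrapped in a function and applied to each file. The following is a rough sketch, not part of the original answer: `convert_all` is a hypothetical helper, and `parse` stands for whatever function turns one file's text into a DataFrame (for example, the logic above moved into a function that takes the file contents as a string).

```python
from pathlib import Path
from typing import Callable, Iterable

import pandas as pd


def convert_all(paths: Iterable[Path],
                parse: Callable[[str], pd.DataFrame],
                out_dir: Path) -> list[Path]:
    """Parse each text file with `parse` and write one CSV per input file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in paths:
        path = Path(path)
        df = parse(path.read_text())            # text -> DataFrame (assumed parser)
        target = out_dir / (path.stem + ".csv") # data1.txt -> data1.csv
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

A call might look like `convert_all(Path(".").glob("*.txt"), parse, Path("csv_out"))`. For Excel output, `df.to_excel(...)` can be used in place of `df.to_csv(...)`; writing .xlsx files requires an engine such as openpyxl to be installed.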
