Cleaning Excel File Using Pandas

I have an Excel file that I read with Pandas. The output looks like this:

+------------+-----+-----+-----+
| Type       | 1   | 2   | 3   |
| Category   | A   | B   | C   |
| Dates      | NaN | NaN | NaN |
| 01/01/2021 | 12  | 12  | 9   |
| 02/01/2021 | 10  | 10  | 2   |
| 03/01/2021 | 30  | 16  | NaN |
| 04/01/2021 | 15  | 23  | 4   |
| 05/01/2021 | 14  | 20  | 5   |
+------------+-----+-----+-----+

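The first two rows give, column by column, the information about each time series. So for column 1, Type is 1 and Category is A. I would like to melt the time series, but I am not sure how to approach the problem given the structure of the sheet.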

Expected output:

+------------+-------+----------+------+
|   Dates    | Price | Category | Type |
+------------+-------+----------+------+
| 01/01/2021 |    12 | A        |    1 |
| 02/01/2021 |    10 | A        |    1 |
| 03/01/2021 |    30 | A        |    1 |
| 04/01/2021 |    15 | A        |    1 |
| 05/01/2021 |    14 | A        |    1 |
| 01/01/2021 |    12 | B        |    2 |
| 02/01/2021 |    10 | B        |    2 |
| 03/01/2021 |    16 | B        |    2 |
| 04/01/2021 |    23 | B        |    2 |
| 05/01/2021 |    20 | B        |    2 |
| 01/01/2021 |     9 | C        |    3 |
| 02/01/2021 |     2 | C        |    3 |
| 04/01/2021 |     4 | C        |    3 |
| 05/01/2021 |     5 | C        |    3 |
+------------+-------+----------+------+

In the case of Type 3, Category C, the value for 03/01/2021 is NaN, so we drop that date. How can I achieve the expected output?

Assuming the following input dataframe:

         col0 col1 col2 col3
0        Type    1    2    3
1    Category    A    A    C
2       Dates  NaN  NaN  NaN
3  01/01/2021   12   12    9
4  02/01/2021   10   10    2
5  03/01/2021   30   16  NaN
6  04/01/2021   15   23    4
7  05/01/2021   14   20    5
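To reproduce this without the Excel file, the frame above can be rebuilt by hand. This is only a sketch: the column names col0 to col3 come from the printout, and it assumes every cell is stored as a string with NaN for the blanks (the actual dtypes after pd.read_excel may differ).

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the sheet; cell values are assumed to be strings.
rows = [
    ["Type", "1", "2", "3"],
    ["Category", "A", "A", "C"],
    ["Dates", np.nan, np.nan, np.nan],
    ["01/01/2021", "12", "12", "9"],
    ["02/01/2021", "10", "10", "2"],
    ["03/01/2021", "30", "16", np.nan],
    ["04/01/2021", "15", "23", "4"],
    ["05/01/2021", "14", "20", "5"],
]
df = pd.DataFrame(rows, columns=["col0", "col1", "col2", "col3"])
print(df)
```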

Here is a working pipeline:

import pandas as pd

(df.iloc[3:]
   .set_index('col0').rename_axis('Date') # set first column aside
   # next 3 lines to rename columns index
   .T
   .set_index(pd.MultiIndex.from_arrays(df.iloc[:2, 1:].values, names=df.iloc[:2, 0]))
   .T
   .stack(level=[0,1]) # columns to rows
   .dropna()           # drop NaN prices explicitly instead of relying on stack's default
   .rename('Price')    # name the stacked values
   .reset_index()      # all indexes back to columns
)

Output:

          Date Type Category Price
0   01/01/2021    1        A    12
1   01/01/2021    2        A    12
2   01/01/2021    3        C     9
3   02/01/2021    1        A    10
4   02/01/2021    2        A    10
5   02/01/2021    3        C     2
6   03/01/2021    1        A    30
7   03/01/2021    2        A    16
8   04/01/2021    1        A    15
9   04/01/2021    2        A    23
10  04/01/2021    3        C     4
11  05/01/2021    1        A    14
12  05/01/2021    2        A    20
13  05/01/2021    3        C     5
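The rows above come out grouped by date, whereas the expected output in the question is grouped by series. If that order matters, one option is to sort afterwards. A minimal sketch, using a hand-typed excerpt of the result (the names out and tidy are only illustrative) and assuming the dates are day/month/year strings and the prices are numeric strings:

```python
import pandas as pd

# A few rows in the same date-major order as the output above.
out = pd.DataFrame({
    "Date": ["01/01/2021", "01/01/2021", "02/01/2021", "02/01/2021"],
    "Type": ["1", "3", "1", "3"],
    "Category": ["A", "C", "A", "C"],
    "Price": ["12", "9", "10", "2"],
})

tidy = (
    out.assign(
        # Assumption: the strings are day/month/year.
        Date=pd.to_datetime(out["Date"], format="%d/%m/%Y"),
        # Assumption: prices were read as strings and should be numeric.
        Price=pd.to_numeric(out["Price"]),
    )
    .sort_values(["Type", "Date"])  # group by series, then chronological
    .reset_index(drop=True)
)
print(tidy)
```

Parsing the dates before sorting avoids relying on string order, which would break as soon as the data spans more than one month.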

Another solution:

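df = df.T                    # one row per original column (col0, col1, ...)
df.columns = df.loc["col0"]  # use the label row as headers; the columns index is now named "col0"
df = df.iloc[1:]             # drop the label row itself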

print(
    df.melt(["Type", "Category", "Dates"])
    .drop(columns="Dates")
    .rename(columns={"col0": "Dates", "value": "Price"})
    .sort_values(by=["Type", "Category", "Dates"])
    .dropna()
    .reset_index(drop=True)
)

Prints:

   Type Category       Dates Price
0     1        A  01/01/2021    12
1     1        A  02/01/2021    10
2     1        A  03/01/2021    30
3     1        A  04/01/2021    15
4     1        A  05/01/2021    14
5     2        B  01/01/2021    12
6     2        B  02/01/2021    10
7     2        B  03/01/2021    16
8     2        B  04/01/2021    23
9     2        B  05/01/2021    20
10    3        C  01/01/2021     9
11    3        C  02/01/2021     2
12    3        C  04/01/2021     4
13    3        C  05/01/2021     5

The df used:

         col0 col1 col2 col3
0        Type    1    2    3
1    Category    A    B    C
2       Dates  NaN  NaN  NaN
3  01/01/2021   12   12    9
4  02/01/2021   10   10    2
5  03/01/2021   30   16  NaN
6  04/01/2021   15   23    4
7  05/01/2021   14   20    5
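For convenience, the transpose-and-melt approach can be wrapped in a function and run end to end on this frame. This is a sketch rather than a drop-in solution: the function name melt_sheet is hypothetical, the frame is rebuilt by hand with string cells (an assumption about what pd.read_excel returns), and the all-NaN "Dates" row is dropped before melting so that the name can be reused for the date column.

```python
import numpy as np
import pandas as pd

def melt_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Reshape the wide sheet (metadata in the first rows) into long form."""
    wide = raw.T                  # one row per original column
    wide.columns = wide.iloc[0]   # first row holds Type / Category / Dates / the dates
    wide = wide.iloc[1:]          # drop that label row
    return (
        wide.drop(columns="Dates")            # placeholder column, all NaN
        .melt(["Type", "Category"], var_name="Dates", value_name="Price")
        .dropna(subset=["Price"])             # remove dates with no price
        .sort_values(["Type", "Category", "Dates"])
        .reset_index(drop=True)
    )

# Hand-built copy of the frame above; cell values are assumed to be strings.
df = pd.DataFrame(
    [
        ["Type", "1", "2", "3"],
        ["Category", "A", "B", "C"],
        ["Dates", np.nan, np.nan, np.nan],
        ["01/01/2021", "12", "12", "9"],
        ["02/01/2021", "10", "10", "2"],
        ["03/01/2021", "30", "16", np.nan],
        ["04/01/2021", "15", "23", "4"],
        ["05/01/2021", "14", "20", "5"],
    ],
    columns=["col0", "col1", "col2", "col3"],
)

result = melt_sheet(df)
print(result)
```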