Cleaning Excel File Using Pandas
I have an Excel file that I read with Pandas; the output is as follows:
+------------+-----+-----+-----+
| Type | 1 | 2 | 3 |
| Category | A | A | C |
| Dates | NaN | NaN | NaN |
| 01/01/2021 | 12 | 12 | 9 |
| 02/01/2021 | 10 | 10 | 2 |
| 03/01/2021 | 30 | 16 | NaN |
| 04/01/2021 | 15 | 23 | 4 |
| 05/01/2021 | 14 | 20 | 5 |
+------------+-----+-----+-----+
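The first two rows give information about each time series, column by column. So for column 1, Type is 1 and Category is A. I want to melt the time series, but I'm not sure how to approach this given the structure of the sheet.
Expected output: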
+------------+-------+----------+------+
| Dates | Price | Category | Type |
+------------+-------+----------+------+
| 01/01/2021 | 12 | A | 1 |
| 02/01/2021 | 10 | A | 1 |
| 03/01/2021 | 30 | A | 1 |
| 04/01/2021 | 15 | A | 1 |
| 05/01/2021 | 14 | A | 1 |
| 01/01/2021 | 12 | B | 2 |
| 02/01/2021 | 10 | B | 2 |
| 03/01/2021 | 16 | B | 2 |
| 04/01/2021 | 23 | B | 2 |
| 05/01/2021 | 20 | B | 2 |
| 01/01/2021 | 9 | C | 3 |
| 02/01/2021 | 2 | C | 3 |
| 04/01/2021 | 4 | C | 3 |
| 05/01/2021 | 5 | C | 3 |
+------------+-------+----------+------+
For Type 3 and Category C, the value on 03/01/2021 is NaN, so we drop that date. How can I achieve the expected output?
Assuming the following input dataframe:
col0 col1 col2 col3
0 Type 1 2 3
1 Category A A C
2 Dates NaN NaN NaN
3 01/01/2021 12 12 9
4 02/01/2021 10 10 2
5 03/01/2021 30 16 NaN
6 04/01/2021 15 23 4
7 05/01/2021 14 20 5
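To try the answers without the original workbook, the assumed frame can be rebuilt in memory. This is only a sketch: the column names col0 to col3 come from the printout above, while the exact dtypes that reading the real Excel file would produce are an assumption.

```python
import numpy as np
import pandas as pd

# Rebuild the assumed input: two metadata rows, an empty "Dates" row, then the data.
df = pd.DataFrame(
    {
        "col0": ["Type", "Category", "Dates", "01/01/2021", "02/01/2021",
                 "03/01/2021", "04/01/2021", "05/01/2021"],
        "col1": [1, "A", np.nan, 12, 10, 30, 15, 14],
        "col2": [2, "A", np.nan, 12, 10, 16, 23, 20],
        "col3": [3, "C", np.nan, 9, 2, np.nan, 4, 5],
    }
)
print(df)
```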
Here is a working pipeline:
import pandas as pd

(df.iloc[3:]
   .set_index('col0').rename_axis('Date')  # set first column aside
   # next 3 lines to rename columns index
   .T
   .set_index(pd.MultiIndex.from_arrays(df.iloc[:2, 1:].values, names=df.iloc[:2, 0]))
   .T
   .stack(level=[0,1])   # columns to rows
   .rename('Price')      # rename last unnamed column
   .reset_index()        # all indexes back to columns
)
Output:
Date Type Category Price
0 01/01/2021 1 A 12
1 01/01/2021 2 A 12
2 01/01/2021 3 C 9
3 02/01/2021 1 A 10
4 02/01/2021 2 A 10
5 02/01/2021 3 C 2
6 03/01/2021 1 A 30
7 03/01/2021 2 A 16
8 04/01/2021 1 A 15
9 04/01/2021 2 A 23
10 04/01/2021 3 C 4
11 05/01/2021 1 A 14
12 05/01/2021 2 A 20
13 05/01/2021 3 C 5
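The pipeline above gets rid of the NaN entry because stack drops missing values in its legacy implementation, which may behave differently across pandas versions. A variant that drops them explicitly could look like the sketch below. It is not the answerer's code: the names rows, header, wide and long are hypothetical, and it assumes pandas 1.1 or later for the ignore_index parameter of melt.

```python
import numpy as np
import pandas as pd

# Same assumed input as above, rebuilt row by row so this block runs on its own.
rows = [
    ["Type", 1, 2, 3],
    ["Category", "A", "A", "C"],
    ["Dates", np.nan, np.nan, np.nan],
    ["01/01/2021", 12, 12, 9],
    ["02/01/2021", 10, 10, 2],
    ["03/01/2021", 30, 16, np.nan],
    ["04/01/2021", 15, 23, 4],
    ["05/01/2021", 14, 20, 5],
]
df = pd.DataFrame(rows, columns=["col0", "col1", "col2", "col3"])

# Turn the two metadata rows into a two-level column index named Type / Category.
header = pd.MultiIndex.from_arrays(
    list(df.iloc[:2, 1:].to_numpy()), names=df.iloc[:2, 0].tolist()
)

# Keep only the data rows and move the dates into the index.
wide = df.iloc[3:].set_index("col0").rename_axis("Date")
wide.columns = header

# Unpivot: the column levels become the Type and Category columns,
# then rows without a price are dropped explicitly.
long = (
    wide.melt(ignore_index=False, value_name="Price")
    .dropna(subset=["Price"])
    .reset_index()
)
print(long)
```

Because melt walks the columns one at a time, the rows come out grouped by series rather than by date, which is the ordering used in the expected output of the question.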
Another solution:
df = df.T
df.columns = df.loc["col0"]
df = df.iloc[1:]
print(
    df.melt(["Type", "Category", "Dates"])
    .drop(columns="Dates")
    .rename(columns={"col0": "Dates", "value": "Price"})
    .sort_values(by=["Type", "Category", "Dates"])
    .dropna()
    .reset_index(drop=True)
)
Prints:
Type Category Dates Price
0 1 A 01/01/2021 12
1 1 A 02/01/2021 10
2 1 A 03/01/2021 30
3 1 A 04/01/2021 15
4 1 A 05/01/2021 14
5 2 B 01/01/2021 12
6 2 B 02/01/2021 10
7 2 B 03/01/2021 16
8 2 B 04/01/2021 23
9 2 B 05/01/2021 20
10 3 C 01/01/2021 9
11 3 C 02/01/2021 2
12 3 C 04/01/2021 4
13 3 C 05/01/2021 5
df used:
col0 col1 col2 col3
0 Type 1 2 3
1 Category A B C
2 Dates NaN NaN NaN
3 01/01/2021 12 12 9
4 02/01/2021 10 10 2
5 03/01/2021 30 16 NaN
6 04/01/2021 15 23 4
7 05/01/2021 14 20 5