DataFrame cleaning
I have an Excel spreadsheet that, once imported, looks similar to this:
from datetime import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
2021-08-01 | 2021-09-01 | 2021-10-01 | 2021-11-01 | 2021-12-01 |
---|---|---|---|---|
120 | NaN | NaN | 80 | NaN |
NaN | NaN | 40 | NaN | 20 |
NaN | 50 | NaN | 50 | NaN |
NaN | NaN | 100 | NaN | NaN |
300 | NaN | NaN | NaN | NaN |
I'm looking to transform it (with Python) into something like this:
shouldbe = pd.DataFrame({
    "PayDate1":
        [datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
    "Amount1": [120, 40, 50, 100, 300],
    "PayDate2":
        # pd.NaT marks a missing date, matching the NaT shown in the table
        [datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), pd.NaT, pd.NaT],
    "Amount2": [80, 20, 50, np.nan, np.nan]})
PayDate1 | Amount1 | PayDate2 | Amount2 |
---|---|---|---|
2021-08-01 | 120 | 2021-11-01 | 80 |
2021-10-01 | 40 | 2021-12-01 | 20 |
2021-09-01 | 50 | 2021-11-01 | 50 |
2021-10-01 | 100 | NaT | NaN |
2021-08-01 | 300 | NaT | NaN |
I'm looking for some examples of how to achieve this transformation. Thanks in advance for your help.
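You can use melt, groupby and pivot to get the expected DataFrame:
- Use melt to reshape your DataFrame into long format, one row per (index, date) pair, and drop the missing amounts: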
out = df.reset_index() \
.melt(id_vars='index', var_name='PayDate', value_name='Amount') \
.dropna()
print(out)
# Output
    index    PayDate  Amount
0       0 2021-08-01   120.0  # <- index 0, 1st occurrence
4       4 2021-08-01   300.0  # <- index 4, 1st occurrence
7       2 2021-09-01    50.0  # <- index 2, 1st occurrence
11      1 2021-10-01    40.0  # <- index 1, 1st occurrence
13      3 2021-10-01   100.0  # <- index 3, 1st occurrence
15      0 2021-11-01    80.0  # <- index 0, 2nd occurrence
17      2 2021-11-01    50.0  # <- index 2, 2nd occurrence
21      1 2021-12-01    20.0  # <- index 1, 2nd occurrence
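- Group by index and apply cumcount to number each row's payments; these numbers become the suffixes of the new columns ('1' and '2', kept as strings for the later concatenation):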
out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)
# Output:
    index    PayDate  Amount col
0       0 2021-08-01   120.0   1
4       4 2021-08-01   300.0   1
7       2 2021-09-01    50.0   1
11      1 2021-10-01    40.0   1
13      3 2021-10-01   100.0   1
15      0 2021-11-01    80.0   2
17      2 2021-11-01    50.0   2
21      1 2021-12-01    20.0   2
- Pivot the DataFrame:
out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)
# Output
         PayDate            Amount
col            1          2      1     2
index
0     2021-08-01 2021-11-01  120.0  80.0
1     2021-10-01 2021-12-01   40.0  20.0
2     2021-09-01 2021-11-01   50.0  50.0
3     2021-10-01        NaT  100.0   NaN
4     2021-08-01        NaT  300.0   NaN
- Flatten the column names and reorder the columns to get the final DataFrame:
cols = out.columns.get_level_values(1).argsort()
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None
out = out[out.columns[cols]]
print(out)
# Output
    PayDate1  Amount1   PayDate2  Amount2
0 2021-08-01    120.0 2021-11-01     80.0
1 2021-10-01     40.0 2021-12-01     20.0
2 2021-09-01     50.0 2021-11-01     50.0
3 2021-10-01    100.0        NaT      NaN
4 2021-08-01    300.0        NaT      NaN
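The steps above can be folded into a single helper. What follows is a minimal sketch under a few assumptions, not the answer's exact code: the function name `widen_payments` is hypothetical, the suffix is kept as an integer instead of a string, and the final column order is built from an explicit list of names rather than with `argsort`. The resulting dtypes may differ between pandas versions, so the sketch makes no promise about them.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def widen_payments(df: pd.DataFrame) -> pd.DataFrame:
    """Turn sparse date columns into PayDateN/AmountN pairs (hypothetical helper)."""
    long_df = (
        df.rename_axis("index")
        .reset_index()
        .melt(id_vars="index", var_name="PayDate", value_name="Amount")
        .dropna(subset=["Amount"])
    )
    # Number each original row's payments 1, 2, ... in date-column order.
    long_df["col"] = long_df.groupby("index").cumcount().add(1)
    wide = long_df.pivot(index="index", columns="col", values=["PayDate", "Amount"])
    # Flatten ('PayDate', 1) -> 'PayDate1', and so on.
    wide.columns = [f"{name}{n}" for name, n in wide.columns]
    # Interleave the columns as PayDate1, Amount1, PayDate2, Amount2, ...
    n_max = int(long_df["col"].max())
    ordered = [f"{name}{n}" for n in range(1, n_max + 1) for name in ("PayDate", "Amount")]
    wide = wide[ordered]
    wide.index.name = None
    return wide


df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})
result = widen_payments(df)
print(result)
```

Because the column list is generated from the largest payment count, the same helper should also cope with rows that carry three or more payments; a frame with no payments at all is not handled here.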
Purely for completeness, here is a non-pandas approach:
import math
from datetime import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
dates = df.columns
# One list of non-missing amounts per date column
out = {k: [] for k in dates}
for _, row in df.iterrows():
    for i, val in enumerate(row):
        d = dates[i]
        if not math.isnan(val):
            out[d].append(val)
print(out)
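This isn't pandas-y (in fact, the final output here isn't even a pandas DataFrame, although converting it back to one is trivial), but I'd argue it is easier to read and therefore more Pythonic(TM). More importantly, it may be better suited to some use cases.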
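Note that the loop above collects amounts per date, so its result is not yet in the PayDateN/AmountN layout from the question. A row-wise variant in the same plain-Python spirit might look like the sketch below. The names `records` and `wide` are assumptions introduced for illustration, and the last step relies on `pd.DataFrame` accepting a list of dicts and filling keys that are absent from a record with missing values.

```python
import math
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})

records = []
for _, row in df.iterrows():
    # Keep only the non-missing payments of this row, in date-column order.
    paid = [(date, amount) for date, amount in row.items() if not math.isnan(amount)]
    record = {}
    for n, (date, amount) in enumerate(paid, start=1):
        record[f"PayDate{n}"] = date
        record[f"Amount{n}"] = amount
    records.append(record)

# Rows with fewer payments simply lack the later keys; pandas fills the gaps.
wide = pd.DataFrame(records)
print(wide)
```

Each record is a prefix of the full column sequence, so the columns come out in the order PayDate1, Amount1, PayDate2, Amount2 without any explicit sorting.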