How to apply a self join on a Pandas DataFrame
I am basically unable to join a Pandas Series with a DataFrame.
Let's generate some dummy data using the code below:
import numpy as np
import pandas as pd

test = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
                              '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10',
                              '2021-01-11', '2021-01-12', '2021-01-13', '2021-01-14'],
                     'New_Date': [np.nan, '2021-01-01', '2021-01-01', '2021-01-04', '2021-01-03',
                                  '2021-01-06', '2021-01-08', '2021-01-08', '2021-01-09', '2021-01-11',
                                  '2021-01-11', '2021-01-13', '2021-01-13', np.nan],
                     'Price': [1, 1, 5, 3, 4, 3, 2, 5, 6, 4, 3, 2, 1, 7]})
# Parse both date columns, then use Date as the index
test['Date'] = pd.to_datetime(test['Date'])
test['New_Date'] = pd.to_datetime(test['New_Date'])
test.set_index('Date', inplace=True)
Actual df
+------------+------------+-------+
| Date | New_Date | Price |
+------------+------------+-------+
| 01/01/2021 | NaT | 1 |
| 02/01/2021 | 01/01/2021 | 1 |
| 03/01/2021 | 01/01/2021 | 5 |
| 04/01/2021 | 04/01/2021 | 3 |
| 05/01/2021 | 03/01/2021 | 4 |
| 06/01/2021 | 06/01/2021 | 3 |
| 07/01/2021 | 08/01/2021 | 2 |
| 08/01/2021 | 08/01/2021 | 5 |
| 09/01/2021 | 09/01/2021 | 6 |
| 10/01/2021 | 11/01/2021 | 4 |
| 11/01/2021 | 11/01/2021 | 3 |
| 12/01/2021 | 13/01/2021 | 2 |
| 13/01/2021 | 13/01/2021 | 1 |
| 14/01/2021 | NaT | 7 |
+------------+------------+-------+
Expected output
+------------+------------+-------+-----------+
| Date | New_Date | Price | New_Price |
+------------+------------+-------+-----------+
| 01/01/2021 | NaT | 1 | NaN |
| 02/01/2021 | 01/01/2021 | 1 | 1 |
| 03/01/2021 | 01/01/2021 | 5 | 1 |
| 04/01/2021 | 04/01/2021 | 3 | 3 |
| 05/01/2021 | 03/01/2021 | 4 | 5 |
| 06/01/2021 | 06/01/2021 | 3 | 3 |
| 07/01/2021 | 08/01/2021 | 2 | 5 |
| 08/01/2021 | 08/01/2021 | 5 | 5 |
| 09/01/2021 | 09/01/2021 | 6 | 6 |
| 10/01/2021 | 11/01/2021 | 4 | 3 |
| 11/01/2021 | 11/01/2021 | 3 | 3 |
| 12/01/2021 | 13/01/2021 | 2 | 1 |
| 13/01/2021 | 13/01/2021 | 1 | 1 |
| 14/01/2021 | NaT | 7 | NaN |
+------------+------------+-------+-----------+
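I want to create a column New_Price by using New_Date as the lookup key: match each New_Date against the Date index, take the Price found there, and name it New_Price.
Following this, I tried the solution below: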
test['New_Price'] = test['Price'][test['New_Date']].values
The above solution fails because of the NaT values, and it is not really a join anyway, so I tried another approach:
test.join(test.drop(columns='New_Date'), on='New_Date', rsuffix='_y')
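This solves the problem, since I only need to rename Price_y to New_Price.
But if the test df has 20 columns, how can I keep all the columns from the left df and take only the Price column, named New_Price, from the right df? Is there an elegant way to do this?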
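Since the DataFrame test already has Date set as its index, you can simply use it as a mapping table and look up each New_Date in that index. Then, wherever a New_Date matches a Date, we take the corresponding Price as New_Price (for that New_Date).
This can be done with Series.map(), as follows: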
test['New_Price'] = test['New_Date'].map(test['Price'])
Result:
print(test)
New_Date Price New_Price
Date
2021-01-01 NaT 1 NaN
2021-01-02 2021-01-01 1 1.0
2021-01-03 2021-01-01 5 1.0
2021-01-04 2021-01-04 3 3.0
2021-01-05 2021-01-03 4 5.0
2021-01-06 2021-01-06 3 3.0
2021-01-07 2021-01-08 2 5.0
2021-01-08 2021-01-08 5 5.0
2021-01-09 2021-01-09 6 6.0
2021-01-10 2021-01-11 4 3.0
2021-01-11 2021-01-11 3 3.0
2021-01-12 2021-01-13 2 1.0
2021-01-13 2021-01-13 1 1.0
2021-01-14 NaT 7 NaN
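To make the lookup behaviour easier to see, here is a minimal sketch on a smaller, made-up frame (the three rows and the variable names `df` and `lookup` are illustrative assumptions, not part of the original data). It also drops duplicate index labels before mapping, on the assumption that the last price per date is the one wanted; with a unique index, as in the question, that step changes nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with the same shape as `test`
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "New_Date": pd.to_datetime(["2021-01-02", None, "2021-01-01"]),
    "Price": [1, 5, 3],
}).set_index("Date")

# The Price Series is keyed by Date, so it can act as a lookup table.
lookup = df["Price"]

# Assumption: if a date appeared twice in the index, keep its last price.
lookup = lookup[~lookup.index.duplicated(keep="last")]

# Each New_Date is looked up in the index of `lookup`;
# NaT (or any date not present in the index) yields NaN.
df["New_Price"] = df["New_Date"].map(lookup)
print(df)
```

Because missing keys produce NaN, the new column comes back as float even though Price holds integers, which matches the 1.0, 5.0 values in the result above.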
Let's try join, aligning only the Price column on New_Date:
new_df = test.join(test['Price'].rename('New_Price'), on='New_Date')
new_df
New_Date Price New_Price
Date
2021-01-01 NaT 1 NaN
2021-01-02 2021-01-01 1 1.0
2021-01-03 2021-01-01 5 1.0
2021-01-04 2021-01-04 3 3.0
2021-01-05 2021-01-03 4 5.0
2021-01-06 2021-01-06 3 3.0
2021-01-07 2021-01-08 2 5.0
2021-01-08 2021-01-08 5 5.0
2021-01-09 2021-01-09 6 6.0
2021-01-10 2021-01-11 4 3.0
2021-01-11 2021-01-11 3 3.0
2021-01-12 2021-01-13 2 1.0
2021-01-13 2021-01-13 1 1.0
2021-01-14 NaT 7 NaN
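For a wider frame, the same idea might be extended to more than one looked-up column. The sketch below is only an illustration under assumptions: the `Volume` column, the `wanted` list and the `New_` prefix are made up for the example. It selects the columns to bring over, prefixes them so they cannot collide with the left-hand columns, and joins on New_Date.

```python
import pandas as pd

# Hypothetical frame; 'Volume' stands in for the extra columns of a wide df
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "New_Date": pd.to_datetime(["2021-01-02", None, "2021-01-01"]),
    "Price": [1, 5, 3],
    "Volume": [100, 200, 300],
}).set_index("Date")

# Columns to pull from the matched row (an assumption for this example)
wanted = ["Price", "Volume"]

# Prefix the right-hand columns so no suffix handling is needed
right = df[wanted].add_prefix("New_")

# Left join: values of New_Date are matched against the Date index of `right`
out = df.join(right, on="New_Date")
print(out)
```

All left-hand columns are kept as they are, and only the selected columns arrive from the right, so nothing needs to be dropped or renamed afterwards.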