How to apply a self join on a Pandas DataFrame

I am basically unable to join a Pandas Series with a DataFrame.

Let's generate some dummy data using the code below:

import numpy as np
import pandas as pd

test = pd.DataFrame({'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
                              '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09', '2021-01-10',
                              '2021-01-11', '2021-01-12', '2021-01-13', '2021-01-14'],
                     'New_Date': [np.nan, '2021-01-01', '2021-01-01', '2021-01-04', '2021-01-03',
                                    '2021-01-06', '2021-01-08', '2021-01-08', '2021-01-09', '2021-01-11',
                                    '2021-01-11', '2021-01-13', '2021-01-13', np.nan],
                     'Price': [1, 1, 5, 3, 4, 3, 2, 5, 6, 4, 3, 2, 1, 7]})
# Convert both date columns to datetime64; the missing New_Date values become NaT
test['Date'] = pd.to_datetime(test['Date'])
test['New_Date'] = pd.to_datetime(test['New_Date'])
# Use Date as the index so that prices can be looked up by date
test.set_index('Date', inplace=True)

Actual DataFrame

+------------+------------+-------+
|    Date    |  New_Date  | Price |
+------------+------------+-------+
| 01/01/2021 | NaT        |     1 |
| 02/01/2021 | 01/01/2021 |     1 |
| 03/01/2021 | 01/01/2021 |     5 |
| 04/01/2021 | 04/01/2021 |     3 |
| 05/01/2021 | 03/01/2021 |     4 |
| 06/01/2021 | 06/01/2021 |     3 |
| 07/01/2021 | 08/01/2021 |     2 |
| 08/01/2021 | 08/01/2021 |     5 |
| 09/01/2021 | 09/01/2021 |     6 |
| 10/01/2021 | 11/01/2021 |     4 |
| 11/01/2021 | 11/01/2021 |     3 |
| 12/01/2021 | 13/01/2021 |     2 |
| 13/01/2021 | 13/01/2021 |     1 |
| 14/01/2021 | NaT        |     7 |
+------------+------------+-------+

Expected output

+------------+------------+-------+-----------+
|    Date    |  New_Date  | Price | New_Price |
+------------+------------+-------+-----------+
| 01/01/2021 | NaT        |     1 | NaN       |
| 02/01/2021 | 01/01/2021 |     1 | 1         |
| 03/01/2021 | 01/01/2021 |     5 | 1         |
| 04/01/2021 | 04/01/2021 |     3 | 3         |
| 05/01/2021 | 03/01/2021 |     4 | 5         |
| 06/01/2021 | 06/01/2021 |     3 | 3         |
| 07/01/2021 | 08/01/2021 |     2 | 5         |
| 08/01/2021 | 08/01/2021 |     5 | 5         |
| 09/01/2021 | 09/01/2021 |     6 | 6         |
| 10/01/2021 | 11/01/2021 |     4 | 3         |
| 11/01/2021 | 11/01/2021 |     3 | 3         |
| 12/01/2021 | 13/01/2021 |     2 | 1         |
| 13/01/2021 | 13/01/2021 |     1 | 1         |
| 14/01/2021 | NaT        |     7 | NaN       |
+------------+------------+-------+-----------+

I want to create a column New_Price by using New_Date as the lookup key and joining it against Date to fetch the matching Price, which will be named New_Price.

Following this, I tried the solution below:

test['New_Price'] = test['Price'][test['New_Date']].values

The solution above fails because of the NaT values, and it is not really a join anyway, so I tried another approach:

test.join(test.drop(columns='New_Date'), on='New_Date', rsuffix='_y')

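This solves the problem, since I only need to rename Price_y to New_Price. But if the test df has 20 columns, how can I keep all the columns from the left df and bring in only the Price column from the right df, named New_Price? Is there an elegant way to do this?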

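Since the DataFrame test already has Date set as its index, you can easily use it as a mapping table and look up each New_Date in the Date index. Then, wherever a New_Date matches a Date, we take the corresponding Price as the New_Price for that New_Date.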

This can be done with Series.map(), as follows:

test['New_Price'] = test['New_Date'].map(test['Price'])

Result:

print(test)

             New_Date  Price  New_Price
Date                                   
2021-01-01        NaT      1        NaN
2021-01-02 2021-01-01      1        1.0
2021-01-03 2021-01-01      5        1.0
2021-01-04 2021-01-04      3        3.0
2021-01-05 2021-01-03      4        5.0
2021-01-06 2021-01-06      3        3.0
2021-01-07 2021-01-08      2        5.0
2021-01-08 2021-01-08      5        5.0
2021-01-09 2021-01-09      6        6.0
2021-01-10 2021-01-11      4        3.0
2021-01-11 2021-01-11      3        3.0
2021-01-12 2021-01-13      2        1.0
2021-01-13 2021-01-13      1        1.0
2021-01-14        NaT      7        NaN
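To make the lookup behaviour of Series.map() more concrete, here is a minimal sketch on a four-row stand-in for the question's data (the frame name df and the variables new_price and new_price_int are made up for illustration). It also shows one possible way to avoid the float output: because NaN forces the column to float64, casting to the nullable Int64 dtype keeps whole numbers, assuming every matched price is an integer.

```python
import pandas as pd

# Small stand-in for the question's data: Date is the index, New_Date is the lookup key
df = pd.DataFrame(
    {
        "New_Date": pd.to_datetime([pd.NaT, "2021-01-01", "2021-01-01", "2021-01-03"]),
        "Price": [1, 1, 5, 3],
    },
    index=pd.Index(
        pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
        name="Date",
    ),
)

# Passing a Series to map() uses it as a lookup table: each New_Date value is
# looked up in the index of df["Price"]; values without a match (here NaT) give NaN
new_price = df["New_Date"].map(df["Price"])

# Optional: the nullable Int64 dtype keeps integers and marks the gaps as <NA>
new_price_int = new_price.astype("Int64")

print(new_price)
print(new_price_int)
```

Whether the Int64 cast is worth it depends on what consumes the column afterwards; the plain float result is what the output above shows.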

Let's try a join that aligns only the Price column on New_Date:

new_df = test.join(test['Price'].rename('New_Price'), on='New_Date')

new_df

             New_Date  Price  New_Price
Date                                   
2021-01-01        NaT      1        NaN
2021-01-02 2021-01-01      1        1.0
2021-01-03 2021-01-01      5        1.0
2021-01-04 2021-01-04      3        3.0
2021-01-05 2021-01-03      4        5.0
2021-01-06 2021-01-06      3        3.0
2021-01-07 2021-01-08      2        5.0
2021-01-08 2021-01-08      5        5.0
2021-01-09 2021-01-09      6        6.0
2021-01-10 2021-01-11      4        3.0
2021-01-11 2021-01-11      3        3.0
2021-01-12 2021-01-13      2        1.0
2021-01-13 2021-01-13      1        1.0
2021-01-14        NaT      7        NaN
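Regarding the case with many columns raised in the question, the sketch below adds two hypothetical extra columns (Volume and Flag, which are not in the original data) to a small stand-in frame named wide. Because only the renamed Price Series is joined in, every left-hand column is kept and exactly one column, New_Price, is added. The final check simply confirms that, on this sample, the join and the map() lookup produce the same values.

```python
import pandas as pd

dates = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])
wide = pd.DataFrame(
    {
        "New_Date": pd.to_datetime([pd.NaT, "2021-01-01", "2021-01-01", "2021-01-03"]),
        "Price": [1, 1, 5, 3],
        "Volume": [10, 20, 30, 40],  # hypothetical extra column
        "Flag": list("abcd"),        # hypothetical extra column
    },
    index=pd.Index(dates, name="Date"),
)

# Join only the Price column, renamed up front so no suffix handling is needed;
# the right-hand Series is matched on its index (Date) against the left New_Date column
joined = wide.join(wide["Price"].rename("New_Price"), on="New_Date")

# The same lookup expressed with map(), for comparison
mapped = wide["New_Date"].map(wide["Price"])

# Raises an AssertionError if the two results differ (names are ignored)
pd.testing.assert_series_equal(joined["New_Price"], mapped, check_names=False)

print(joined)
```

Renaming the Series before joining is what avoids the rsuffix and the follow-up rename from the question's attempt.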