Combining DataFrames based on column labels with a DatetimeIndex
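I have weather data stored in many separate files. Each column holds one measuring instrument, and each row holds the average reading for a particular date. Suppose one file looks like this:
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.random((10, 3)),
                     index=pd.date_range('1950-01-01', periods=10),
                     columns=['A', 'B', 'C'])
first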
Out[21]:
A B C
1950-01-01 0.939932 0.504543 0.091025
1950-01-02 0.121418 0.725333 0.444813
1950-01-03 0.338385 0.783398 0.116468
1950-01-04 0.847905 0.846147 0.226074
1950-01-05 0.156315 0.704804 0.524886
1950-01-06 0.412284 0.425379 0.427246
1950-01-07 0.165859 0.406347 0.114586
1950-01-08 0.392670 0.789526 0.174001
1950-01-09 0.246180 0.776304 0.019368
1950-01-10 0.142213 0.731748 0.954076
The second one looks like this:
second = pd.DataFrame(np.random.random((10, 3)),
                      index=pd.date_range('1950-01-11', periods=10),
                      columns=['A', 'B', 'D'])
second
Out[30]:
A B D
1950-01-11 0.190767 0.905640 0.325411
1950-01-12 0.109964 0.754694 0.414402
1950-01-13 0.058164 0.305405 0.768333
1950-01-14 0.267644 0.919876 0.631083
1950-01-15 0.981333 0.454678 0.533075
1950-01-16 0.831600 0.823845 0.980366
1950-01-17 0.303585 0.091634 0.338517
1950-01-18 0.723445 0.088020 0.570779
1950-01-19 0.639665 0.954577 0.763810
1950-01-20 0.370629 0.716066 0.628383
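I would like to combine the two so that all instruments (i.e. A, B, C, D, ...) appear in a single file covering every measurement period. The expected result looks like this: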
                   A         B         C         D
1950-01-01  0.939932  0.504543  0.091025
1950-01-02  0.121418  0.725333  0.444813
1950-01-03  0.338385  0.783398  0.116468
1950-01-04  0.847905  0.846147  0.226074
1950-01-05  0.156315  0.704804  0.524886
1950-01-06  0.412284  0.425379  0.427246
1950-01-07  0.165859  0.406347  0.114586
1950-01-08  0.392670  0.789526  0.174001
1950-01-09  0.246180  0.776304  0.019368
1950-01-10  0.142213  0.731748  0.954076
1950-01-11  0.190767  0.905640            0.325411
1950-01-12  0.109964  0.754694            0.414402
1950-01-13  0.058164  0.305405            0.768333
1950-01-14  0.267644  0.919876            0.631083
1950-01-15  0.981333  0.454678            0.533075
1950-01-16  0.831600  0.823845            0.980366
1950-01-17  0.303585  0.091634            0.338517
1950-01-18  0.723445  0.088020            0.570779
1950-01-19  0.639665  0.954577            0.763810
1950-01-20  0.370629  0.716066            0.628383
To get this, I tried:
first.merge(second, how='outer', left_index=True, right_index=True)
Out[34]:
A_x B_x C A_y B_y D
1950-01-01 0.939932 0.504543 0.091025 NaN NaN NaN
1950-01-02 0.121418 0.725333 0.444813 NaN NaN NaN
1950-01-03 0.338385 0.783398 0.116468 NaN NaN NaN
1950-01-04 0.847905 0.846147 0.226074 NaN NaN NaN
1950-01-05 0.156315 0.704804 0.524886 NaN NaN NaN
1950-01-06 0.412284 0.425379 0.427246 NaN NaN NaN
1950-01-07 0.165859 0.406347 0.114586 NaN NaN NaN
1950-01-08 0.392670 0.789526 0.174001 NaN NaN NaN
1950-01-09 0.246180 0.776304 0.019368 NaN NaN NaN
1950-01-10 0.142213 0.731748 0.954076 NaN NaN NaN
1950-01-11 NaN NaN NaN 0.190767 0.905640 0.325411
1950-01-12 NaN NaN NaN 0.109964 0.754694 0.414402
1950-01-13 NaN NaN NaN 0.058164 0.305405 0.768333
1950-01-14 NaN NaN NaN 0.267644 0.919876 0.631083
1950-01-15 NaN NaN NaN 0.981333 0.454678 0.533075
1950-01-16 NaN NaN NaN 0.831600 0.823845 0.980366
1950-01-17 NaN NaN NaN 0.303585 0.091634 0.338517
1950-01-18 NaN NaN NaN 0.723445 0.088020 0.570779
1950-01-19 NaN NaN NaN 0.639665 0.954577 0.763810
1950-01-20 NaN NaN NaN 0.370629 0.716066 0.628383
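But as you can see, the columns that should have been combined (A and B) were split into A_x/A_y and B_x/B_y, because the two frames have no row index values in common. I think this would be a very useful feature in pandas. Can it be done?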
Assuming first is df1 and second is df2, using concat should solve your problem.
>>> pd.concat([df1, df2])
A B C D
1950-01-01 0.939932 0.504543 0.091025 NaN
1950-01-02 0.121418 0.725333 0.444813 NaN
1950-01-03 0.338385 0.783398 0.116468 NaN
1950-01-04 0.847905 0.846147 0.226074 NaN
1950-01-05 0.156315 0.704804 0.524886 NaN
1950-01-06 0.412284 0.425379 0.427246 NaN
1950-01-07 0.165859 0.406347 0.114586 NaN
1950-01-08 0.392670 0.789526 0.174001 NaN
1950-01-09 0.246180 0.776304 0.019368 NaN
1950-01-10 0.142213 0.731748 0.954076 NaN
1950-01-11 0.190767 0.905640 NaN 0.325411
1950-01-12 0.109964 0.754694 NaN 0.414402
1950-01-13 0.058164 0.305405 NaN 0.768333
1950-01-14 0.267644 0.919876 NaN 0.631083
1950-01-15 0.981333 0.454678 NaN 0.533075
1950-01-16 0.831600 0.823845 NaN 0.980366
1950-01-17 0.303585 0.091634 NaN 0.338517
1950-01-18 0.723445 0.088020 NaN 0.570779
1950-01-19 0.639665 0.954577 NaN 0.763810
1950-01-20 0.370629 0.716066 NaN 0.628383
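For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two frames and concatenates them. The fixed seed and the name combined are assumptions added for illustration; the question used unseeded random data, so the values will not match the output above.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the example is reproducible.
rng = np.random.default_rng(0)

df1 = pd.DataFrame(rng.random((10, 3)),
                   index=pd.date_range('1950-01-01', periods=10),
                   columns=['A', 'B', 'C'])
df2 = pd.DataFrame(rng.random((10, 3)),
                   index=pd.date_range('1950-01-11', periods=10),
                   columns=['A', 'B', 'D'])

# Rows are stacked; columns are aligned by label (outer join by default),
# so C is NaN on df2's dates and D is NaN on df1's dates.
combined = pd.concat([df1, df2])

print(combined.shape)          # (20, 4)
print(list(combined.columns))  # ['A', 'B', 'C', 'D']
```

Note that concat only stacks rows. If two files ever covered the same date, that date would appear twice in the result, so this fits best when the date ranges do not overlap, as in the question.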
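Another approach is to use the .combine function, which aligns the two frames so that the result covers the union of their labels on both axes. The combiner below keeps the value from first where one exists and falls back to second otherwise. (The output below appears to come from a different random draw, so its values differ from the frames shown above.)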
combiner = lambda x, y: np.where(pd.isnull(x), y, x)
first.combine(second, combiner)
A B C D
1950-01-01 0.7917 0.5289 0.5680 NaN
1950-01-02 0.9256 0.0710 0.0871 NaN
1950-01-03 0.0202 0.8326 0.7782 NaN
1950-01-04 0.8700 0.9786 0.7992 NaN
1950-01-05 0.4615 0.7805 0.1183 NaN
1950-01-06 0.6399 0.1434 0.9447 NaN
1950-01-07 0.5218 0.4147 0.2646 NaN
1950-01-08 0.7742 0.4562 0.5684 NaN
1950-01-09 0.0188 0.6176 0.6121 NaN
1950-01-10 0.6169 0.9437 0.6818 NaN
1950-01-11 0.3595 0.4370 NaN 0.6976
1950-01-12 0.0602 0.6668 NaN 0.6706
1950-01-13 0.2104 0.1289 NaN 0.3154
1950-01-14 0.3637 0.5702 NaN 0.4386
1950-01-15 0.9884 0.1020 NaN 0.2089
1950-01-16 0.1613 0.6531 NaN 0.2533
1950-01-17 0.4663 0.2444 NaN 0.1590
1950-01-18 0.1104 0.6563 NaN 0.1382
1950-01-19 0.1966 0.3687 NaN 0.8210
1950-01-20 0.0971 0.8379 NaN 0.0961
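The combiner above is the "take the first non-null value" rule, which pandas also provides directly as DataFrame.combine_first. The following is a minimal sketch comparing the two; the fixed seed and the names via_combine and via_combine_first are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the example is reproducible.
rng = np.random.default_rng(0)

first = pd.DataFrame(rng.random((10, 3)),
                     index=pd.date_range('1950-01-01', periods=10),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((10, 3)),
                      index=pd.date_range('1950-01-11', periods=10),
                      columns=['A', 'B', 'D'])

# The approach from the answer: prefer x, fall back to y where x is null.
combiner = lambda x, y: np.where(pd.isnull(x), y, x)
via_combine = first.combine(second, combiner)

# The built-in shortcut for the same rule.
via_combine_first = first.combine_first(second)

# Compare the two results, treating NaN in the same position as equal.
same = np.allclose(via_combine.to_numpy(), via_combine_first.to_numpy(),
                   equal_nan=True)
print(same)
print(via_combine_first.shape)
```

Unlike concat, both of these align on the row index as well, so a date present in both frames would yield a single row rather than two.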