How to Merge two Panel data sets on Date and a combination of columns?

I have two data sets, df1 & df2, which look like this:

df1:

Date     Code   City         State  Population  Cases  Deaths
2020-03  10001  Los Angeles  CA     5000000     122    12
2020-03  10002  Sacramento   CA     5400000     120    2
2020-03  12223  Houston      TX     3500000     23     11
...      ...    ...          ...    ...         ...    ...
2021-07  10001  Los Angeles  CA     5000002     12220  2200
2021-07  10002  Sacramento   CA     5444000     211    22
2021-07  12223  Houston      TX     4443300     2111   330

df2:

Date     Code     City         State  Quantity x  Quantity y
2019-01  100015   LOS ANGELES  CA     NA          445
2019-01  100015   LOS ANGELES  CA     330         NA
2019-01  100023   SACRAMENTO   CA     4450        566
2019-01  1222393  HOUSTON      TX     440         NA
...      ...      ...          ...    ...         ...
2021-07  100015   LOS ANGELES  CA     31113       3455
2021-07  100023   SACRAMENTO   CA     3220        NA
2021-07  1222393  HOUSTON      TX     NA          3200

As you can see, df2 starts before df1, but both end on the same date. Also, the Code IDs in df1 and df2 share some digits but are not equal (in general, the Codes in df2 have one or two more digits than those in df1).

Also note that df2 can have multiple entries for the same date, with different quantities.

I want to merge the two. More specifically, I want to merge df1 onto df2, so that the result starts at 2019-01 and ends at 2021-07. In that case, Cases and Deaths would be 0 between 2019-01 and 2020-02.

How do I merge df1 onto df2 on Date, City, and State (given that some cities have the same name but are in different states), with the date handling described above? I would like my combined data frame df3 to look like this:

Date     Code   City         State  Quantity x  Quantity y  Population  Cases  Deaths
2019-01  10001  Los Angeles  CA     NA          445         NA          0      0
2019-01  10002  Sacramento   CA     4450        566         NA          0      0
2020-03  12223  Houston      TX     440         4440        3500000     23     11
...      ...    ...          ...    ...         ...         ...         ...    ...
2021-07  10002  Sacramento   CA     3220        NA          5444000     211    22

Edit: I was thinking that maybe I could first trim df1's date range and then do the merge. I'm not sure how an outer merge handles dates that don't necessarily overlap. Maybe someone has a better idea.

It sounds like you're looking for the how keyword argument in pd.DataFrame.merge and pd.DataFrame.join.
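
To your edit: an outer join keeps the union of the keys, so non-overlapping dates are preserved and the missing side is simply filled with NaN. A minimal sketch with toy frames (hypothetical values, purely to show the behavior):

import pandas as pd

# Two single-column frames with non-overlapping dates
left = pd.DataFrame({"Cases": [5]}, index=pd.Index(["2020-03"], name="Date"))
right = pd.DataFrame({"Qty": [7]}, index=pd.Index(["2019-01"], name="Date"))

print(left.join(right, how="outer"))
#          Cases  Qty
# Date
# 2019-01    NaN  7.0   <- kept from `right`; Cases filled with NaN
# 2020-03    5.0  NaN   <- kept from `left`;  Qty   filled with NaN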

Here's a complete example:

import pandas as pd
from io import StringIO  # pandas >= 2.1 deprecates passing literal JSON strings directly

df1 = pd.read_json(StringIO(
    '{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
))
df2 = pd.read_json(StringIO(
    '{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
))

print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)

# Transform df1
df1["City"] = df1["City"].apply(str.upper)  # To merge, need consistent casing
df1 = df1.groupby(["Date", "City", "State"])[
    ["Cases", "Deaths"]
].sum()  # Aggregate cases + deaths just in case...
# NOTE: this groupby keeps only Cases and Deaths, so Population is dropped
# from here on; one way to keep it is sketched at the end of this answer.


# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
    ["Quantity x", "Quantity y"]
].sum()  # implicit skipna=True

print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)

# MERGE: merging on indices
df3 = df1.join(df2, how="outer")  # key: "how"
df3[["Cases", "Deaths"]] = (
    df3[["Cases", "Deaths"]].fillna(0).astype(int)
)  # inplace: downcasting complaint

df3.reset_index(
    inplace=True
)  # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.

print("\ndf3 = \n", df3)

...and the output is:

df1 = 
         Date         City State  Population  Cases  Deaths
0 2020-03-01  Los Angeles    CA     5000000    122      12
1 2020-03-01   Sacramento    CA     5400000    120       2
2 2020-03-01      Houston    TX     3500000     23      11
3 2021-07-01  Los Angeles    CA     5000002  12220    2200
4 2021-07-01   Sacramento    CA     5444000    211      22
5 2021-07-01      Houston    TX     4443300   2111     330

df2 = 
         Date         City State  Quantity x  Quantity y
0 2019-01-01  LOS ANGELES    CA         NaN       445.0
1 2019-01-01  LOS ANGELES    CA       330.0         NaN
2 2019-01-01   SACRAMENTO    CA      4450.0       566.0
3 2019-01-01      HOUSTON    TX       440.0         NaN
4 2021-07-01  LOS ANGELES    CA     31113.0      3455.0
5 2021-07-01   SACRAMENTO    CA      3220.0         NaN
6 2021-07-01      HOUSTON    TX         NaN      3200.0

df1' = 
                               Cases  Deaths
Date       City        State               
2020-03-01 HOUSTON     TX        23      11
           LOS ANGELES CA       122      12
           SACRAMENTO  CA       120       2
2021-07-01 HOUSTON     TX      2111     330
           LOS ANGELES CA     12220    2200
           SACRAMENTO  CA       211      22

df2' = 
                               Quantity x  Quantity y
Date       City        State                        
2019-01-01 HOUSTON     TX          440.0         0.0
           LOS ANGELES CA          330.0       445.0
           SACRAMENTO  CA         4450.0       566.0
2021-07-01 HOUSTON     TX            0.0      3200.0
           LOS ANGELES CA        31113.0      3455.0
           SACRAMENTO  CA         3220.0         0.0

df3 = 
         Date         City State  Cases  Deaths  Quantity x  Quantity y
0 2019-01-01      HOUSTON    TX      0       0       440.0         0.0
1 2019-01-01  LOS ANGELES    CA      0       0       330.0       445.0
2 2019-01-01   SACRAMENTO    CA      0       0      4450.0       566.0
3 2020-03-01      HOUSTON    TX     23      11         NaN         NaN
4 2020-03-01  LOS ANGELES    CA    122      12         NaN         NaN
5 2020-03-01   SACRAMENTO    CA    120       2         NaN         NaN
6 2021-07-01      HOUSTON    TX   2111     330         0.0      3200.0
7 2021-07-01  LOS ANGELES    CA  12220    2200     31113.0      3455.0
8 2021-07-01   SACRAMENTO    CA    211      22      3220.0         0.0
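
As a quick sanity check, you can then pull a single city's panel out of df3:

print(df3[df3["City"] == "HOUSTON"])

#         Date     City State  Cases  Deaths  Quantity x  Quantity y
# 0 2019-01-01  HOUSTON    TX      0       0       440.0         0.0
# 3 2020-03-01  HOUSTON    TX     23      11         NaN         NaN
# 6 2021-07-01  HOUSTON    TX   2111     330         0.0      3200.0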

A few other points:

  • The City casing needs to be consistent for the join/merge.
  • You could also do df1.merge(df2, ..., left_index=True, right_index=True) instead of df1.join. Alternatively, reset the indices after the groupby-sum lines via df1.reset_index(inplace=True) etc. and then use .merge(..., on=...) (but the indices are convenient); see the sketch after this list.
  • The final Quantity {x,y} values are floats because of the NaNs. (See the next point.)
  • I would think carefully about your treatment of NaNs vs. auto-filled 0s. In the case of Cases/Deaths, it sounds like you have no data, but you assume that, in the absence of Cases/Deaths data, the value is 0. For the Quantity {x,y} variables, such an assumption seems unwarranted.
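
For reference, here's what that column-based merge could look like, extended to carry Population through (a sketch assuming Population is constant within each Date/City/State group, so taking the first value per group is reasonable; run it on the original df1/df2, before the groupby-sum lines above):

df1["City"] = df1["City"].apply(str.upper)  # same casing fix as above

# Named aggregation keeps Population alongside the summed counts
agg1 = df1.groupby(["Date", "City", "State"], as_index=False).agg(
    Population=("Population", "first"),  # assumes one Population per group
    Cases=("Cases", "sum"),
    Deaths=("Deaths", "sum"),
)
agg2 = df2.groupby(["Date", "City", "State"], as_index=False)[
    ["Quantity x", "Quantity y"]
].sum()

df3 = agg2.merge(agg1, on=["Date", "City", "State"], how="outer")
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0).astype(int)

Population stays NaN for the 2019 rows, since there is genuinely no population data there; whether to leave it or fill it is the same judgment call as in the last point above.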