How to Merge two Panel data sets on Date and a combination of columns?

I have two data sets, df1 & df2, which look like this:

df1:

Date     Code   City         State  Population  Cases  Deaths
2020-03  10001  Los Angeles  CA     5000000     122    12
2020-03  10002  Sacramento   CA     5400000     120    2
2020-03  12223  Houston      TX     3500000     23     11
...      ...    ...          ...    ...         ...    ...
2021-07  10001  Los Angeles  CA     5000002     12220  2200
2021-07  10002  Sacramento   CA     5444000     211    22
2021-07  12223  Houston      TX     4443300     2111   330

df2:

Date     Code     City         State  Quantity x  Quantity y
2019-01  100015   LOS ANGELES  CA     NA          445
2019-01  100015   LOS ANGELES  CA     330         NA
2019-01  100023   SACRAMENTO   CA     4450        566
2019-01  1222393  HOUSTON      TX     440         NA
...      ...      ...          ...    ...         ...
2021-07  100015   LOS ANGELES  CA     31113       3455
2021-07  100023   SACRAMENTO   CA     3220        NA
2021-07  1222393  HOUSTON      TX     NA          3200

As you can see, df2 starts before df1, but both end on the same date. Also, the Code IDs in df1 and df2 share some digits but are not equal (in general, the Codes in df2 have one or two more digits than those in df1).

Also note that df2 can have multiple entries for the same date, with different quantities.

I want to merge the two. More specifically, I want to merge df1 onto df2, so that the result starts at 2019-01 and ends at 2021-07. In that case, Cases and Deaths would be 0 between 2019-01 and 2020-02.

How do I merge df1 onto df2 on Date, City, and State (given that some cities have the same name but are in different states), with the date handling described above? I would like my combined data frame df3 to look like this:

Date     Code   City         State  Quantity x  Quantity y  Population  Cases  Deaths
2019-01  10001  Los Angeles  CA     NA          445         NA          0      0
2019-01  10002  Sacramento   CA     4450        566         NA          0      0
2020-03  12223  Houston      TX     440         4440        3500000     23     11
...      ...    ...          ...    ...         ...         ...         ...    ...
2021-07  10002  Sacramento   CA     3220        NA          5444000     211    22

Edit: I was thinking that maybe I could first trim df1's date range and then do the merge. I'm not sure how an outer merge handles dates that don't necessarily overlap. Maybe someone has a better idea.

It sounds like you're looking for the how keyword argument in pd.DataFrame.merge and pd.DataFrame.join.
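
To your edit: an outer join keeps the union of the keys, so non-overlapping dates are preserved and the missing side is simply filled with NaN. A minimal sketch with toy frames (hypothetical values, purely to show the behavior):

import pandas as pd

# Two single-column frames with non-overlapping dates
left = pd.DataFrame({"Cases": [5]}, index=pd.Index(["2020-03"], name="Date"))
right = pd.DataFrame({"Qty": [7]}, index=pd.Index(["2019-01"], name="Date"))

print(left.join(right, how="outer"))
#          Cases  Qty
# Date
# 2019-01    NaN  7.0   <- kept from `right`; Cases filled with NaN
# 2020-03    5.0  NaN   <- kept from `left`;  Qty   filled with NaN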

Here's a complete example:

import pandas as pd
from io import StringIO  # pandas >= 2.1 deprecates passing literal JSON strings directly

df1 = pd.read_json(StringIO(
    '{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
))
df2 = pd.read_json(StringIO(
    '{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
))

print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)

# Transform df1
df1["City"] = df1["City"].apply(str.upper)  # To merge, need consistent casing
df1 = df1.groupby(["Date", "City", "State"])[
    ["Cases", "Deaths"]
].sum()  # Aggregate cases + deaths just in case...
# NOTE: this groupby keeps only Cases and Deaths, so Population is dropped
# from here on; one way to keep it is sketched at the end of this answer.


# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
    ["Quantity x", "Quantity y"]
].sum()  # implicit skipna=True

print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)

# MERGE: merging on indices
df3 = df1.join(df2, how="outer")  # key: "how"
df3[["Cases", "Deaths"]] = (
    df3[["Cases", "Deaths"]].fillna(0).astype(int)
)  # inplace: downcasting complaint

df3.reset_index(
    inplace=True
)  # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.

print("\ndf3 = \n", df3)

...and the output is:

df1 = 
         Date         City State  Population  Cases  Deaths
0 2020-03-01  Los Angeles    CA     5000000    122      12
1 2020-03-01   Sacramento    CA     5400000    120       2
2 2020-03-01      Houston    TX     3500000     23      11
3 2021-07-01  Los Angeles    CA     5000002  12220    2200
4 2021-07-01   Sacramento    CA     5444000    211      22
5 2021-07-01      Houston    TX     4443300   2111     330

df2 = 
         Date         City State  Quantity x  Quantity y
0 2019-01-01  LOS ANGELES    CA         NaN       445.0
1 2019-01-01  LOS ANGELES    CA       330.0         NaN
2 2019-01-01   SACRAMENTO    CA      4450.0       566.0
3 2019-01-01      HOUSTON    TX       440.0         NaN
4 2021-07-01  LOS ANGELES    CA     31113.0      3455.0
5 2021-07-01   SACRAMENTO    CA      3220.0         NaN
6 2021-07-01      HOUSTON    TX         NaN      3200.0

df1' = 
                               Cases  Deaths
Date       City        State               
2020-03-01 HOUSTON     TX        23      11
           LOS ANGELES CA       122      12
           SACRAMENTO  CA       120       2
2021-07-01 HOUSTON     TX      2111     330
           LOS ANGELES CA     12220    2200
           SACRAMENTO  CA       211      22

df2' = 
                               Quantity x  Quantity y
Date       City        State                        
2019-01-01 HOUSTON     TX          440.0         0.0
           LOS ANGELES CA          330.0       445.0
           SACRAMENTO  CA         4450.0       566.0
2021-07-01 HOUSTON     TX            0.0      3200.0
           LOS ANGELES CA        31113.0      3455.0
           SACRAMENTO  CA         3220.0         0.0

df3 = 
         Date         City State  Cases  Deaths  Quantity x  Quantity y
0 2019-01-01      HOUSTON    TX      0       0       440.0         0.0
1 2019-01-01  LOS ANGELES    CA      0       0       330.0       445.0
2 2019-01-01   SACRAMENTO    CA      0       0      4450.0       566.0
3 2020-03-01      HOUSTON    TX     23      11         NaN         NaN
4 2020-03-01  LOS ANGELES    CA    122      12         NaN         NaN
5 2020-03-01   SACRAMENTO    CA    120       2         NaN         NaN
6 2021-07-01      HOUSTON    TX   2111     330         0.0      3200.0
7 2021-07-01  LOS ANGELES    CA  12220    2200     31113.0      3455.0
8 2021-07-01   SACRAMENTO    CA    211      22      3220.0         0.0
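
As a quick sanity check, you can then pull a single city's panel out of df3:

print(df3[df3["City"] == "HOUSTON"])

#         Date     City State  Cases  Deaths  Quantity x  Quantity y
# 0 2019-01-01  HOUSTON    TX      0       0       440.0         0.0
# 3 2020-03-01  HOUSTON    TX     23      11         NaN         NaN
# 6 2021-07-01  HOUSTON    TX   2111     330         0.0      3200.0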

A few other points:

  • The City casing needs to be consistent for the join/merge.
  • You could also do df1.merge(df2, ..., left_index=True, right_index=True) instead of df1.join. Alternatively, reset the indices after the groupby-sum lines via df1.reset_index(inplace=True) etc. and then use .merge(..., on=...) (but the indices are convenient); see the sketch after this list.
  • The final Quantity {x,y} values are floats because of the NaNs. (See the next point.)
  • I would think carefully about your treatment of NaNs vs. auto-filled 0s. In the case of Cases/Deaths, it sounds like you have no data, but you assume that, in the absence of Cases/Deaths data, the value is 0. For the Quantity {x,y} variables, such an assumption seems unwarranted.
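
For reference, here's what that column-based merge could look like, extended to carry Population through (a sketch assuming Population is constant within each Date/City/State group, so taking the first value per group is reasonable; run it on the original df1/df2, before the groupby-sum lines above):

df1["City"] = df1["City"].apply(str.upper)  # same casing fix as above

# Named aggregation keeps Population alongside the summed counts
agg1 = df1.groupby(["Date", "City", "State"], as_index=False).agg(
    Population=("Population", "first"),  # assumes one Population per group
    Cases=("Cases", "sum"),
    Deaths=("Deaths", "sum"),
)
agg2 = df2.groupby(["Date", "City", "State"], as_index=False)[
    ["Quantity x", "Quantity y"]
].sum()

df3 = agg2.merge(agg1, on=["Date", "City", "State"], how="outer")
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0).astype(int)

Population stays NaN for the 2019 rows, since there is genuinely no population data there; whether to leave it or fill it is the same judgment call as in the last point above.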