How to Merge two Panel data sets on Date and a combination of columns?
I have two data sets, `df1` & `df2`, which look like this:

`df1`:
Date | Code | City | State | Population | Cases | Deaths |
---|---|---|---|---|---|---|
2020-03 | 10001 | Los Angeles | CA | 5000000 | 122 | 12 |
2020-03 | 10002 | Sacramento | CA | 5400000 | 120 | 2 |
2020-03 | 12223 | Houston | TX | 3500000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 10001 | Los Angeles | CA | 5000002 | 12220 | 2200 |
2021-07 | 10002 | Sacramento | CA | 5444000 | 211 | 22 |
2021-07 | 12223 | Houston | TX | 4443300 | 2111 | 330 |
`df2`:

Date | Code | City | State | Quantity x | Quantity y |
---|---|---|---|---|---|
2019-01 | 100015 | LOS ANGELES | CA | 445 | |
2019-01 | 100015 | LOS ANGELES | CA | 330 | |
2019-01 | 100023 | SACRAMENTO | CA | 4450 | 566 |
2019-01 | 1222393 | HOUSTON | TX | 440 | NA |
... | ... | ... | ... | ... | ... |
2021-07 | 100015 | LOS ANGELES | CA | 31113 | 3455 |
2021-07 | 100023 | SACRAMENTO | CA | 3220 | NA |
2021-07 | 1222393 | HOUSTON | TX | NA | 3200 |
As you can see, `df2` starts before `df1`, but both end on the same date. Also, the IDs in `df1` and `df2` have some commonality but are not equal (in general, `df2`'s codes have one or two more digits than `df1`'s).

Also note that `df2` may have multiple entries for the same date, but with different quantities.
I want to merge the two; more specifically, I want to merge `df1` onto `df2`, so that the result starts in 2019-01 and ends in 2021-07. In that case, `Cases` and `Deaths` would be 0 between 2019-01 and 2020-02.

How can I merge `df1` onto `df2` on date, city, and state (given that some cities have the same name but are in different states), with the date specification above? I would like my combined dataframe `df3` to look like this:
Date | Code | City | State | Quantity x | Quantity y | Population | Cases | Deaths |
---|---|---|---|---|---|---|---|---|
2019-01 | 10001 | Los Angeles | CA | 445 | | | 0 | 0 |
2019-01 | 10002 | Sacramento | CA | 4450 | 556 | | 0 | 0 |
2020-03 | 12223 | Houston | TX | 440 | 4440 | 35000000 | 23 | 11 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 10002 | Sacramento | CA | 3220 | NA | 5444000 | 211 | 22 |
Edit: I was thinking that maybe I could first cut down `df1`'s date range and then do the merge. I am not sure how an outer merge handles dates that do not necessarily overlap. Maybe someone has a better idea.
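A minimal sketch of that trimming idea, assuming `Date` is either a `"YYYY-MM"` string or a proper datetime column (the window bounds below are just for illustration):

```python
# Hypothetical trim-before-merge step: restrict each frame to the window of
# interest before merging (bounds chosen only for illustration).
start, end = "2019-01", "2021-07"
df1_window = df1[(df1["Date"] >= start) & (df1["Date"] <= end)]
df2_window = df2[(df2["Date"] >= start) & (df2["Date"] <= end)]
```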
It sounds like you are looking for the `how` keyword argument in `pd.DataFrame.merge` and `pd.DataFrame.join`.

Here is an example:
import pandas as pd
df1 = pd.read_json(
'{"Date":{"0":1583020800000,"1":1583020800000,"2":1583020800000,"3":1625097600000,"4":1625097600000,"5":1625097600000},"City":{"0":"Los Angeles","1":"Sacramento","2":"Houston","3":"Los Angeles","4":"Sacramento","5":"Houston"},"State":{"0":"CA","1":"CA","2":"TX","3":"CA","4":"CA","5":"TX"},"Population":{"0":5000000,"1":5400000,"2":3500000,"3":5000002,"4":5444000,"5":4443300},"Cases":{"0":122,"1":120,"2":23,"3":12220,"4":211,"5":2111},"Deaths":{"0":12,"1":2,"2":11,"3":2200,"4":22,"5":330}}'
)
df2 = pd.read_json(
'{"Date":{"0":1546300800000,"1":1546300800000,"2":1546300800000,"3":1546300800000,"4":1625097600000,"5":1625097600000,"6":1625097600000},"City":{"0":"LOS ANGELES","1":"LOS ANGELES","2":"SACRAMENTO","3":"HOUSTON","4":"LOS ANGELES","5":"SACRAMENTO","6":"HOUSTON"},"State":{"0":"CA","1":"CA","2":"CA","3":"TX","4":"CA","5":"CA","6":"TX"},"Quantity x":{"0":null,"1":330.0,"2":4450.0,"3":440.0,"4":31113.0,"5":3220.0,"6":null},"Quantity y":{"0":445.0,"1":null,"2":566.0,"3":null,"4":3455.0,"5":null,"6":3200.0}}'
)
print("\ndf1 = \n", df1)
print("\ndf2 = \n", df2)
# Transform df1
df1["City"] = df1["City"].apply(str.upper) # To merge, need consistent casing
df1 = df1.groupby(["Date", "City", "State"])[
["Cases", "Deaths"]
].sum() # Aggregate cases + deaths just in case...
# Aggregate in df2
df2 = df2.groupby(["Date", "City", "State"])[
["Quantity x", "Quantity y"]
].sum() # implicit skipna=True
print("\ndf1' = \n", df1)
print("\ndf2' = \n", df2)
# MERGE: merging on indices
df3 = df1.join(df2, how="outer") # key: "how"
df3[["Cases", "Deaths"]] = (
df3[["Cases", "Deaths"]].fillna(0).astype(int)
) # inplace: downcasting complaint
df3.reset_index(
    inplace=True
) # Will cause ["Date", "City", "State"] to be ordinary columns, not indices.
print("\ndf3 = \n", df3)
...and the output is:
df1 =
Date City State Population Cases Deaths
0 2020-03-01 Los Angeles CA 5000000 122 12
1 2020-03-01 Sacramento CA 5400000 120 2
2 2020-03-01 Houston TX 3500000 23 11
3 2021-07-01 Los Angeles CA 5000002 12220 2200
4 2021-07-01 Sacramento CA 5444000 211 22
5 2021-07-01 Houston TX 4443300 2111 330
df2 =
Date City State Quantity x Quantity y
0 2019-01-01 LOS ANGELES CA NaN 445.0
1 2019-01-01 LOS ANGELES CA 330.0 NaN
2 2019-01-01 SACRAMENTO CA 4450.0 566.0
3 2019-01-01 HOUSTON TX 440.0 NaN
4 2021-07-01 LOS ANGELES CA 31113.0 3455.0
5 2021-07-01 SACRAMENTO CA 3220.0 NaN
6 2021-07-01 HOUSTON TX NaN 3200.0
df1' =
Cases Deaths
Date City State
2020-03-01 HOUSTON TX 23 11
LOS ANGELES CA 122 12
SACRAMENTO CA 120 2
2021-07-01 HOUSTON TX 2111 330
LOS ANGELES CA 12220 2200
SACRAMENTO CA 211 22
df2' =
Quantity x Quantity y
Date City State
2019-01-01 HOUSTON TX 440.0 0.0
LOS ANGELES CA 330.0 445.0
SACRAMENTO CA 4450.0 566.0
2021-07-01 HOUSTON TX 0.0 3200.0
LOS ANGELES CA 31113.0 3455.0
SACRAMENTO CA 3220.0 0.0
df3 =
Date City State Cases Deaths Quantity x Quantity y
0 2019-01-01 HOUSTON TX 0 0 440.0 0.0
1 2019-01-01 LOS ANGELES CA 0 0 330.0 445.0
2 2019-01-01 SACRAMENTO CA 0 0 4450.0 566.0
3 2020-03-01 HOUSTON TX 23 11 NaN NaN
4 2020-03-01 LOS ANGELES CA 122 12 NaN NaN
5 2020-03-01 SACRAMENTO CA 120 2 NaN NaN
6 2021-07-01 HOUSTON TX 2111 330 0.0 3200.0
7 2021-07-01 LOS ANGELES CA 12220 2200 31113.0 3455.0
8 2021-07-01 SACRAMENTO CA 211 22 3220.0 0.0
A few other points:

- The `City` casing needs to be made consistent before the join/merge.
- You could also do `df1.merge(df2, ..., left_index=True, right_index=True)` instead of `df1.join`. Alternatively, reset the indices after the groupby-sum lines via `df1.reset_index(inplace=True)` etc. and then use `.merge(..., on=...)` (but the indices are convenient); a sketch of that variant follows this list.
- The final `Quantity {x,y}` values are floats because of the `NaN`s. (See the next point.)
- I would think carefully about how you treat `NaN`s vs. auto-filled 0s. For `Cases`/`Deaths`, it sounds like you have no data, but you are assuming that, in the absence of `Cases`/`Deaths` data, the value is 0. For the `Quantity {x,y}` variables, that assumption seems unwarranted.
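For the merge-on-columns variant mentioned above, here is a minimal sketch. It picks up `df1`/`df2` right after the groupby-sum step in the example; the `Int64` cast and the `min_count` note are just one possible way to handle the `NaN`-vs-0 question, not part of the original flow:

```python
# Alternative to .join: move the MultiIndex levels back into ordinary columns
# and merge on them explicitly (df1/df2 are the grouped frames from above).
df1_flat = df1.reset_index()
df2_flat = df2.reset_index()
df3 = df1_flat.merge(df2_flat, on=["Date", "City", "State"], how="outer")

# Fill only Cases/Deaths with 0 (the assumption discussed above) and leave the
# Quantity columns missing where there really is no data.
df3[["Cases", "Deaths"]] = df3[["Cases", "Deaths"]].fillna(0).astype(int)

# Optional: pandas' nullable integer dtype keeps missing quantities as <NA>
# instead of forcing the whole column to float.
df3[["Quantity x", "Quantity y"]] = df3[["Quantity x", "Quantity y"]].astype("Int64")

# Relatedly, if the groupby step should keep all-NaN groups as NaN rather than
# summing them to 0, min_count does that:
# df2.groupby(["Date", "City", "State"])[["Quantity x", "Quantity y"]].sum(min_count=1)
```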