How to merge two datasets with different indexes but one common ID factor?
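I'm working with two different datasets: one with COVID-19 statistics and another with demographic characteristics of cities.
The COVID-19 dataset, covid.df, looks like this:
Note: Date, City ID, City and State are all index levels.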
Date | City ID | City | State | Population mean | Population_2019 mean | Confirmed_rate_100k mean | Confirmed_rate_100k std | death_rate mean | death_rate std | new_confirmed | new_deaths |
---|---|---|---|---|---|---|---|---|---|---|---|
2020-02 | 120385 | Los Angeles | CA | 9559699 | 45959669 | 0.653 | 0.556 | 0.6 | 0.01 | 33 | 5 |
2020-02 | 120054 | Houston | Texas | 3304040 | 3343560 | 0.543 | 0.043 | 22.34 | 1.6 | 60 | 9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-05 | 120385 | Los Angeles | CA | 9559483 | 45966549 | 0.672 | 0.032 | 2.3 | 0.5 | 22 | 12 |
The dataset with the demographic information, demo.df, contains the following:
City ID | HDI | Education | Mobility | Poverty |
---|---|---|---|---|
120385 | 0.54 | 72.5 | 55.522 | 33.21 |
120054 | 0.33 | 66.2 | 76.433 | 12.504 |
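I want to add the information from demo.df to covid.df. However, because the two datasets have different indexes, the concat() function has been giving me trouble.
How can I merge the two datasets so that covid.df looks like this: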
Date | City ID | City | State | HDI | Education | Mobility | Poverty | Population mean | Population_2019 mean | Confirmed_rate_100k mean | Confirmed_rate_100k std | death_rate mean | death_rate std | new_confirmed | new_deaths |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-02 | 120385 | Los Angeles | CA | 0.54 | 72.5 | 55.522 | 33.21 | 9559699 | 45959669 | 0.653 | 0.556 | 0.6 | 0.01 | 33 | 5 |
2020-02 | 120054 | Houston | TX | 0.33 | 66.2 | 76.433 | 12.504 | 3304040 | 3343560 | 0.543 | 0.043 | 22.34 | 1.6 | 60 | 9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-05 | 120385 | Los Angeles | CA | 0.54 | 72.5 | 55.522 | 33.21 | 9559483 | 45966549 | 0.672 | 0.032 | 2.3 | 0.5 | 22 | 12 |
Thanks!
You only need this:
covid = covid.merge(demo, how='left', on='City ID')
For example, suppose we have this input (note the unrelated row labels 88, 99 and 'fish', 'fowl'):
covid.df:
Date City ID City State Population mean Population_2019 mean Confirmed_rate_100k mean Confirmed_rate_100k std death_rate mean death_rate std new_confirmed new_deaths
88 2020-02 120385 Los Angeles CA 9559699 45959669 0.653 0.556 0.60 0.01 33 5
99 2020-02 120054 Houston Texas 3304040 3343560 0.543 0.043 22.34 1.60 60 9
demo.df:
City ID HDI Education Mobility Poverty
fish 120385 0.54 72.5 55.522 33.210
fowl 120054 0.33 66.2 76.433 12.504
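If you want to run this yourself, the two frames shown above could be built roughly as follows. This is only a sketch: the values are copied from the printed input, the row labels (88, 99 and 'fish', 'fowl') are set explicitly to show that they play no part in the match, and the final line is the same merge call as above.

```python
import pandas as pd

# COVID-19 frame with arbitrary integer row labels (88, 99)
covid = pd.DataFrame(
    {
        "Date": ["2020-02", "2020-02"],
        "City ID": [120385, 120054],
        "City": ["Los Angeles", "Houston"],
        "State": ["CA", "Texas"],
        "Population mean": [9559699, 3304040],
        "Population_2019 mean": [45959669, 3343560],
        "Confirmed_rate_100k mean": [0.653, 0.543],
        "Confirmed_rate_100k std": [0.556, 0.043],
        "death_rate mean": [0.60, 22.34],
        "death_rate std": [0.01, 1.60],
        "new_confirmed": [33, 60],
        "new_deaths": [5, 9],
    },
    index=[88, 99],
)

# Demographic frame with unrelated string row labels ('fish', 'fowl')
demo = pd.DataFrame(
    {
        "City ID": [120385, 120054],
        "HDI": [0.54, 0.33],
        "Education": [72.5, 66.2],
        "Mobility": [55.522, 76.433],
        "Poverty": [33.210, 12.504],
    },
    index=["fish", "fowl"],
)

# Rows are matched on the 'City ID' column only; the row labels are ignored
# and the result gets a fresh 0..n-1 index.
covid = covid.merge(demo, how="left", on="City ID")
print(covid.shape)  # (2, 16)
```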
The output would be
Date City ID City State Population mean Population_2019 mean Confirmed_rate_100k mean ... death_rate std new_confirmed new_deaths HDI Education Mobility Poverty
0 2020-02 120385 Los Angeles CA 9559699 45959669 0.653 ... 0.01 33 5 0.54 72.5 55.522 33.210
1 2020-02 120054 Houston Texas 3304040 3343560 0.543 ... 1.60 60 9 0.33 66.2 76.433 12.504
[2 rows x 16 columns]
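One thing the example does not cover: the question notes that Date, City ID, City and State are index levels in covid.df, whereas the example treats them as ordinary columns. A cautious way to handle that, rather than relying on how merge treats index level names, is to move the levels into columns, merge, and then restore the index. The sketch below assumes a cut-down covid frame with only two value columns; the names index_levels, demo_cols and merged are mine, and validate="many_to_one" is an optional check that raises if demo has more than one row for the same City ID.

```python
import pandas as pd

# Cut-down stand-in for covid.df, indexed by the four levels from the question
covid = pd.DataFrame(
    {
        "Date": ["2020-02", "2020-02", "2022-05"],
        "City ID": [120385, 120054, 120385],
        "City": ["Los Angeles", "Houston", "Los Angeles"],
        "State": ["CA", "TX", "CA"],
        "new_confirmed": [33, 60, 22],
        "new_deaths": [5, 9, 12],
    }
).set_index(["Date", "City ID", "City", "State"])

demo = pd.DataFrame(
    {
        "City ID": [120385, 120054],
        "HDI": [0.54, 0.33],
        "Education": [72.5, 66.2],
        "Mobility": [55.522, 76.433],
        "Poverty": [33.21, 12.504],
    }
)

index_levels = list(covid.index.names)
demo_cols = [c for c in demo.columns if c != "City ID"]

merged = (
    covid.reset_index()  # index levels become ordinary columns
    .merge(demo, how="left", on="City ID", validate="many_to_one")
    .set_index(index_levels)  # restore the original four-level index
)

# Put the demographic columns first, as in the layout the question asks for
merged = merged[demo_cols + list(covid.columns)]
print(merged)
```

Because this is a left merge, every row of covid is kept, and a city with no entry in demo would simply get NaN in the four demographic columns.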