在pandas合并中使用"outer"如何避免丢失数据?
How to avoid losing data when using "outer" in pandas merging?
我正在使用 pandas 中的两个数据框,它们看起来像下面的那些(不过我使用的数据框有更多的列):
>>> df_download
date | app_name | downloads | app_id
__________________________________________________________
0 2021-01-01 | music app | 500 | 100
1 2021-01-01 | food delivery app | 900 | 110
2 2021-01-01 | fitness app | 1200 | 120
3 2021-01-02 | music app | 700 | 100
4 2021-01-02 | food delivery app | 750 | 110
5 2021-01-02 | fitness app | 3000 | 120
6 2021-01-03 | music app | 800 | 100
7 2021-01-03 | food delivery app | 100 | 110
8 2021-01-03 | fitness app | 400 | 120
>>> df_active_users
date | app_name | active_users | app_id
_______________________________________________________________
0 2021-01-01 | music app | 0 | 100
1 2021-01-01 | food delivery app | 30000 | 110
2 2021-01-01 | fitness app | 90000 | 120
3 2021-01-02 | music app | 15000 | 100
4 2021-01-02 | food delivery app | 0 | 110
5 2021-01-02 | fitness app | 80000 | 120
6 2021-01-03 | music app | 20000 | 100
7 2021-01-03 | food delivery app | 50000 | 110
8 2021-01-03 | fitness app | 40000 | 120
我的目标是合并它们。我设法做到了,但不知何故,我丢失了一个应用程序。送餐应用程序突然消失了。我使用了这段代码:
df_merged = pd.merge(df_active_users[['date', 'active_users', 'app_id', 'app_name']],
df_downloads[['date', 'downloads', 'app_id']], on=['date', 'app_id', 'app_name'], how='outer')
结果:
df_merged['app_name'].unique()
array(['music app','fitness app'], dtype=object)
有谁知道如何在不丢失任何应用程序数据的情况下从两个数据框中获取所有数据?
非常感谢您的帮助!
您丢失了任何东西,您的 merge
功能(几乎)有效:
out = pd.merge(df_active_users, df_downloads,
on=['date', 'app_id', 'app_name'],
how='outer')
>>> out
date app_name active_users app_id downloads
0 2021-01-01 music app 0 100 500
1 2021-01-01 food delivery app 30000 110 900
2 2021-01-01 fitness app 90000 120 1200
3 2021-01-02 music app 15000 100 700
4 2021-01-02 food delivery app 0 110 750
5 2021-01-02 fitness app 80000 120 3000
6 2021-01-03 music app 20000 100 800
7 2021-01-03 food delivery app 50000 110 100
8 2021-01-03 fitness app 40000 120 400
>>> out['app_name'].unique()
array(['music app', 'food delivery app', 'fitness app'], dtype=object)
>>> out['app_name'].value_counts()
music app 3
food delivery app 3
fitness app 3
Name: app_name, dtype: int64
我正在使用 pandas 中的两个数据框,它们看起来像下面的那些(不过我使用的数据框有更多的列):
>>> df_download
date | app_name | downloads | app_id
__________________________________________________________
0 2021-01-01 | music app | 500 | 100
1 2021-01-01 | food delivery app | 900 | 110
2 2021-01-01 | fitness app | 1200 | 120
3 2021-01-02 | music app | 700 | 100
4 2021-01-02 | food delivery app | 750 | 110
5 2021-01-02 | fitness app | 3000 | 120
6 2021-01-03 | music app | 800 | 100
7 2021-01-03 | food delivery app | 100 | 110
8 2021-01-03 | fitness app | 400 | 120
>>> df_active_users
date | app_name | active_users | app_id
_______________________________________________________________
0 2021-01-01 | music app | 0 | 100
1 2021-01-01 | food delivery app | 30000 | 110
2 2021-01-01 | fitness app | 90000 | 120
3 2021-01-02 | music app | 15000 | 100
4 2021-01-02 | food delivery app | 0 | 110
5 2021-01-02 | fitness app | 80000 | 120
6 2021-01-03 | music app | 20000 | 100
7 2021-01-03 | food delivery app | 50000 | 110
8 2021-01-03 | fitness app | 40000 | 120
我的目标是合并它们。我设法做到了,但不知何故,我丢失了一个应用程序。送餐应用程序突然消失了。我使用了这段代码:
df_merged = pd.merge(df_active_users[['date', 'active_users', 'app_id', 'app_name']],
df_downloads[['date', 'downloads', 'app_id']], on=['date', 'app_id', 'app_name'], how='outer')
结果:
df_merged['app_name'].unique()
array(['music app','fitness app'], dtype=object)
有谁知道如何在不丢失任何应用程序数据的情况下从两个数据框中获取所有数据? 非常感谢您的帮助!
您丢失了任何东西,您的 merge
功能(几乎)有效:
out = pd.merge(df_active_users, df_downloads,
on=['date', 'app_id', 'app_name'],
how='outer')
>>> out
date app_name active_users app_id downloads
0 2021-01-01 music app 0 100 500
1 2021-01-01 food delivery app 30000 110 900
2 2021-01-01 fitness app 90000 120 1200
3 2021-01-02 music app 15000 100 700
4 2021-01-02 food delivery app 0 110 750
5 2021-01-02 fitness app 80000 120 3000
6 2021-01-03 music app 20000 100 800
7 2021-01-03 food delivery app 50000 110 100
8 2021-01-03 fitness app 40000 120 400
>>> out['app_name'].unique()
array(['music app', 'food delivery app', 'fitness app'], dtype=object)
>>> out['app_name'].value_counts()
music app 3
food delivery app 3
fitness app 3
Name: app_name, dtype: int64