Filling NA values in one dataframe from another based on two columns
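What is the most efficient way to fill NA values in one dataframe from another when there are multiple columns to match on (city and rooms in this example)?
Example dataframes to combine, and the desired result: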
import pandas as pd
import numpy as np
d1 = {'city': ['New York', 'Shanghai', 'Boston', 'Shanghai', 'Shanghai'],
      'rooms': ["1", "2", "3", "2", "2"],
      'floor': ["4", "5", "6", "10", "8"],
      'rent': [500, np.nan, 1500, 2000, np.nan]}
d2 = {'city': ['Shanghai'],
      'rooms': ["2"],
      'rent': [1000]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
result = {'city': ['New York', 'Shanghai', 'Boston', 'Shanghai', 'Shanghai'],
          'rooms': ["1", "2", "3", "2", "2"],
          'floor': ["4", "5", "6", "10", "8"],
          'rent': [500, 1000, 1500, 2000, 1000]}
result_df = pd.DataFrame(data=result)
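Set the index of both frames to the two shared columns so that they align, then fill the column you need. Here the common columns are city and rooms: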
cols = ['city', 'rooms']
Set the index for df1:
df1 = df1.set_index(cols)
Set the index for df2:
df2 = df2.set_index(cols).rent  # select the rent column so df2 becomes a Series
Fill df1 using df2, then reset the index (resetting is optional; the index can be useful to keep):
df1.fillna({"rent": df2}).reset_index()
       city rooms floor    rent
0  New York     1     4   500.0
1  Shanghai     2     5  1000.0
2    Boston     3     6  1500.0
3  Shanghai     2    10  2000.0
4  Shanghai     2     8  1000.0
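As a quick sanity check, the filled frame can be compared against the expected result from the question. The sketch below rebuilds the example data so it runs on its own; the names `lookup`, `filled` and `expected` are introduced here for illustration only. Because the filled rent column comes back as float while the expected one is integer, the comparison ignores dtypes.

```python
import numpy as np
import pandas as pd

cols = ["city", "rooms"]

df1 = pd.DataFrame({
    "city": ["New York", "Shanghai", "Boston", "Shanghai", "Shanghai"],
    "rooms": ["1", "2", "3", "2", "2"],
    "floor": ["4", "5", "6", "10", "8"],
    "rent": [500, np.nan, 1500, 2000, np.nan],
})
df2 = pd.DataFrame({"city": ["Shanghai"], "rooms": ["2"], "rent": [1000]})

# Expected result, as given in the question (rent is integer here).
expected = pd.DataFrame({
    "city": ["New York", "Shanghai", "Boston", "Shanghai", "Shanghai"],
    "rooms": ["1", "2", "3", "2", "2"],
    "floor": ["4", "5", "6", "10", "8"],
    "rent": [500, 1000, 1500, 2000, 1000],
})

# Same steps as in the answer: index both frames by the key columns,
# fill the missing rents by index alignment, then restore the columns.
lookup = df2.set_index(cols)["rent"]
filled = df1.set_index(cols).fillna({"rent": lookup}).reset_index()

# Raises AssertionError if the values differ; dtype differences are ignored.
pd.testing.assert_frame_equal(filled, expected, check_dtype=False)
```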
Note that this approach only works when each (city, rooms) combination appears at most once in df2.
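If df2 does contain repeated (city, rooms) pairs, one possible workaround is to collapse them to a single value per pair before filling. The sketch below is only an illustration under assumptions: `df2_dup` is a made-up frame with two Shanghai entries, and taking the mean is an arbitrary choice (first, last or median may suit the data better).

```python
import numpy as np
import pandas as pd

cols = ["city", "rooms"]

df1 = pd.DataFrame({
    "city": ["New York", "Shanghai", "Boston", "Shanghai", "Shanghai"],
    "rooms": ["1", "2", "3", "2", "2"],
    "floor": ["4", "5", "6", "10", "8"],
    "rent": [500, np.nan, 1500, 2000, np.nan],
})

# Hypothetical lookup frame with a duplicated (city, rooms) key.
df2_dup = pd.DataFrame({
    "city": ["Shanghai", "Shanghai"],
    "rooms": ["2", "2"],
    "rent": [1000, 1200],
})

# Aggregate so each key maps to exactly one rent; the result is a Series
# indexed by (city, rooms) with unique entries.
lookup = df2_dup.groupby(cols)["rent"].mean()

# With a unique lookup index, the fill by index alignment works as before.
filled = df1.set_index(cols).fillna({"rent": lookup}).reset_index()
```

Here the two missing Shanghai rents are filled with the average of the two lookup values, while the existing 2000 is left untouched.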