DataFrame combine_first does not work as well as fillna
The first DataFrame is:
data_date cookie_type dau next_dau dau_7 dau_15
0 20181006 avg(0-d) 2288 NaN NaN NaN
1 20181006 avg(e-f) 2284 NaN NaN NaN
2 20181007 avg(e-f) 2296 100 NaN NaN
The second DataFrame is:
data_date cookie_type next_dau
0 20181006 avg(e-f) 908
1 20181006 avg(0-d) 904
How can I update next_dau in the first DataFrame from the second one?
I tried combine_first and fillna, but they do not seem to support a MultiIndex:
cols = ['data_date', 'cookie_type']
if (frame1 is not None and not frame1.empty):
    frame1.set_index(cols)
    print(frame1)
    print(next_day_dau)
    frame1.combine_first(next_day_dau.set_index(cols))
    frame1.combine_first(dau_7.set_index(cols))
    frame1.combine_first(dau_15.set_index(cols))
Then I changed it to:
frame1.index = frame1.data_date.astype(str) + frame1.cookie_type
next_day_dau.index = next_day_dau.data_date.astype(str) + next_day_dau.cookie_type
dau_7.index = dau_7.data_date.astype(str) + dau_7.cookie_type
dau_15.index = dau_15.data_date.astype(str) + dau_15.cookie_type
"""frame1.loc[next_day_dau.index, "next_dau"] = next_day_dau.next_dau
frame1.loc[dau_7.index, "dau_7"] = dau_7.dau_7
frame1.loc[dau_15.index, "dau_15"] = dau_15.dau_15"""
frame1.combine_first(next_day_dau)
frame1.combine_first(dau_7)
frame1.combine_first(dau_15)
print(frame1)
print(next_day_dau)
loc raises an error because next_day_dau does not contain every index label in frame1. I then tried combine_first, and fillna with inplace=True; neither worked.
{'data_date': {'20181007avg(0-d)': 20181007, '20181007avg(e-f)': 20181007, '20181006avg(0-d)': 20181006, '20181006avg(e-f)': 20181006}, 'cookie_type': {'20181007avg(0-d)': 'avg(0-d)', '20181007avg(e-f)': 'avg(e-f)', '20181006avg(0-d)': 'avg(0-d)', '20181006avg(e-f)': 'avg(e-f)'}, 'dau': {'20181007avg(0-d)': 2288, '20181007avg(e-f)': 2284, '20181006avg(0-d)': 2288, '20181006avg(e-f)': 2284}, 'next_dau': {'20181007avg(0-d)': nan, '20181007avg(e-f)': nan, '20181006avg(0-d)': nan, '20181006avg(e-f)': nan}, 'dau_7': {'20181007avg(0-d)': nan, '20181007avg(e-f)': nan, '20181006avg(0-d)': nan, '20181006avg(e-f)': nan}, 'dau_15': {'20181007avg(0-d)': nan, '20181007avg(e-f)': nan, '20181006avg(0-d)': nan, '20181006avg(e-f)': nan}}
{'data_date': {0: '20181007', 1: '20181007'}, 'cookie_type': {0: 'avg(e-f)', 1: 'avg(0-d)'}, 'next_dau': {0: 2284, 1: 2288}}
You can use pandas merge for your use case. More documentation is here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
print(t1)
cookie_type data_date next_dau
0 avg(0-d) 20181006 1
1 avg(e-f) 20181006 2
2 avg(e-f) 20181007 NaN
print(t2)
cookie_type data_date next_dau
0 avg(e-f) 20181006 908
1 avg(0-d) 20181006 904
2 avg(e-f) 20181007 905
result = pd.merge(t1, t2, on=['data_date', 'cookie_type'])
cookie_type data_date next_dau_x next_dau_y
0 avg(0-d) 20181006 1 904
1 avg(e-f) 20181006 2 908
2 avg(e-f) 20181007 NaN 905
Now, to fill only the NaN values while keeping the existing non-NaN ones, you can use where.
result['col'] = result['next_dau_x'].where(result['next_dau_x'].notnull(), result['next_dau_y'])
Now, drop the columns that are no longer needed.
result = result.drop(['next_dau_x','next_dau_y'], axis=1)
cookie_type data_date col
0 avg(0-d) 20181006 1
1 avg(e-f) 20181006 2
2 avg(e-f) 20181007 905
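One caveat with the merge above: pd.merge defaults to an inner join, so any row of t1 without a matching key in t2 would be dropped. A hedged variant that keeps every row of t1 is sketched below. It assumes the same t1 and t2 as in this answer; fill_from and the "_new" suffix are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame({
    "cookie_type": ["avg(0-d)", "avg(e-f)", "avg(e-f)"],
    "data_date": [20181006, 20181006, 20181007],
    "next_dau": [1.0, 2.0, np.nan],
})
t2 = pd.DataFrame({
    "cookie_type": ["avg(e-f)", "avg(0-d)", "avg(e-f)"],
    "data_date": [20181006, 20181006, 20181007],
    "next_dau": [908, 904, 905],
})

def fill_from(left, right, keys, column):
    """Fill NaN in left[column] from right[column], matching rows on keys."""
    merged = left.merge(
        right[keys + [column]],
        on=keys,
        how="left",              # keep every row of left, matched or not
        suffixes=("", "_new"),   # right-hand column becomes column + "_new"
    )
    # fillna touches only the missing entries; existing values are kept
    merged[column] = merged[column].fillna(merged[column + "_new"])
    return merged.drop(columns=column + "_new")

result = fill_from(t1, t2, ["data_date", "cookie_type"], "next_dau")
print(result)
```

This keeps the original column name instead of introducing col, and the same helper could be called again for dau_7 and dau_15.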
In the end I solved this with the help of "tianhua liao":
frame1.index = frame1.data_date.astype(str) + frame1.cookie_type
next_day_dau.index = next_day_dau.data_date.astype(str) + next_day_dau.cookie_type
dau_7.index = dau_7.data_date.astype(str) + dau_7.cookie_type
dau_15.index = dau_15.data_date.astype(str) + dau_15.cookie_type
# Boolean masks: which rows of frame1 have a matching key in each lookup frame
next_day_dau_idx = frame1.index.isin(next_day_dau.index)
dau_7_idx = frame1.index.isin(dau_7.index)
dau_15_idx = frame1.index.isin(dau_15.index)
# Assign only to the matching rows; the right-hand Series is aligned by index label
if any(next_day_dau_idx):
    frame1.loc[next_day_dau_idx, "next_dau"] = next_day_dau.next_dau
if any(dau_7_idx):
    frame1.loc[dau_7_idx, "dau_7"] = dau_7.dau_7
if any(dau_15_idx):
    frame1.loc[dau_15_idx, "dau_15"] = dau_15.dau_15
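For anyone who wants to try the approach above end to end, here is a self-contained sketch for a single column. The sample values are made up to resemble the question, and composite_key is a hypothetical helper, not part of pandas. The explicit reindex is an extra step I added to make the row matching visible. Unlike combine_first, this assignment overwrites whatever is already in the matched rows.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame({
    "data_date": [20181006, 20181006, 20181007],
    "cookie_type": ["avg(0-d)", "avg(e-f)", "avg(e-f)"],
    "dau": [2288, 2284, 2296],
    "next_dau": [np.nan, np.nan, 100.0],
})
next_day_dau = pd.DataFrame({
    "data_date": ["20181006", "20181006"],  # str here, int in frame1
    "cookie_type": ["avg(e-f)", "avg(0-d)"],
    "next_dau": [908, 904],
})

def composite_key(df):
    # astype(str) makes the key identical whether data_date is int or str
    return df["data_date"].astype(str) + df["cookie_type"]

frame1.index = composite_key(frame1)
next_day_dau.index = composite_key(next_day_dau)

# Rows of frame1 whose key also exists in next_day_dau
mask = frame1.index.isin(next_day_dau.index)
if mask.any():
    # Reorder the lookup values to follow frame1's matching rows
    matched = next_day_dau["next_dau"].reindex(frame1.index[mask])
    frame1.loc[mask, "next_dau"] = matched.to_numpy()

print(frame1)
```

The row for 20181007 has no match in next_day_dau, so its existing value is left alone.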