Left join two pandas dataframes under multiple conditions
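I have two dataframes. One contains the search queries of users in a web shop (102377 rows), and the other contains the users' clicks out of the search results (8004 rows).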
queries:
index term timestamp
...
10 tight 2018-09-27 20:09:23
11 differential pressure 2018-09-27 20:09:30
12 soot pump 2018-09-27 20:09:32
13 gas pressure 2018-09-27 20:09:46
14 case 2018-09-27 20:11:29
15 backpack 2018-09-27 20:18:35
...
clicks:
index term timestamp artnr
...
245 soot pump 2018-09-27 20:09:25 9150.0
246 dungarees 2018-09-27 20:10:38 7228.0
247 db23 2018-09-27 20:10:40 7966.0
248 db23 2018-09-27 20:10:55 7971.0
249 sealing blister 2018-09-27 20:12:05 7971.0
250 backpack 2018-09-27 20:18:40 8739.0
...
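What I want to do is join the clicks to the queries. If queries.term equals clicks.term and the difference clicks.timestamp - queries.timestamp is less than 10 seconds and greater than 0 seconds, the term in the queries dataframe should be replaced with the artnr from the clicks dataframe, so that it looks like: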
queries:
index term timestamp
...
10 tight 2018-09-27 20:09:23
11 differential pressure 2018-09-27 20:09:30
12 9150.0 2018-09-27 20:09:32
13 gas pressure 2018-09-27 20:09:46
14 case 2018-09-27 20:11:29
15 8739.0 2018-09-27 20:18:35
...
My first approach was the following:
df_Q['term'] = np.where(((((df_CS.timestamp-df_Q.timestamp).dt.total_seconds() <= 10.0) &
(df_CS.timestamp-df_Q.timestamp).dt.total_seconds() >= 0) &
(df_CS.term.str == df_Q.term.str)), df_CS['artnr'], df_CS['term'])
But this only produced the following error:
ValueError: operands could not be broadcast together with shapes (102377,) (8004,) (8004,)
Does anybody know how to solve this with a left join or some other approach?
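The shapes in the error message suggest what went wrong: the condition passed to np.where has one entry per query row, while both value arguments come from the shorter clicks dataframe, so NumPy cannot line them up element by element. A minimal sketch with toy arrays (the names cond, artnr and fallback are illustrative, not from the original code) reproduces the same kind of failure:

```python
import numpy as np

# Toy stand-ins: a mask as long as the "queries" side ...
cond = np.array([True, False, True, False, True, False])  # shape (6,)
# ... and two value arrays as long as the "clicks" side.
artnr = np.array([9150.0, 7228.0, 7966.0])                # shape (3,)
fallback = np.array([0.0, 0.0, 0.0])                      # shape (3,)

try:
    np.where(cond, artnr, fallback)
    raised = False
    message = ""
except ValueError as exc:
    # np.where needs all three arguments to broadcast to a common shape.
    raised = True
    message = str(exc)

print(raised, message)
```

In other words, np.where works row by row on arrays of compatible length; it does not search one table for matching rows in another, which is what a join does.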
import pandas as pd

queries = pd.DataFrame({'term': ['tight', 'differential pressure', 'soot pump', 'gas pressure', 'case', 'backpack'],
                        'timestamp': ['2018-09-27 20:09:23', '2018-09-27 20:09:30', '2018-09-27 20:09:32', '2018-09-27 20:09:46', '2018-09-27 20:11:29', '2018-09-27 20:18:35']})
print(queries)
term timestamp
0 tight 2018-09-27 20:09:23
1 differential pressure 2018-09-27 20:09:30
2 soot pump 2018-09-27 20:09:32
3 gas pressure 2018-09-27 20:09:46
4 case 2018-09-27 20:11:29
5 backpack 2018-09-27 20:18:35
clicks = pd.DataFrame({'term': ['soot pump', 'dungarees', 'db23', 'db23', 'sealing blister', 'backpack'],
                       'timestamp': ['2018-09-27 20:09:25', '2018-09-27 20:10:38', '2018-09-27 20:10:40', '2018-09-27 20:10:55', '2018-09-27 20:12:05', '2018-09-27 20:18:40'],
                       'artnr': [9150.0, 7228.0, 7966.0, 7971.0, 7971.0, 8739.0]})
print(clicks)
term timestamp artnr
0 soot pump 2018-09-27 20:09:25 9150.0
1 dungarees 2018-09-27 20:10:38 7228.0
2 db23 2018-09-27 20:10:40 7966.0
3 db23 2018-09-27 20:10:55 7971.0
4 sealing blister 2018-09-27 20:12:05 7971.0
5 backpack 2018-09-27 20:18:40 8739.0
First, convert the timestamps to datetime and sort both dataframes by timestamp:
queries['timestamp'] = pd.to_datetime(queries['timestamp'])
clicks['timestamp'] = pd.to_datetime(clicks['timestamp'])
queries.sort_values('timestamp', ascending=True, inplace=True)
clicks.sort_values('timestamp', ascending=True, inplace=True)
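The sort is not optional: pd.merge_asof() expects the `on` key to be sorted in both inputs. A small sketch with made-up integer keys (the columns t, q and v are hypothetical) shows what happens without it:

```python
import pandas as pd

left = pd.DataFrame({'t': [3, 1, 2], 'q': ['c', 'a', 'b']})  # deliberately unsorted
right = pd.DataFrame({'t': [1, 2, 3], 'v': [10, 20, 30]})

try:
    pd.merge_asof(left, right, on='t')
    unsorted_ok = True
except ValueError:
    # pandas rejects unsorted merge keys instead of silently mis-matching rows
    unsorted_ok = False

# Sorting the left side first makes the same call succeed.
merged = pd.merge_asof(left.sort_values('t'), right, on='t')
print(unsorted_ok)
print(merged)
```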
Then use pd.merge_asof() to join on the 'term' column, matching only when the click's 'timestamp' falls within 10 seconds after the query's:
df = pd.merge_asof(
    queries,                        # left data
    clicks,                         # right data
    on="timestamp",                 # column used to compare time differences
    by="term",                      # column that must match exactly
    tolerance=pd.Timedelta("10s"),  # maximum allowed time difference
    direction='forward',            # match only clicks at or after the query timestamp
)
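One detail worth checking against the question: it asks for a difference greater than 0 seconds, while merge_asof() by default also accepts a click with exactly the same timestamp as the query. The allow_exact_matches parameter controls this. The following sketch uses invented toy data (terms a, b, c and the artnr values are placeholders) to show the difference:

```python
import pandas as pd

left = pd.DataFrame({
    'term': ['a', 'b', 'c'],
    'timestamp': pd.to_datetime(['2018-09-27 20:00:00',
                                 '2018-09-27 20:01:00',
                                 '2018-09-27 20:02:00']),
})
right = pd.DataFrame({
    'term': ['a', 'b', 'c'],
    'timestamp': pd.to_datetime(['2018-09-27 20:00:00',   # 0 s after query 'a'
                                 '2018-09-27 20:01:05',   # 5 s after query 'b'
                                 '2018-09-27 20:02:11']), # 11 s after query 'c'
    'artnr': [1.0, 2.0, 3.0],
})

common = dict(on='timestamp', by='term',
              tolerance=pd.Timedelta('10s'), direction='forward')

# Default behaviour: a click at exactly the query time counts as a match.
inclusive = pd.merge_asof(left, right, **common)

# Excluding exact matches enforces "strictly after the query".
strict = pd.merge_asof(left, right, allow_exact_matches=False, **common)

print(inclusive)
print(strict)
```

With this data, only term 'b' keeps its artnr in the strict variant; 'a' is dropped as an exact match and 'c' lies outside the tolerance in both cases.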
If no match is found, the 'artnr' column is NaN. So use the non-NA values of 'artnr' to replace 'term':
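# Use .loc rather than chained indexing so the assignment reliably modifies df itself
mask = df['artnr'].notna()
df.loc[mask, 'term'] = df.loc[mask, 'artnr']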
print(df)
term timestamp artnr
0 tight 2018-09-27 20:09:23 NaN
1 differential pressure 2018-09-27 20:09:30 NaN
2 soot pump 2018-09-27 20:09:32 NaN
3 gas pressure 2018-09-27 20:09:46 NaN
4 case 2018-09-27 20:11:29 NaN
5 8739 2018-09-27 20:18:35 8739.0
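Note that 'soot pump' stays unchanged here because, in the sample data, its click (20:09:25) happens before the query (20:09:32), so the "click after query" condition is not met. For comparison, the literal "left join under multiple conditions" can also be sketched as a plain merge on 'term' followed by a filter on the time difference. This is only an illustrative alternative, not the method used above; the helper names q_idx, pairs and first_click are made up, and the intermediate join could grow large if the same term occurs many times on both sides:

```python
import pandas as pd

queries = pd.DataFrame({
    'term': ['tight', 'differential pressure', 'soot pump',
             'gas pressure', 'case', 'backpack'],
    'timestamp': pd.to_datetime([
        '2018-09-27 20:09:23', '2018-09-27 20:09:30', '2018-09-27 20:09:32',
        '2018-09-27 20:09:46', '2018-09-27 20:11:29', '2018-09-27 20:18:35']),
})
clicks = pd.DataFrame({
    'term': ['soot pump', 'dungarees', 'db23', 'db23',
             'sealing blister', 'backpack'],
    'timestamp': pd.to_datetime([
        '2018-09-27 20:09:25', '2018-09-27 20:10:38', '2018-09-27 20:10:40',
        '2018-09-27 20:10:55', '2018-09-27 20:12:05', '2018-09-27 20:18:40']),
    'artnr': [9150.0, 7228.0, 7966.0, 7971.0, 7971.0, 8739.0],
})

# Keep each query's row label so matches can be mapped back afterwards.
q = queries.reset_index().rename(columns={'index': 'q_idx'})

# Condition 1: equal terms (every query/click pair sharing a term).
pairs = q.merge(clicks, on='term', how='inner', suffixes=('_query', '_click'))

# Condition 2: click strictly after the query and less than 10 seconds later.
delta = (pairs['timestamp_click'] - pairs['timestamp_query']).dt.total_seconds()
pairs = pairs[(delta > 0) & (delta < 10)]

# If several clicks qualify for one query, keep the earliest one.
first_click = (pairs.sort_values('timestamp_click')
                    .drop_duplicates('q_idx')
                    .set_index('q_idx')['artnr'])

result = queries.copy()
result['artnr'] = first_click  # aligned on the index; unmatched rows become NaN

matched = result['artnr'].notna()
result['term'] = result['term'].astype(object)  # column will hold text and numbers
result.loc[matched, 'term'] = result.loc[matched, 'artnr']
print(result)
```

On the sample data this replaces only the 'backpack' query, the same row that the merge_asof() version matched.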