Merging intervals and timestamps dataframes
I have a table that contains intervals:
import numpy as np
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
I have another table that contains timestamps and values:
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})
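I need to create a dataframe the same size as dfa, with added columns that contain, for each interval, the result of some aggregation of ValueA/ValueB over all rows of dfb whose Timestamp lies between Start and Stop.
So here, if I define my aggregation as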
{'ValueA':[np.nanmean,np.nanmin],
'ValueB':[np.nanmax]}
my desired output would be:
 ValueA  ValueA  ValueB
nanmean  nanmin  nanmax  Start  Stop
    nan     nan     nan      0   100
      8       1      21    101   200
    nan     nan     nan    666  1000
Use merge to build a cross join, with helper columns created by assign:
d = {'ValueA':[np.nanmean,np.nanmin],
'ValueB':[np.nanmax]}
df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
Then filter by Start and Stop and aggregate with the dictionary:
df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
.groupby(['Start','Stop']).agg(d))
Flatten the MultiIndex in the columns with map and join:
df.columns = df.columns.map('_'.join)
print (df)
            ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
Start Stop
101   200                8              1             21
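To make the flattening step easier to follow in isolation, here is a minimal sketch on a hand-built MultiIndex that mirrors the aggregated columns. The variable names `cols`, `flat`, and `flat_safe` are illustrative only. `'_'.join` works because every level value is a string; the second form is a hedge for cases where a level might hold non-string labels.

```python
import pandas as pd

# Two-level column index, like the one produced by .agg() with a dict of lists
cols = pd.MultiIndex.from_tuples(
    [('ValueA', 'nanmean'), ('ValueA', 'nanmin'), ('ValueB', 'nanmax')]
)

# map() calls '_'.join on each tuple, e.g. ('ValueA', 'nanmean') -> 'ValueA_nanmean'
flat = cols.map('_'.join)

# Variant that converts each level value to str first (safe for non-string labels)
flat_safe = ['_'.join(map(str, tup)) for tup in cols]

print(list(flat))
```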
Finally, join back to the original dataframe:
df = dfa.join(df, on=['Start','Stop'])
print (df)
   Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
0      0   100             NaN            NaN            NaN
1    101   200             8.0            1.0           21.0
2    666  1000             NaN            NaN            NaN
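The steps above can also be collected into one function. This is only a sketch under a few assumptions: it uses `how='cross'` (available in pandas 1.2 and later) instead of the helper column, and it swaps the NumPy functions for the string aggregations `'mean'`, `'min'`, and `'max'`, which skip NaN by default, so the columns end in `_mean`, `_min`, and `_max` rather than `_nanmean` etc. The function name `aggregate_by_interval` is hypothetical.

```python
import pandas as pd

def aggregate_by_interval(intervals, events, aggs):
    """Cross-join intervals with events, keep matching pairs, aggregate per interval."""
    # Pair every interval with every event (len(intervals) * len(events) rows)
    pairs = intervals.merge(events, how='cross')
    # Keep pairs where Start <= Timestamp <= Stop (both ends inclusive)
    inside = pairs['Timestamp'].between(pairs['Start'], pairs['Stop'])
    out = pairs[inside].groupby(['Start', 'Stop']).agg(aggs)
    # Flatten ('ValueA', 'mean') -> 'ValueA_mean'
    out.columns = out.columns.map('_'.join)
    # Left join keeps intervals with no matching event as NaN rows
    return intervals.join(out, on=['Start', 'Stop'])

dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
                    'ValueB': [1, 2, 21]})

result = aggregate_by_interval(dfa, dfb, {'ValueA': ['mean', 'min'], 'ValueB': ['max']})
print(result)
```

Note that the cross join materializes every interval/event pair, so memory grows with the product of the two table sizes.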
EDIT:
A solution with cut:
d = {'ValueA':[np.nanmean,np.nanmin],
'ValueB':[np.nanmax]}
# reset to a default RangeIndex if dfa does not already have one
dfa = dfa.reset_index(drop=True)
print (dfa)
   Start  Stop
0      0   100
1    101   200
2    666  1000
# bin edges: the first Start value followed by all Stop values
bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
print (bins)
[ 0 100 200 1000]
#binning
dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels = dfa.index)
print (dfb)
   Timestamp  ValueA  ValueB id
0        102       1       1  1
1        145       2       2  1
2        113      21      21  1
#aggregate and flatten
df = dfb.groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
#add to dfa
df = pd.concat([dfa, df], axis=1)
print (df)
   Start  Stop  ValueA_nanmean  ValueA_nanmin  ValueB_nanmax
0      0   100             NaN            NaN            NaN
1    101   200             8.0            1.0           21.0
2    666  1000             NaN            NaN            NaN
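One caveat with the cut approach: the bins [0, 100, 200, 1000] treat the intervals as contiguous, so a timestamp in the gap between 200 and 666 would land in the last bin, and by default pd.cut uses right-closed bins, so a timestamp equal to the first Start would not match. A possible alternative, sketched here under the assumption that the intervals do not overlap, is pd.IntervalIndex with get_indexer. The extra row with Timestamp 300 is hypothetical data added only to show the gap handling, and the string aggregations stand in for the NumPy functions.

```python
import pandas as pd

dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
# Last row (Timestamp 300) is a made-up value that falls in no interval
dfb = pd.DataFrame({'Timestamp': [102, 145, 113, 300],
                    'ValueA': [1, 2, 21, 5], 'ValueB': [1, 2, 21, 5]})

# closed='both' makes Start and Stop inclusive, like the >= / <= filter
iv = pd.IntervalIndex.from_arrays(dfa['Start'], dfa['Stop'], closed='both')

# Position of the containing interval for each timestamp, -1 if there is none
pos = iv.get_indexer(dfb['Timestamp'])
hit = pos >= 0

# Label matched events with the index of their interval in dfa
matched = dfb.loc[hit].assign(id=dfa.index.to_numpy()[pos[hit]])

out = matched.groupby('id').agg({'ValueA': ['mean', 'min'], 'ValueB': ['max']})
out.columns = out.columns.map('_'.join)

# Join on the index so intervals without events stay as NaN rows
result = dfa.join(out)
print(result)
```

get_indexer raises an error for overlapping intervals, in which case the cross-join approach is the safer choice.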