How to correctly match pandas multiindex dataframe multiplication for sparse data
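I searched before posting and found a related post, among others, but I don't think it answers my question.
I want to multiply sparse data with correct index matching, where the data has a multi-level index.
I have observations of different attributes for multiple element_id values on different dates, but the data is sparse:
This is my second array, df_weight_at_date, a list of weights for each element_id (the Python that creates it is at the bottom of the post)
For each date I want to multiply the values, so for example in my observed data A/1/2021-01-15 (0.87) should be multiplied by the weight for 1/2021-01-15 (0.3) to give the value 0.261
If either value is NaN, the result should be NaN, and the output frame should have the same shape as the df_observations dataframe.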
I tried using .multiply but got the error ValueError: cannot join with no overlapping index names
df_observations.multiply(df_weight_at_date.unstack())
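A likely contributor to this error, sketched here under assumptions: DataFrame.multiply defaults to axis='columns', so a Series passed without axis=0 is matched against the column labels (factor_id) rather than the row MultiIndex. The frames below are a small subset of the values from the post, and the names obs, w, default_axis_error and by_rows are illustrative only. Whether the first call raises may depend on the pandas version, so the sketch catches the error instead of relying on it.

```python
import pandas as pd

# Two elements from the post's 2021-01-15 observations, same layout as df_observations
obs = pd.DataFrame({
    "observed_date": ["2021-01-15"] * 4,
    "element_id": [1, 3, 1, 3],
    "factor_id": ["A", "A", "B", "B"],
    "observation": [0.87, 0.15, 0.91, 0.38],
}).pivot(index=["observed_date", "element_id"], columns="factor_id", values="observation")

# Matching weights as a Series indexed by (observed_date, element_id)
w = pd.Series(
    [0.3, 0.35],
    index=pd.MultiIndex.from_tuples(
        [("2021-01-15", 1), ("2021-01-15", 3)],
        names=["observed_date", "element_id"],
    ),
    name="weight",
)

# Default axis="columns": the Series index is matched against the columns,
# whose name (factor_id) is not one of the Series index level names.
try:
    obs.multiply(w)
    default_axis_error = None
except ValueError as exc:
    default_axis_error = str(exc)
print(default_axis_error)

# axis=0: the Series index is matched against the row MultiIndex instead.
by_rows = obs.multiply(w, axis=0)
print(by_rows)
```

With axis=0 the A value for element 1 comes out as 0.87 * 0.3, the 0.261 the question asks for.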
Expected output for this data
I'm a bit of a newbie, so any pointers would be appreciated. Thanks.
Code to create the dataframes
import pandas as pd
df_observations=pd.DataFrame({'observed_date':['2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-15','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16','2021-01-16'],
'element_id':[1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,2,3,4,5,6,7,3,2,3,4,5,6,7],
'factor_id':['A','A','A','A','A','A','A','B','B','B','B','B','B','B','C','C','C','A','A','A','A','A','A','F','F','B','B','B','B','B'],
'observation':[0.87,0.84,0.15,0.6,0.17,0.76,0.03,0.91,0.05,0.38,0.06,0.27,0.92,0.27,0.16,0.71,0.32,0.92,0.88,0.53,0.79,0.15,0.3,0.16,0.36,0.05,0.22,0.73,0.7,0.9]}).pivot(index=['observed_date','element_id'], columns='factor_id', values='observation')
df_weight_at_date=pd.DataFrame({'observed_date':['2021-01-15','2021-01-15','2021-01-15',
'2021-01-16','2021-01-17','2021-01-18',
'2021-01-19','2021-01-20','2021-01-18'
],
'element_id':[1,3,5,1,3,5,1,3,9],
'weight':[0.3,0.35,0.35,1,1,0.4,1,1,0.6]}).pivot(index=['element_id'], columns='observed_date', values='weight')
After correcting the input frames so that the index names match (observation_date -> observed_date), this now works and I think it is concise enough
df_observations.multiply(df_weight_at_date.unstack(), axis=0)
Result
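One thing worth checking with this call is the shape of the result. As far as I know, multiplying a DataFrame by a Series with axis=0 aligns on the union of the two row indexes, so rows that exist only in the unstacked weights come back as all-NaN rows. A minimal sketch on a two-row subset of the post's values (obs, weights, product, extra_rows and trimmed are illustrative names, not from the post):

```python
import pandas as pd

# Two observations for factor A on 2021-01-15
obs = pd.DataFrame({
    "observed_date": ["2021-01-15", "2021-01-15"],
    "element_id": [1, 3],
    "factor_id": ["A", "A"],
    "observation": [0.87, 0.15],
}).pivot(index=["observed_date", "element_id"], columns="factor_id", values="observation")

# Weights include a date (2021-01-16) that has no observations here
weights = pd.DataFrame({
    "observed_date": ["2021-01-15", "2021-01-15", "2021-01-16"],
    "element_id": [1, 3, 1],
    "weight": [0.3, 0.35, 1.0],
}).pivot(index="element_id", columns="observed_date", values="weight")

# unstack() yields every (observed_date, element_id) pair: 2 dates x 2 elements
w = weights.unstack()
product = obs.multiply(w, axis=0)

# Rows present only on the weight side
extra_rows = product.index.difference(obs.index)
print(len(product), len(extra_rows))

# Keep only the observation rows to recover the shape of the observations frame
trimmed = product.reindex(obs.index)
print(trimmed)
```

If the output must have exactly the shape of df_observations, the final reindex (or reindexing the weights before multiplying) is one way to get there.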
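You can try unstacking df_weight_at_date, filling missing weights with 1 so that observations without a weight pass through unchanged: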
df_observations.mul(df_weight_at_date.unstack().fillna(1)
.reindex(df_observations.index, fill_value=1),
axis=0
)
Output:
factor_id                      A       B      C     F
observed_date element_id
2021-01-15    1           0.2610  0.2730  0.048   NaN
              2           0.8400  0.0500  0.710   NaN
              3           0.0525  0.1330  0.112   NaN
              4           0.6000  0.0600    NaN   NaN
              5           0.0595  0.0945    NaN   NaN
              6           0.7600  0.9200    NaN   NaN
              7           0.0300  0.2700    NaN   NaN
2021-01-16    2           0.9200     NaN    NaN  0.36
              3           0.8800  0.0500    NaN  0.16
              4           0.5300  0.2200    NaN   NaN
              5           0.7900  0.7300    NaN   NaN
              6           0.1500  0.7000    NaN   NaN
              7           0.3000  0.9000    NaN   NaN
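Note that in this output a missing weight behaves like 1 (for example element 2 keeps 0.84), whereas the question asks for NaN when either value is missing. A hedged sketch of both treatments side by side, on a small subset of the post's values; the names obs, weights, as_one and as_nan are illustrative, and dropping fillna(1) and fill_value=1 is my assumption about how to get the NaN behaviour, not something stated in the answer:

```python
import pandas as pd

# Elements 1 and 2 on 2021-01-15, factors A and B (values from the post)
obs = pd.DataFrame({
    "observed_date": ["2021-01-15"] * 4,
    "element_id": [1, 2, 1, 2],
    "factor_id": ["A", "A", "B", "B"],
    "observation": [0.87, 0.84, 0.91, 0.05],
}).pivot(index=["observed_date", "element_id"], columns="factor_id", values="observation")

# Weights exist for elements 1 and 3 only, so element 2 has no weight
weights = pd.DataFrame({
    "observed_date": ["2021-01-15", "2021-01-15"],
    "element_id": [1, 3],
    "weight": [0.3, 0.35],
}).pivot(index="element_id", columns="observed_date", values="weight")

w = weights.unstack()  # Series indexed by (observed_date, element_id)

# As in the answer: a missing weight is treated as 1
as_one = obs.mul(w.fillna(1).reindex(obs.index, fill_value=1), axis=0)

# As described in the question: a missing weight leaves NaN in the result
as_nan = obs.mul(w.reindex(obs.index), axis=0)

print(as_one)
print(as_nan)
```

Both results keep the row index of the observations, because the weights are reindexed to obs.index before multiplying.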
This should also work:
df_weight_at_date.stack().swaplevel().to_frame('A').reindex(df_observations.columns,axis=1).ffill(axis=1).mul(df_observations)