Vowpal Wabbit: question on training a contextual bandit on historical data

I know from this page that there is an option to train a contextual bandit VW model on historical contextual bandit data that was collected using some exploration policy:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration, it assumes it can only use the currently available data logged using an exploration policy.

This is done by specifying --cb and passing data formatted as action:cost:probability | features:

1:2:0.4 | a c  
3:0.5:0.2 | b d  
4:1.2:0.5 | a b c  
2:1:0.3 | b c  
3:1.5:0.7 | a d 
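
For concreteness, a minimal sketch of training on data in this format: cb_logged.dat, cb.model, new_contexts.dat and preds.txt are placeholder names, and 4 is the number of possible actions in the example data.

vw -d cb_logged.dat --cb 4 -f cb.model                # learn a policy from the logged bandit data
vw -d new_contexts.dat -t -i cb.model -p preds.txt    # test-only pass: write the chosen action for each context to preds.txt

Here new_contexts.dat would contain feature-only lines such as "| a c".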

My question is, is there a way to leverage historical data that was not based on a contextual bandit policy, using --cb (or some other method) together with some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory (edit: biased) heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).

I've tried an approach where I use an exploratory method and assume that the historical data is fully labelled (assigning a reward of zero for unknown rewards), but the PMF seems to collapse to zero over most actions.

Yes, set the probability to 1. There are no theoretical guarantees with a degenerate logging policy, but in practice this can help with initialization. Going forward, you will want some uncertainty in your logging policy, otherwise you will never improve.
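
To illustrate with the example format above (a sketch, not data from the thread): for deterministically logged events you keep the same action:cost:probability layout and simply record 1.0 as the probability on every line:

1:2:1.0 | a c
3:0.5:1.0 | b d
2:1:1.0 | b c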

If you really do have fully labelled historical data, you can use the warm start functionality. If you are just pretending you have fully labelled data, I'm not sure it is any better than setting the probability to 1.
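
If memory serves, the warm start functionality is exposed via the --warm_cb reduction, which consumes fully labelled multiclass examples for an initial supervised phase and then switches to simulated bandit feedback. A rough sketch (flag names from memory, so verify against vw --help for your version):

vw -d fully_labelled.dat --warm_cb 4 --warm_start 50 --interaction 1000

Here fully_labelled.dat is a placeholder for a file of multiclass examples (label | features), --warm_start is the number of supervised examples used to warm-start the policy, and --interaction is the number of subsequent bandit-feedback interactions.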