How to learn to rank using Vowpal Wabbit's contextual bandit?
I am using Vowpal Wabbit's contextual bandits to rank various actions for a given context.
Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"
Test Data:
" | 123"
Now, the expected ranking of the actions should be (from least loss to most loss):
7 6 5 4 3 2 1
Using just --cb returns the best action:
7
And using --cb_explore returns the PMF over the actions to explore, but it does not seem to help with ranking:
[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]
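For reference, the runs above can be reproduced roughly as follows (a minimal sketch using the vowpalwabbit Python bindings; the constructor name and the exact return types may differ between versions):

from vowpalwabbit import pyvw

train_examples = [
    "1:10:0.1 | 123",
    "2:9:0.1 | 123",
    "3:8:0.1 | 123",
    "4:7:0.1 | 123",
    "5:6:0.1 | 123",
    "6:5:0.1 | 123",
    "7:4:0.1 | 123",
]

# --cb trains without exploration; predict() returns only the single best action.
vw_cb = pyvw.vw("--cb 7 --quiet")
for ex in train_examples:
    vw_cb.learn(ex)
print(vw_cb.predict(" | 123"))        # e.g. 7

# --cb_explore adds epsilon-greedy exploration; predict() returns a PMF over the 7 actions.
vw_explore = pyvw.vw("--cb_explore 7 --quiet")
for ex in train_examples:
    vw_explore.learn(ex)
print(vw_explore.predict(" | 123"))   # e.g. [0.00714..., ..., 0.95714...]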
Is there any other way to use VW's contextual bandits for ranking?
Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555
--cb does not do any exploration and just trains the model given the input, so the output will be what the model (that has been trained so far) predicted.

--cb_explore includes exploration, using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here.

cb_explore's output is the PMF given by the exploration strategy (see here for more info).

Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e it will use the so-far trained model to predict the best action (exploitation).

So the output will be the PMF over the actions (prob. 1-e or e for the chosen action), and then the remaining probability will be equally split between the remaining actions. Therefore cb_explore will not provide you with a ranking.
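To make the arithmetic behind the PMF above concrete (a sketch assuming the default epsilon of 0.05 and 7 actions, which matches the numbers in the question; note that the predicted best action also receives its share of the uniform exploration mass):

epsilon, k = 0.05, 7                  # exploration rate and number of actions
p_other = epsilon / k                 # 0.00714... for each non-best action
p_best = 1 - epsilon + epsilon / k    # 0.95714... for the predicted best action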
One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
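For illustration, a CCB example in VW's text format might look like the sketch below (the namespaces, feature names, costs, and probabilities are made up; the slot label is assumed to follow the chosen_action:cost:probability format, and depending on the bindings version you may need to vw.parse() the lines first):

from vowpalwabbit import pyvw

# One shared (context) line, one line per candidate action, one line per slot.
ccb_train = [
    "ccb shared |Context 123",
    "ccb action |Action a1",
    "ccb action |Action a2",
    "ccb action |Action a3",
    "ccb slot 0:4:0.5 |Slot s1",   # slot 1 feedback: action index 0, cost 4, probability 0.5
    "ccb slot 1:6:0.5 |Slot s2",   # slot 2 feedback: action index 1, cost 6, probability 0.5
]

vw_ccb = pyvw.vw("--ccb_explore_adf --quiet")
vw_ccb.learn(ccb_train)
# Predicting on the same structure with unlabeled slots returns, for each slot, a chosen
# action (with its probability); taken together, the slots form a ranking.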
My follow-up:
I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score, so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here), as I don't know if there is a test covering this.
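For instance, one way to check this (again a sketch; the ADF feature names are placeholders, and you may need vw.parse() depending on the bindings version) is to train with --cb_explore_adf, print the returned PMF, and add --audit to the argument string to inspect the per-action scores behind it:

from vowpalwabbit import pyvw

# In ADF format each action has its own line, so PMF entries line up with the action lines.
vw_adf = pyvw.vw("--cb_explore_adf --quiet")

train = [
    "shared | 123",
    "0:10:0.1 | action_1",   # observed: action_1 was chosen, cost 10, probability 0.1
    "| action_2",
    "| action_3",
]
vw_adf.learn(train)

test = [
    "shared | 123",
    "| action_1",
    "| action_2",
    "| action_3",
]
print(vw_adf.predict(test))  # PMF over the three actions, in the order of the action lines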
I would not use the PMF to rank the actions, since the PMF does not correspond to the expected reward of each action given the context (unlike in the traditional multi-armed bandit setting, e.g. with Thompson Sampling, where it does).

A good way to do what you want is to sample multiple actions from the action set without replacement, which is what the CCB submodule does (Jack's answer).

I wrote a tutorial and code to illustrate how to implement this (using simulated rewards), which may help with analyzing how the PMF gets updated and how the model behaves for the reward distribution and action set you specify.
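To illustrate the sampling-without-replacement idea on a single PMF (e.g. one returned by cb_explore_adf), here is a minimal numpy-based sketch that draws a ranking from a given PMF (function and variable names are illustrative):

import numpy as np

def rank_by_sampling(pmf, rng=None):
    # Draw actions one at a time without replacement, proportionally to the PMF.
    rng = rng or np.random.default_rng()
    pmf = np.asarray(pmf, dtype=float)
    remaining = list(range(len(pmf)))
    ranking = []
    while remaining:
        probs = pmf[remaining]
        probs = probs / probs.sum()               # renormalize over the actions still left
        idx = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(idx))
    return ranking

# Example with the PMF from the question (action 7 is index 6 here):
print(rank_by_sampling([0.00714] * 6 + [0.95714]))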