How to learn to rank using Vowpal Wabbit's contextual bandit?

I am using Vowpal Wabbit's contextual bandit to rank various actions for a given context.

Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"

Test Data:
" | 123"

Now, the expected ranking of the actions should be (from least loss to most loss):

7 6 5 4 3 2 1

Using just --cb returns the best action:

7

And using --cb_explore returns the pdf of actions to explore, but it doesn't seem to help with ranking.

[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]
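
For reference, a minimal sketch of how the two runs above can be reproduced from Python (assuming the `vowpalwabbit` bindings with the `Workspace` class; older releases expose the same interface as `pyvw.vw`):

```python
import vowpalwabbit  # pip install vowpalwabbit

train_examples = [
    "1:10:0.1 | 123",
    "2:9:0.1 | 123",
    "3:8:0.1 | 123",
    "4:7:0.1 | 123",
    "5:6:0.1 | 123",
    "6:5:0.1 | 123",
    "7:4:0.1 | 123",
]

# --cb 7: plain contextual bandit over 7 actions; predict() returns a single action
cb = vowpalwabbit.Workspace("--cb 7 --quiet")
for ex in train_examples:
    cb.learn(ex)
print(cb.predict(" | 123"))          # 7, the lowest-cost action, as shown above

# --cb_explore 7: same model plus an exploration strategy (epsilon-greedy by default);
# predict() returns a PMF over the 7 actions instead of a single action
cb_explore = vowpalwabbit.Workspace("--cb_explore 7 --quiet")
for ex in train_examples:
    cb_explore.learn(ex)
print(cb_explore.predict(" | 123"))  # list of 7 probabilities, as shown above
```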

Is there any other way to do ranking using VW's contextual bandit?

Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555

--cb does not do any exploration and just trains the model given the input so the output will be what the model (that has been trained so far) predicted

--cb_explore includes exploration using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here

cb_explore's output is the PMF given by the exploration strategy (see here for more info).

Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e epsilon-greedy will use the so-far trained model to predict the best action (exploitation).

So the output will be the pmf over the actions (prob. 1-e OR e for the chosen action) and then the remaining probability will be equally split between the remaining actions. Therefore cb_explore will not provide you with a ranking.
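
To connect this with the output in the question: with VW's default epsilon of 0.05 and 7 actions, the numbers work out exactly as described (a small arithmetic sketch, not VW code):

```python
epsilon = 0.05       # VW's default for --cb_explore
num_actions = 7

explore_mass = epsilon / num_actions            # uniform exploration mass per action
exploit_mass = (1.0 - epsilon) + explore_mass   # the greedy action also gets the exploit mass

print(explore_mass)  # 0.00714285...  -> the six non-greedy actions
print(exploit_mass)  # 0.95714285...  -> action 7, the model's current best guess
```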

One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
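
For illustration, here is a sketch of what a CCB interaction could look like through the Python bindings. Everything in it (feature names, slot costs, the `--ccb_explore_adf` invocation via `Workspace`) is an assumption made for the example; see the VW documentation on Conditional Contextual Bandits for the authoritative input format:

```python
import vowpalwabbit

# Conditional Contextual Bandit: one CB decision per slot, with every slot drawing
# from the same pool of actions, so the slot-by-slot choices form a ranking.
vw = vowpalwabbit.Workspace("--ccb_explore_adf --quiet")

# Assumed slot label format: <chosen_action>:<cost>:<probability>, where the action
# index is 0-based and refers to the "ccb action" lines above it.
vw.learn([
    "ccb shared |Context 123",
    "ccb action |Action doc_a",
    "ccb action |Action doc_b",
    "ccb action |Action doc_c",
    "ccb slot 0:-1.0:0.5 |Slot position_1",
    "ccb slot 2:0.0:0.5 |Slot position_2",
])

prediction = vw.predict([
    "ccb shared |Context 123",
    "ccb action |Action doc_a",
    "ccb action |Action doc_b",
    "ccb action |Action doc_c",
    "ccb slot |Slot position_1",
    "ccb slot |Slot position_2",
])
print(prediction)  # one list of (action, probability) pairs per slot; the top entry per slot gives the ranking
```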

My follow-up:

I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here) as I don't know if there is a test covering this.
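
To make the cb_explore_adf variant concrete, a hypothetical sketch (again via the Python bindings; the action names and costs are invented for the example):

```python
import vowpalwabbit

# Action-dependent features (ADF): each action gets its own feature line, and the
# <action>:<cost>:<probability> label is attached to the line of the action that was shown.
vw = vowpalwabbit.Workspace("--cb_explore_adf --quiet")

vw.learn([
    "shared | 123",
    "0:4.0:0.1 | item_7",   # the action that was shown, with its observed cost
    " | item_6",
    " | item_5",
])

pmf = vw.predict([
    "shared | 123",
    " | item_7",
    " | item_6",
    " | item_5",
])
print(pmf)  # PMF over the candidate actions (the note above about verifying the ordering applies here)
```

Adding `--audit` to the options prints per-feature details for each prediction, which can help verify how the scores relate to the ordering, as suggested above.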

I would not use the PMF to rank the actions, because the PMF does not correspond to the expected reward of each action given the context (unlike in the traditional multi-armed bandit setting, e.g. with Thompson Sampling, where it does).

A good way to do what you want is to sample multiple actions from the action set without replacement, which is what the CCB submodule does (per Jack's answer).
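
A minimal sketch of that sample-without-replacement idea in plain Python/NumPy (my illustration, not VW's actual CCB implementation), using the PMF from the question:

```python
import numpy as np

def rank_by_sampling(pmf, seed=0):
    """Sample actions one at a time without replacement, renormalising the
    remaining probability mass after each draw."""
    rng = np.random.default_rng(seed)
    pmf = np.asarray(pmf, dtype=float)
    remaining = list(range(len(pmf)))
    ranking = []
    while remaining:
        p = pmf[remaining]
        p = p / p.sum()                        # renormalise over the actions still available
        pick = rng.choice(len(remaining), p=p)
        ranking.append(remaining.pop(pick))
    return ranking

# PMF from the question: action 7 (index 6 here) holds almost all the mass
pmf = [0.0071428571827709675] * 6 + [0.9571428298950195]
print(rank_by_sampling(pmf))  # index 6, i.e. action 7, will almost always be drawn first
```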

I wrote a tutorial and code to illustrate how this can be implemented (with simulated rewards), which may help in analyzing how the PMF gets updated and how the model behaves for the reward distribution and action set you specify.