How to learn to rank using Vowpal Wabbit's contextual bandit?
I am using Vowpal Wabbit's contextual bandits to rank various actions for a given context.
Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"
Test Data:
" | 123"
Now, the expected ranking of the actions should be (from least loss to most loss):
7 6 5 4 3 2 1
Using just --cb returns the best action:
7
And using --cb_explore returns the PMF over the actions to explore, but it does not seem to help with ranking:
[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]
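For reference, the runs above can be reproduced roughly as follows (a minimal sketch using the vowpalwabbit Python bindings; the constructor name and the exact return types may differ between versions):

from vowpalwabbit import pyvw

train_examples = [
    "1:10:0.1 | 123",
    "2:9:0.1 | 123",
    "3:8:0.1 | 123",
    "4:7:0.1 | 123",
    "5:6:0.1 | 123",
    "6:5:0.1 | 123",
    "7:4:0.1 | 123",
]

# --cb trains without exploration; predict() returns only the single best action.
vw_cb = pyvw.vw("--cb 7 --quiet")
for ex in train_examples:
    vw_cb.learn(ex)
print(vw_cb.predict(" | 123"))        # e.g. 7

# --cb_explore adds epsilon-greedy exploration; predict() returns a PMF over the 7 actions.
vw_explore = pyvw.vw("--cb_explore 7 --quiet")
for ex in train_examples:
    vw_explore.learn(ex)
print(vw_explore.predict(" | 123"))   # e.g. [0.00714..., ..., 0.95714...]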
Is there any other way to use VW's contextual bandits for ranking?
Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555
--cb does not do any exploration and just trains the model given the input, so the output will be what the model (that has been trained so far) predicted.

--cb_explore includes exploration, using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here.

cb_explore's output is the PMF given by the exploration strategy (see here for more info).

Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e it will use the so-far trained model to predict the best action (exploitation).

So the output will be the PMF over the actions (prob. 1-e or e for the chosen action), and then the remaining probability will be equally split between the remaining actions. Therefore cb_explore will not provide you with a ranking.
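To make the arithmetic behind the PMF above concrete (a sketch assuming the default epsilon of 0.05 and 7 actions, which matches the numbers in the question; note that the predicted best action also receives its share of the uniform exploration mass):

epsilon, k = 0.05, 7                  # exploration rate and number of actions
p_other = epsilon / k                 # 0.00714... for each non-best action
p_best = 1 - epsilon + epsilon / k    # 0.95714... for the predicted best action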
One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
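For illustration, a CCB example in VW's text format might look like the sketch below (the namespaces, feature names, costs, and probabilities are made up; the slot label is assumed to follow the chosen_action:cost:probability format, and depending on the bindings version you may need to vw.parse() the lines first):

from vowpalwabbit import pyvw

# One shared (context) line, one line per candidate action, one line per slot.
ccb_train = [
    "ccb shared |Context 123",
    "ccb action |Action a1",
    "ccb action |Action a2",
    "ccb action |Action a3",
    "ccb slot 0:4:0.5 |Slot s1",   # slot 1 feedback: action index 0, cost 4, probability 0.5
    "ccb slot 1:6:0.5 |Slot s2",   # slot 2 feedback: action index 1, cost 6, probability 0.5
]

vw_ccb = pyvw.vw("--ccb_explore_adf --quiet")
vw_ccb.learn(ccb_train)
# Predicting on the same structure with unlabeled slots returns, for each slot, a chosen
# action (with its probability); taken together, the slots form a ranking.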
My follow-up:
I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score, so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here), as I don't know if there is a test covering this.
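For instance, one way to check this (again a sketch; the ADF feature names are placeholders, and you may need vw.parse() depending on the bindings version) is to train with --cb_explore_adf, print the returned PMF, and add --audit to the argument string to inspect the per-action scores behind it:

from vowpalwabbit import pyvw

# In ADF format each action has its own line, so PMF entries line up with the action lines.
vw_adf = pyvw.vw("--cb_explore_adf --quiet")

train = [
    "shared | 123",
    "0:10:0.1 | action_1",   # observed: action_1 was chosen, cost 10, probability 0.1
    "| action_2",
    "| action_3",
]
vw_adf.learn(train)

test = [
    "shared | 123",
    "| action_1",
    "| action_2",
    "| action_3",
]
print(vw_adf.predict(test))  # PMF over the three actions, in the order of the action lines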
I would not use the PMF to rank the actions, since the PMF does not correspond to the expected reward of each action given the context (unlike in the traditional multi-armed bandit setting, e.g. with Thompson Sampling, where it does).

A good way to do what you want is to sample multiple actions from the action set without replacement, which is what the CCB submodule does (Jack's answer).

I wrote a tutorial and code to illustrate how to implement this (using simulated rewards), which may help with analyzing how the PMF gets updated and how the model behaves for the reward distribution and action set you specify.
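To illustrate the sampling-without-replacement idea on a single PMF (e.g. one returned by cb_explore_adf), here is a minimal numpy-based sketch that draws a ranking from a given PMF (function and variable names are illustrative):

import numpy as np

def rank_by_sampling(pmf, rng=None):
    # Draw actions one at a time without replacement, proportionally to the PMF.
    rng = rng or np.random.default_rng()
    pmf = np.asarray(pmf, dtype=float)
    remaining = list(range(len(pmf)))
    ranking = []
    while remaining:
        probs = pmf[remaining]
        probs = probs / probs.sum()               # renormalize over the actions still left
        idx = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(idx))
    return ranking

# Example with the PMF from the question (action 7 is index 6 here):
print(rank_by_sampling([0.00714] * 6 + [0.95714]))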