What does non-stationarity mean, and how do I implement it in a 10-armed bandit problem in reinforcement learning?

I have started learning reinforcement learning and am referring to Sutton's book. I am trying to understand the non-stationary environment the book mentions:

suppose the bandit task were nonstationary, that is, that the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one.

This tells me that the true expected reward of an action changes over time. But does that mean it changes at every time step? I can see how we would track the reward estimates in that case, namely by weighting recent rewards more heavily than older ones. But does it also mean that the target, i.e. the true value, changes at every time step? I am trying to simulate the 10-armed bandit problem for the same figure as in the book, where Upper Confidence-Bound action selection and the epsilon-greedy method are compared using sample averages to estimate the action values in a stationary environment.
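
For what it's worth, here is a small numeric check of that weighting, as a sketch of my own rather than anything from the book's code: with a constant step size alpha, the incremental update Q_{n+1} = Q_n + alpha * (R_n - Q_n) puts weight alpha * (1 - alpha)^(n - i) on the i-th of n rewards, so recent rewards count more, whereas the sample average weights every reward equally by 1/n.

import numpy as np

alpha = 0.1
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(rewards)

# incremental constant step-size update, starting from Q_1 = 0
q = 0.0
for r in rewards:
    q += alpha * (r - q)

# explicit exponential recency weighting of the same rewards
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
assert np.isclose(q, np.dot(weights, rewards))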

If I have to simulate the same setup in a non-stationary environment, how would I do it? Below is my code:

import numpy as np
from tqdm import trange


class NArmedBandit:

    # 10-armed bandit testbed with sample-average or constant step-size estimates
    def __init__(self,k=10,step_size = 0.1,eps = 0,UCB_c = None, sample_avg_flag = False,
                 init_estimates = 0.0,mu = 0, std_dev = 1):
        self.k = k
        self.step_size = step_size
        self.eps = eps
        self.init_estimates = init_estimates
        self.mu = mu
        self.std_dev = std_dev
        self.actions = np.zeros(k)
        self.true_reward = 0.0
        self.UCB_c = UCB_c
        self.sample_avg_flag = sample_avg_flag
        self.re_init()


    def re_init(self):

        #true values of rewards for each action
        self.actions = np.random.normal(self.mu,self.std_dev,self.k) 

        # estimation for each action
        self.Q_t = np.zeros(self.k) + self.init_estimates

        # num of chosen times for each action
        self.N_t = np.zeros(self.k)

        # optimal action under the current true values
        self.optim_action = np.argmax(self.actions)

        self.time_step = 0


    def act(self):
        val = np.random.rand()
        if val < self.eps:
            action = np.random.choice(np.arange(self.k))
            #print('action 1:',action)
        elif self.UCB_c is not None:
            #1e-5 is added so as to avoid division by zero
            ucb_estimates = self.Q_t + self.UCB_c * np.sqrt(np.log(self.time_step + 1) / (self.N_t + 1e-5))
            A_t = np.max(ucb_estimates)
            action = np.random.choice(np.where(ucb_estimates == A_t)[0])
        else:
            A_t = np.max(self.Q_t)
            action = np.random.choice(np.where(self.Q_t == A_t)[0])
            #print('action 2:',action)
        return action



    def step(self,action):

        # generating the reward under N(real reward, 1)
        reward = np.random.randn() + self.actions[action]
        self.time_step += 1
        self.N_t[action] += 1


        # estimation with sample averages
        if self.sample_avg_flag:
            self.Q_t[action] += (reward - self.Q_t[action]) / self.N_t[action]
        else:
            # constant step size (exponential recency-weighted average,
            # suited to non-stationary problems)
            self.Q_t[action] += self.step_size * (reward - self.Q_t[action])

        return reward


    def play(self,tasks,num_time_steps):
        rewards = np.zeros((tasks, num_time_steps))
        optim_action_counts = np.zeros(rewards.shape)
        for task in trange(tasks):
            self.re_init()
            for t in range(num_time_steps):
                action = self.act()
                reward = self.step(action)
                rewards[task, t] = reward
                if action == self.optim_action:
                    optim_action_counts[task, t] = 1
        avg_optim_action_counts = optim_action_counts.mean(axis=0)
        avg_rewards = rewards.mean(axis=0)
        return avg_optim_action_counts, avg_rewards
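
For context, this is roughly how I run the comparison on the stationary testbed; the parameter values are just examples I picked, and matplotlib is only used for the plot:

import matplotlib.pyplot as plt

# sample-average estimates, eps-greedy vs. UCB action selection
eps_bandit = NArmedBandit(k=10, eps=0.1, sample_avg_flag=True)
ucb_bandit = NArmedBandit(k=10, UCB_c=2, sample_avg_flag=True)

_, eps_rewards = eps_bandit.play(tasks=2000, num_time_steps=1000)
_, ucb_rewards = ucb_bandit.play(tasks=2000, num_time_steps=1000)

plt.plot(eps_rewards, label='eps-greedy, eps = 0.1')
plt.plot(ucb_rewards, label='UCB, c = 2')
plt.xlabel('Steps')
plt.ylabel('Average reward')
plt.legend()
plt.show()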

Should I change the actions array (which holds the true values) defined in re_init(), for example by calling re_init() after every time step inside play(), so that the true expected reward of every action changes at each time step? I have already incorporated the update for a non-stationary setting, using a constant step size alpha = 0.1, in the act() and step() functions. The only thing I don't know is how to set up or simulate the non-stationary environment here, and whether I have understood it correctly.

Your understanding of non-stationarity is correct: as you quoted, "the true values of the actions changed over time."

But how do they change?

That is not explicitly defined. From my point of view, your re_init approach is correct; what you have to decide is when the values change. One thing is clear, though: if you re-draw the rewards at every single step, there is nothing to learn, because everything the agent is trying to learn changes at every step. I can suggest two solutions that satisfy the definition of non-stationarity (a code sketch for both follows the list below).

  1. Call re_init every 100 or 1000 steps, or call it at each step with a small probability eps.

  2. Start from the initial true values and add a small random +/- increment to each of them at every step, so the true rewards drift in a positive or negative direction (a random walk).
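
Here is a minimal sketch of both options, assuming the NArmedBandit class from your question. The method names reset_true_values / drift_true_values and the parameter drift_std are mine, not from the book or your code. Both options only touch the true values (actions) and the recorded optimal action, so the learner's estimates Q_t and N_t keep having to track a moving target:

class NonStationaryBandit(NArmedBandit):

    # Option 1: occasionally re-draw all true action values from N(mu, std_dev)
    def reset_true_values(self):
        self.actions = np.random.normal(self.mu, self.std_dev, self.k)
        self.optim_action = np.argmax(self.actions)

    # Option 2: random-walk drift -- add a small independent Gaussian increment
    # to every true value on every step, so they slowly wander up or down
    def drift_true_values(self, drift_std=0.01):
        self.actions += np.random.normal(0.0, drift_std, self.k)
        self.optim_action = np.argmax(self.actions)

    def step(self, action):
        reward = super().step(action)
        # pick one: drift every step (option 2) ...
        self.drift_true_values()
        # ... or re-draw on a schedule or with a small probability (option 1), e.g.
        # if self.time_step % 1000 == 0: self.reset_true_values()
        return reward

With either variant you can keep play() unchanged, and comparing sample_avg_flag=True against the constant step-size update should show the expected result that sample averages have trouble keeping up with a non-stationary target.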