策略网络为批处理状态和单个状态返回不同的输出
Policy Network returning different outputs for batched states and individual states
我正在实施应用于 CartPole-V0 openAI 健身房环境的 REINFORCE。我正在尝试相同的 2 种不同实现,我无法解决的问题如下:
将单个状态传递到策略网络后,我得到一个大小为 2 的输出张量,其中包含 2 个动作的动作概率。但是,当我将“一批状态”传递给策略网络以计算所有状态的输出动作概率时,我获得的值与每个状态单独传递给网络时的值非常不同。
谁能帮我理解这个问题?
我的代码如下:(注意:这不是完整的强化算法——我知道我需要根据概率计算损失。但我试图理解计算中的差异在继续之前,我认为这两个概率应该是相同的。)
# architecture of the Policy Network
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, n_actions):
super().__init__()
self.n_actions = n_actions
self.model = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, n_actions),
nn.Softmax(dim=0)
).float()
def forward(self, X):
return self.model(X)
def train_reinforce_agent(env, episode_length, max_episodes, gamma, visualize_step, learning_rate=0.003):
# define the parametric model for the Policy: this is an instantiation of the PolicyNetwork class
model = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
# define the optimizer for updating the weights of the Policy Network
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# hyperparameters of the reinforce agent
EPISODE_LENGTH = episode_length
MAX_EPISODES = max_episodes
GAMMA = gamma
VISUALIZE_STEP = max(1, visualize_step)
score = []
for episode in range(MAX_EPISODES):
# reset the environment
curr_state = env.reset()
done = False
episode_t = []
# rollout an entire episode from the Policy Network
pred_vals = []
for t in range(EPISODE_LENGTH):
act_prob = model(torch.from_numpy(curr_state).float())
pred_vals.append(act_prob)
action = np.random.choice(np.array(list(range(env.action_space.n))), p=act_prob.data.numpy())
prev_state = curr_state
curr_state, _, done, info = env.step(action)
episode_t.append((prev_state, action, t+1))
if done:
break
score.append(len(episode_t))
# reward_batch = torch.Tensor([r for (s,a,r) in episode_t]).flip(dims=(0,))
reward_batch = torch.Tensor([r for (s, a, r) in episode_t])
# compute the return for every state-action pair from the rewards at every time-step
batch_Gvals = []
for i in range(len(episode_t)):
new_Gval = 0
power = 0
for j in range(i, len(episode_t)):
new_Gval = new_Gval + ((GAMMA ** power) * reward_batch[j]).numpy()
power += 1
batch_Gvals.append(new_Gval)
# normalize the returns for the batch
expected_returns_batch = torch.FloatTensor(batch_Gvals)
if torch.is_nonzero(expected_returns_batch.max()):
expected_returns_batch /= expected_returns_batch.max()
# batch the states, actions, prob after the episode
state_batch = torch.Tensor([s for (s,a,r) in episode_t])
print("State batch:", state_batch)
all_states = [s for (s,a,r) in episode_t]
print("All states:", all_states)
action_batch = torch.Tensor([a for (s,a,r) in episode_t])
pred_batch_v1 = model(state_batch)
pred_batch_v2 = torch.stack(pred_vals)
print("Batched state pred_vals:", pred_batch_v1)
print("Individual state pred_vals:", pred_batch_v2) ### Why is this different from the above predicted values??
我传递环境的主要功能是:
def main():
env = gym.make('CartPole-v0')
# train a REINFORCE-agent to learn the optimal policy
episode_length = 500
n_episodes = 500
gamma = 0.99
vis_steps = 50
train_reinforce_agent(env, episode_length, n_episodes, gamma, vis_steps)
在你的策略中,你的 Softmax 超过暗淡的 0。这规范化了你的批次中每个动作的概率。您想通过 dim=1.
跨操作执行此操作
我正在实施应用于 CartPole-V0 openAI 健身房环境的 REINFORCE。我正在尝试相同的 2 种不同实现,我无法解决的问题如下:
将单个状态传递到策略网络后,我得到一个大小为 2 的输出张量,其中包含 2 个动作的动作概率。但是,当我将“一批状态”传递给策略网络以计算所有状态的输出动作概率时,我获得的值与每个状态单独传递给网络时的值非常不同。
谁能帮我理解这个问题?
我的代码如下:(注意:这不是完整的强化算法——我知道我需要根据概率计算损失。但我试图理解计算中的差异在继续之前,我认为这两个概率应该是相同的。)
# architecture of the Policy Network
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, n_actions):
super().__init__()
self.n_actions = n_actions
self.model = nn.Sequential(
nn.Linear(state_dim, 64),
nn.ReLU(),
nn.Linear(64, n_actions),
nn.Softmax(dim=0)
).float()
def forward(self, X):
return self.model(X)
def train_reinforce_agent(env, episode_length, max_episodes, gamma, visualize_step, learning_rate=0.003):
# define the parametric model for the Policy: this is an instantiation of the PolicyNetwork class
model = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
# define the optimizer for updating the weights of the Policy Network
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# hyperparameters of the reinforce agent
EPISODE_LENGTH = episode_length
MAX_EPISODES = max_episodes
GAMMA = gamma
VISUALIZE_STEP = max(1, visualize_step)
score = []
for episode in range(MAX_EPISODES):
# reset the environment
curr_state = env.reset()
done = False
episode_t = []
# rollout an entire episode from the Policy Network
pred_vals = []
for t in range(EPISODE_LENGTH):
act_prob = model(torch.from_numpy(curr_state).float())
pred_vals.append(act_prob)
action = np.random.choice(np.array(list(range(env.action_space.n))), p=act_prob.data.numpy())
prev_state = curr_state
curr_state, _, done, info = env.step(action)
episode_t.append((prev_state, action, t+1))
if done:
break
score.append(len(episode_t))
# reward_batch = torch.Tensor([r for (s,a,r) in episode_t]).flip(dims=(0,))
reward_batch = torch.Tensor([r for (s, a, r) in episode_t])
# compute the return for every state-action pair from the rewards at every time-step
batch_Gvals = []
for i in range(len(episode_t)):
new_Gval = 0
power = 0
for j in range(i, len(episode_t)):
new_Gval = new_Gval + ((GAMMA ** power) * reward_batch[j]).numpy()
power += 1
batch_Gvals.append(new_Gval)
# normalize the returns for the batch
expected_returns_batch = torch.FloatTensor(batch_Gvals)
if torch.is_nonzero(expected_returns_batch.max()):
expected_returns_batch /= expected_returns_batch.max()
# batch the states, actions, prob after the episode
state_batch = torch.Tensor([s for (s,a,r) in episode_t])
print("State batch:", state_batch)
all_states = [s for (s,a,r) in episode_t]
print("All states:", all_states)
action_batch = torch.Tensor([a for (s,a,r) in episode_t])
pred_batch_v1 = model(state_batch)
pred_batch_v2 = torch.stack(pred_vals)
print("Batched state pred_vals:", pred_batch_v1)
print("Individual state pred_vals:", pred_batch_v2) ### Why is this different from the above predicted values??
我传递环境的主要功能是:
def main():
env = gym.make('CartPole-v0')
# train a REINFORCE-agent to learn the optimal policy
episode_length = 500
n_episodes = 500
gamma = 0.99
vis_steps = 50
train_reinforce_agent(env, episode_length, n_episodes, gamma, vis_steps)
在你的策略中,你的 Softmax 超过暗淡的 0。这规范化了你的批次中每个动作的概率。您想通过 dim=1.
跨操作执行此操作