How many epochs are required for training a model with an LSTM?
My AI uses an nn Actor-Critic TD3 model with an LSTM.
For every training run I create batches of sequential data and train my AI on them.
Could an expert please tell me whether epochs are also needed for this AI, and roughly how many epochs I could run with this code? Since I already create many batches within a single training step, can there also be epochs on top of that?
Below is the code for the training step:
def train(
    self,
    replay_buffer,
    iterations,
    batch_size=50,
    discount=0.99,
    tau=0.005,
    policy_noise=0.2,
    noise_clip=0.5,
    policy_freq=2,
):
    b_state = torch.Tensor([])
    b_next_state = torch.Tensor([])
    b_done = torch.Tensor([])
    b_reward = torch.Tensor([])
    b_action = torch.Tensor([])
    for it in range(iterations):
        # print('it: ', it, ' iterations: ', iterations)
        # Step 4: We sample a batch of transitions (s, s', a, r) from the memory
        (batch_states, batch_next_states, batch_actions,
         batch_rewards, batch_dones) = \
            replay_buffer.sample(batch_size)
        batch_states = batch_states.astype(float)
        batch_next_states = batch_next_states.astype(float)
        batch_actions = batch_actions.astype(float)
        batch_rewards = batch_rewards.astype(float)
        batch_dones = batch_dones.astype(float)
        state = torch.from_numpy(batch_states)
        next_state = torch.from_numpy(batch_next_states)
        action = torch.from_numpy(batch_actions)
        reward = torch.from_numpy(batch_rewards)
        done = torch.from_numpy(batch_dones)
        b_size = 1
        seq_len = state.shape[0]
        batch = b_size
        input_size = state_dim
        state = torch.reshape(state, (1, seq_len, state_dim))
        next_state = torch.reshape(next_state, (1, seq_len, state_dim))
        done = torch.reshape(done, (1, seq_len, 1))
        reward = torch.reshape(reward, (1, seq_len, 1))
        action = torch.reshape(action, (1, seq_len, action_dim))
        b_state = torch.cat((b_state, state), dim=0)
        b_next_state = torch.cat((b_next_state, next_state), dim=0)
        b_done = torch.cat((b_done, done), dim=0)
        b_reward = torch.cat((b_reward, reward), dim=0)
        b_action = torch.cat((b_action, action), dim=0)
        # state = torch.reshape(state, (seq_len, 1, state_dim))
        # next_state = torch.reshape(next_state, (seq_len, 1, state_dim))
        # done = torch.reshape(done, (seq_len, 1, 1))
        # reward = torch.reshape(reward, (seq_len, 1, 1))
        # action = torch.reshape(action, (seq_len, 1, action_dim))
        # b_state = torch.cat((b_state, state), dim=1)
        # b_next_state = torch.cat((b_next_state, next_state), dim=1)
        # b_done = torch.cat((b_done, done), dim=1)
        # b_reward = torch.cat((b_reward, reward), dim=1)
        # b_action = torch.cat((b_action, action), dim=1)
        print("dim state:", b_state.shape)
        # For h and c the shape is (num_layers * num_directions, batch, hidden_size)
        ha0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
        ca0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
        hc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)
        cc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)
        # Step 5: From the next state s', the Actor target plays the next action a'
        b_next_action = self.actor_target(b_next_state, (ha0, ca0))
        b_next_action = b_next_action[0]
        # Step 6: We add Gaussian noise to this next action a' and clamp it
        # to a range of values supported by the environment
        noise = torch.Tensor(b_next_action).data.normal_(0, policy_noise)
        noise = noise.clamp(-noise_clip, noise_clip)
        b_next_action = (b_next_action + noise).clamp(-self.max_action, self.max_action)
        # Step 7: The two Critic targets each take the couple (s', a') as input
        # and return two Q-values Qt1(s',a') and Qt2(s',a') as outputs
        result = self.critic_target(b_next_state, b_next_action, (hc0, cc0))
        target_Q1 = result[0]
        target_Q2 = result[1]
        # Step 8: We keep the minimum of these two Q-values: min(Qt1, Qt2)
        target_Q = torch.min(target_Q1, target_Q2).double()
        # Step 9: We get the final target of the two Critic models, which is:
        # Qt = r + γ * min(Qt1, Qt2), where γ is the discount factor
        target_Q = b_reward + (1 - b_done) * discount * target_Q
        # Step 10: The two Critic models each take the couple (s, a) as input
        # and return two Q-values Q1(s,a) and Q2(s,a) as outputs
        b_action_reshape = torch.reshape(b_action, b_next_action.shape)
        result = self.critic(b_state, b_action_reshape, (hc0, cc0))
        current_Q1 = result[0]
        current_Q2 = result[1]
        # Step 11: We compute the loss coming from the two Critic models:
        # Critic Loss = MSE_Loss(Q1(s,a), Qt) + MSE_Loss(Q2(s,a), Qt)
        critic_loss = F.mse_loss(current_Q1, target_Q) \
            + F.mse_loss(current_Q2, target_Q)
        # Step 12: We backpropagate this Critic loss and update the parameters
        # of the two Critic models with an SGD optimizer
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        # Step 13: Once every two iterations, we update our Actor model by performing
        # gradient ascent on the output of the first Critic model
        out = self.actor(b_state, (ha0, ca0))
        out = out[0]
        (actor_loss, hx, cx) = self.critic.Q1(b_state, out, (hc0, cc0))
        actor_loss = -1 * actor_loss.mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        # Step 14: Still once every two iterations, we update the weights of the
        # Actor target by Polyak averaging
        for (param, target_param) in zip(self.actor.parameters(),
                                         self.actor_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
        # Step 15: Still once every two iterations, we update the weights of the
        # Critic target by Polyak averaging
        for (param, target_param) in zip(self.critic.parameters(),
                                         self.critic_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
First of all, I would like to say that reinforcement learning in general requires a lot of training. However, this depends heavily on the complexity of the problem you are trying to solve, and it can get worse if your model is complex (for example, when an LSTM is used). When training agents to play Atari games, you may need up to 1 million episodes (depending on the game and the method used).

Regarding epochs: if you mean repeatedly using a particular episode (or collection of episodes) for training, then it depends on whether you are using an on-policy or an off-policy method (for off-policy this is more like experience replay, and "epoch" is the wrong word). Actor-Critic methods are in general on-policy, meaning they require fresh data at every training stage. Once an episode has been used for training, it should not be used again. For more on the difference between on-policy and off-policy methods, I recommend having a look at Sutton's book.
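To make the experience-replay case concrete: the train() method above already draws its data from replay_buffer.sample(), which is how an off-policy method such as TD3 reuses past transitions. Below is a minimal sketch of the surrounding training loop, assuming a Gym-style env, an agent exposing select_action() plus the train() method shown above, a replay_buffer with an add() method, and a max_action bound; these names are illustrative placeholders, not your actual code.

    import numpy as np

    total_episodes = 1000   # how long to train is problem-dependent; it is not an "epoch" count
    expl_noise = 0.1        # exploration noise added to actions while collecting data

    for episode in range(total_episodes):
        state, done, episode_steps = env.reset(), False, 0
        while not done:
            # Act, explore, and store the transition in the replay buffer
            action = agent.select_action(np.array(state))
            action = (action + np.random.normal(0, expl_noise, size=action.shape)).clip(
                -max_action, max_action)
            next_state, reward, done, _ = env.step(action)
            replay_buffer.add((state, next_state, action, reward, float(done)))
            state = next_state
            episode_steps += 1
        # One train() call per episode: each of the `iterations` gradient steps
        # resamples transitions from the whole buffer, so old data is reused
        # implicitly through experience replay rather than through explicit epochs.
        agent.train(replay_buffer, iterations=episode_steps)

With an on-policy actor-critic you would instead discard the collected episode after the update and gather fresh data, so looping extra "epochs" over old episodes would not be appropriate.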