DOUBLE DQN doesn't make any sense
Why use 2 networks, train once per episode, and update the target network every N episodes, when we could just use 1 network and train it once every N episodes? There is no difference at all!
What you are describing is not Double DQN. The periodically updated target network is a core feature of the original DQN algorithm (and of all its derivatives). DeepMind's classic paper explains why having two networks is essential:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
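To make the distinction concrete, here is a minimal PyTorch sketch (names like `online_net` and `make_q_net` are illustrative, not from the paper). Both variants use the periodic clone exactly as the quote describes; what Double DQN (van Hasselt et al., 2015) changes is only how the bootstrap target is computed: the online network selects the argmax action and the frozen target network evaluates it, instead of the target network doing both.

```python
import copy
import torch
import torch.nn as nn

def make_q_net(obs_dim=4, n_actions=2):
    # A small illustrative Q-network; the architecture is arbitrary here.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

online_net = make_q_net()
target_net = copy.deepcopy(online_net)  # the clone Q̂ from the quote

gamma = 0.99

def dqn_target(r, s_next, done):
    # Vanilla DQN: the target network both selects and evaluates
    # the next action via the max.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * q_next

def double_dqn_target(r, s_next, done):
    # Double DQN: online_net picks the action, target_net scores it.
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)
    return r + gamma * (1.0 - done) * q_next

# Every C gradient updates, refresh the frozen copy (this is the
# "every C updates we clone the network Q" step from the quote):
C = 1000
for step in range(10_000):
    # ... sample a batch, compute the loss of online_net's predictions
    # against one of the targets above, and take an optimizer step ...
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())

# Quick smoke test with a random batch of transitions:
s_next = torch.randn(32, 4)
r = torch.zeros(32)
done = torch.zeros(32)
print(dqn_target(r, s_next, done).shape)         # torch.Size([32])
print(double_dqn_target(r, s_next, done).shape)  # torch.Size([32])
```

Note that the `load_state_dict` refresh every C updates appears in both variants; removing it (or collapsing everything into one network trained every N episodes, as you propose) removes exactly the stabilizing delay the quoted passage is about.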