Why is a shallow copy needed for my values dictionary to update correctly?

I am working with Markov Decision Processes (MDPs) in Python 2.7.11, inside an Agent class, to search for an optimal policy π in a GridWorld. I am implementing basic value iteration, running 100 iterations over all of the GridWorld states, using the following Bellman equation:
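
In the notation of the Q-value comment in my code below (T for transition probabilities, R for rewards, γ for the discount), the update amounts to:

V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right]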

This equation is derived by taking the maximum of the Q-value function, which is what I use in my program:
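
Here Q_k is computed from the current value estimates, matching the comment in computeQValueFromValues:

Q_k(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right], \qquad V_{k+1}(s) = \max_{a} Q_k(s, a)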

When my Agent is constructed, it is passed an MDP, an abstract class that contains the following methods:

# Returns all states in the GridWorld
def getStates()

# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)

# Returns all possible successor states to transition to from the current state 
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)

# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
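
For context, the Q-value computation below unpacks the return value of getTransitionStatesAndProbs as (nextState, probability) pairs; a hypothetical call (state names and numbers made up purely for illustration) would look like:

# Hypothetical example of the assumed return shape, not real GridWorld output
successors = mdp.getTransitionStatesAndProbs((0, 0), 'north')
# e.g. [((0, 1), 0.8), ((1, 0), 0.1), ((0, 0), 0.1)]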

My Agent is also passed a discount factor and a number of iterations, and I use a dictionary to keep track of my values. Here is my code:

class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter() # A Counter is a dictionary with default 0

        for transition in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value

                valuesCopy.update({state: convergedValue})

            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        reward = self.mdp.getReward(state, action, successors)
        qValue = 0
        for successor, probability in successors:
            # The Q value equation: Q*(a,s) = T(s,a,s')[R(s,a,s') + gamma(V*(s'))]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue

This implementation is correct, but I am not sure why I need valuesCopy to successfully update my self.values dictionary. I tried the following, omitting the copy, and it does not work because it returns slightly incorrect values:

for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value

        self.values.update({state: convergedValue})

My question is: given that valuesCopy = self.values.copy() copies the dictionary on every iteration anyway, why is a copy of my self.values dictionary necessary for my values to update correctly? Shouldn't updating the values in the original produce the same result?

Whether or not you have the copy makes an algorithmic difference:

# You update your copy here, so the original will be used unchanged, which is not the 
# case if you don't have the copy
valuesCopy.update({state: convergedValue})

# If you have the copy, you'll be using the old value stored in self.values here, 
# not the updated one
qValue += probability * (reward + (self.discount * self.values[successor]))
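
Here is a minimal sketch of that difference on a made-up two-state chain (the transitions, rewards, and q_value helper below are illustrative stand-ins for the MDP methods above, not part of your code): A reaches B with reward 0, B loops on itself with reward 1, and the discount is 0.9. After one sweep over the states in the order B, A, the two variants already disagree on A:

gamma = 0.9
transitions = {
    ('A', 'go'): [('B', 1.0)],   # from A, 'go' reaches B with probability 1
    ('B', 'go'): [('B', 1.0)],   # B loops on itself
}
rewards = {('A', 'go', 'B'): 0.0, ('B', 'go', 'B'): 1.0}

def q_value(values, state, action):
    # Q(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]
    return sum(prob * (rewards[(state, action, nxt)] + gamma * values[nxt])
               for nxt, prob in transitions[(state, action)])

# Synchronous sweep (with the copy): every Q value reads the old values.
old = {'A': 0.0, 'B': 0.0}
new = old.copy()
for state in ['B', 'A']:
    new[state] = q_value(old, state, 'go')
print(sorted(new.items()))      # [('A', 0.0), ('B', 1.0)]

# In-place sweep (no copy): B is updated first, and A's Q value already sees it.
values = {'A': 0.0, 'B': 0.0}
for state in ['B', 'A']:
    values[state] = q_value(values, state, 'go')
print(sorted(values.items()))   # [('A', 0.9), ('B', 1.0)]

Both variants converge to the same fixed point in the limit (the in-place sweep is the "Gauss-Seidel" style of value iteration), but after a fixed number of iterations, such as your 100, the intermediate values can differ, which is the slight discrepancy you observed.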