Why is a shallow copy needed for my values dictionary to update correctly?

I am working with Markov Decision Processes (MDPs) in Python 2.7.11, inside an Agent class, to search for an optimal policy π in a GridWorld. I am implementing basic value iteration, running 100 iterations over all of the GridWorld states, using the following Bellman equation:
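
In the notation of the Q-value comment in my code below (T for transition probabilities, R for rewards, γ for the discount), the update amounts to:

V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right]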

This equation is derived by taking the maximum of the Q-value function, which is what I use in my program:
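
Here Q_k is computed from the current value estimates, matching the comment in computeQValueFromValues:

Q_k(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right], \qquad V_{k+1}(s) = \max_{a} Q_k(s, a)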

When my Agent is constructed, it is passed an MDP, an abstract class that contains the following methods:

# Returns all states in the GridWorld
def getStates()

# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)

# Returns all possible successor states to transition to from the current state 
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)

# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
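
For context, the Q-value computation below unpacks the return value of getTransitionStatesAndProbs as (nextState, probability) pairs; a hypothetical call (state names and numbers made up purely for illustration) would look like:

# Hypothetical example of the assumed return shape, not real GridWorld output
successors = mdp.getTransitionStatesAndProbs((0, 0), 'north')
# e.g. [((0, 1), 0.8), ((1, 0), 0.1), ((0, 0), 0.1)]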

My Agent is also passed a discount factor and a number of iterations, and I use a dictionary to keep track of my values. Here is my code:

class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter() # A Counter is a dictionary with default 0

        for transition in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value

                valuesCopy.update({state: convergedValue})

            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        reward = self.mdp.getReward(state, action, successors)
        qValue = 0
        for successor, probability in successors:
            # The Q value equation: Q*(a,s) = T(s,a,s')[R(s,a,s') + gamma(V*(s'))]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue

This implementation is correct, but I am not sure why I need valuesCopy to successfully update my self.values dictionary. I tried the following, omitting the copy, and it does not work because it returns slightly incorrect values:

for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value

        self.values.update({state: convergedValue})

My question is: given that valuesCopy = self.values.copy() copies the dictionary on every iteration anyway, why is a copy of my self.values dictionary necessary for my values to update correctly? Shouldn't updating the values in the original produce the same result?

Whether or not you have the copy makes an algorithmic difference:

# You update your copy here, so the original will be used unchanged, which is not the 
# case if you don't have the copy
valuesCopy.update({state: convergedValue})

# If you have the copy, you'll be using the old value stored in self.values here, 
# not the updated one
qValue += probability * (reward + (self.discount * self.values[successor]))
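
Here is a minimal sketch of that difference on a made-up two-state chain (the transitions, rewards, and q_value helper below are illustrative stand-ins for the MDP methods above, not part of your code): A reaches B with reward 0, B loops on itself with reward 1, and the discount is 0.9. After one sweep over the states in the order B, A, the two variants already disagree on A:

gamma = 0.9
transitions = {
    ('A', 'go'): [('B', 1.0)],   # from A, 'go' reaches B with probability 1
    ('B', 'go'): [('B', 1.0)],   # B loops on itself
}
rewards = {('A', 'go', 'B'): 0.0, ('B', 'go', 'B'): 1.0}

def q_value(values, state, action):
    # Q(s,a) = sum over s' of T(s,a,s') * [R(s,a,s') + gamma * V(s')]
    return sum(prob * (rewards[(state, action, nxt)] + gamma * values[nxt])
               for nxt, prob in transitions[(state, action)])

# Synchronous sweep (with the copy): every Q value reads the old values.
old = {'A': 0.0, 'B': 0.0}
new = old.copy()
for state in ['B', 'A']:
    new[state] = q_value(old, state, 'go')
print(sorted(new.items()))      # [('A', 0.0), ('B', 1.0)]

# In-place sweep (no copy): B is updated first, and A's Q value already sees it.
values = {'A': 0.0, 'B': 0.0}
for state in ['B', 'A']:
    values[state] = q_value(values, state, 'go')
print(sorted(values.items()))   # [('A', 0.9), ('B', 1.0)]

Both variants converge to the same fixed point in the limit (the in-place sweep is the "Gauss-Seidel" style of value iteration), but after a fixed number of iterations, such as your 100, the intermediate values can differ, which is the slight discrepancy you observed.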