Why is shallow copy needed for my values dictionary to correctly update?
I am using a Markov Decision Process (MDP) in an Agent class in Python 2.7.11 to search for an optimal policy π in a GridWorld. I am implementing basic value iteration for 100 iterations over all GridWorld states, using the following Bellman equation:

Vk+1(s) = max_a Σ_s' T(s,a,s') [R(s,a,s') + γ·Vk(s')]

where:
- T(s,a,s') is the probability function of successfully transitioning to the successor state s' from the current state s by taking action a.
- R(s,a,s') is the reward for transitioning from s to s'.
- γ (gamma) is the discount factor, where 0 ≤ γ ≤ 1.
- Vk(s') is a recursive call to repeat the calculation once s' has been reached.
- Vk+1(s) represents how, after enough iterations k, the Vk values converge and become equivalent to Vk+1.
This equation is derived by taking the maximum of a Q-value function, which is what I use in my program:

Qk+1(s,a) = Σ_s' T(s,a,s') [R(s,a,s') + γ·Vk(s')], so that Vk+1(s) = max_a Qk+1(s,a)
When my Agent is constructed, it is passed an MDP, which is an abstract class containing the following methods:
# Returns all states in the GridWorld
def getStates()
# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)
# Returns all possible successor states to transition to from the current state
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)
# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
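
To make the interface concrete, here is a minimal sketch of a toy MDP that provides these four methods. The class name ToyMDP, the two-state layout, and the probabilities and rewards are invented for illustration; they are not part of the actual GridWorld code:

class ToyMDP:
    # Hypothetical two-state MDP used only to illustrate the interface.
    # 'B' acts as a terminal state with no legal actions.
    def getStates(self):
        return ['A', 'B']

    def getPossibleActions(self, state):
        return ['go'] if state == 'A' else []

    def getTransitionStatesAndProbs(self, state, action):
        # From 'A', taking 'go' reaches 'B' with probability 0.8
        # and stays in 'A' with probability 0.2.
        return [('B', 0.8), ('A', 0.2)]

    def getReward(self, state, action, nextState):
        return 1.0 if nextState == 'B' else 0.0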
My Agent is also passed the discount factor and the number of iterations. I am also using a dictionary to keep track of my values. Here is my code:
class IterationAgent:

    def __init__(self, mdp, discount=0.9, iterations=100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter()  # A Counter is a dictionary with default 0

        for transition in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value
                # Write the best Q-value into the copy, so self.values stays
                # frozen until the full sweep over all states is finished
                valuesCopy.update({state: convergedValue})
            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        reward = self.mdp.getReward(state, action, successors)
        qValue = 0
        for successor, probability in successors:
            # The Q value equation: Q*(a,s) = T(s,a,s')[R(s,a,s') + gamma(V*(s'))]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue
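
For what it's worth, assuming the hypothetical ToyMDP sketched above, the agent could be exercised like this (util here is presumably the Berkeley Pacman project's util module, whose Counter behaves like a dictionary that defaults to 0):

agent = IterationAgent(ToyMDP(), discount=0.9, iterations=100)
print(agent.values)  # converged values for states 'A' and 'B'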
This implementation is correct, but I am not sure why I need valuesCopy to successfully update my self.values dictionary. I tried the following, omitting the copy, but it does not work, because it returns slightly incorrect values:
for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value
        self.values.update({state: convergedValue})
My question is: given that valuesCopy = self.values.copy() makes a copy of the dictionary on every iteration anyway, why is keeping that copy of my self.values dictionary necessary for my values to update correctly? Shouldn't updating the values in the original produce the same result?
There is an algorithmic difference in having the copy:
# You update your copy here, so the original will be used unchanged, which is not the
# case if you don't have the copy
valuesCopy.update({state: convergedValue})
# If you have the copy, you'll be using the old value stored in self.values here,
# not the updated one
qValue += probability * (reward + (self.discount * self.values[successor]))
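
To see the difference concretely, the sketch below runs one-action value iteration both ways on a hypothetical two-state chain (the transitions and rewards tables and the sweep helper are invented for illustration). With the copy, every state in a sweep is backed up from the same frozen snapshot of the previous iteration's values (synchronous value iteration); without it, states later in the sweep already see the freshly written values of earlier states (an in-place, Gauss-Seidel-style update), so after a fixed number of iterations the two runs hold slightly different numbers:

# Hypothetical example: two states, one deterministic action each,
# reward 1.0 for leaving s0, discount 0.9.
transitions = {'s0': [('s1', 1.0)], 's1': [('s0', 1.0)]}
rewards = {'s0': 1.0, 's1': 0.0}
gamma = 0.9

def sweep(values, in_place):
    # One pass over all states; in_place=True mimics omitting the copy.
    source = values if in_place else dict(values)  # frozen snapshot when copied
    for state in ('s0', 's1'):
        values[state] = sum(p * (rewards[state] + gamma * source[nxt])
                            for nxt, p in transitions[state])

withCopy = {'s0': 0.0, 's1': 0.0}
noCopy = {'s0': 0.0, 's1': 0.0}
for _ in range(3):
    sweep(withCopy, in_place=False)
    sweep(noCopy, in_place=True)

print(withCopy)  # approximately {'s0': 1.81, 's1': 0.9}
print(noCopy)    # approximately {'s0': 2.4661, 's1': 2.21949}

Both variants converge to the same fixed point in the limit (the in-place form is Gauss-Seidel value iteration, which typically converges in fewer sweeps), but after a fixed number of iterations the intermediate values differ, which is exactly the "slightly incorrect values" observed above.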