Variable updating wrong in loop - Python (Q-learning)

Why do position and newposition give the same output and update together in the next loop?

for game in range(nr_of_games):
    # Initialize the player at the start position and store the current position in position
    position=np.array([0,19])

    status = -1
    # loop over steps taken by the player
    while status == -1: #the status of the game is -1, terminate if 1 (see status_list above)

        # Find out what move to make using  
        q_in=Q[position[0],position[1]]

        
        move, action = action_fcn(q_in,epsilon,wind)
        
        # update location, check grid,reward_list, and status_list 
        
        newposition[0] = position[0] + move[0]
        newposition[1] = position[1] + move[1]
        
        print('new loop')
        print(newposition)
        print(position)
        
        
        grid_state = grid[newposition[0]][newposition[1]]
        reward = reward_list[grid_state]
        
        status = status_list[grid_state]
        status = int(status)
        
        if status == 1:
            Q[position[0],position[1],action]= reward
            break #Game over 
            
        else: Q[position[0],position[1],action]= (1-alpha)*Q[position[0],position[1],action]+alpha*(reward+gamma*Q[newposition[0],newposition[1],action])
           
        position = newposition

This prints:

new loop
[16 26]
[16 26]
new loop
[17 26]
[17 26]
new loop
[18 26]
[18 26]
new loop
[19 26]
[19 26]
new loop
[19 25]
[19 25]
new loop
[20 25]
[20 25]

Evidently, somewhere in the code you did not show us, you did:

>>> newposition = position

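So when you increment newposition, you are actually incrementing position as well.

Just make newposition a different object from position. I mean, make sure that id(newposition) != id(position), and you'll be fine. Right now I'm guessing the two ids are the same, aren't they?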
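
A minimal sketch of that aliasing, using a throwaway starting position rather than the asker's full program:

```python
import numpy as np

position = np.array([0, 19])
newposition = position            # binds a second name to the same array, no copy is made

newposition[0] += 1               # an in-place write through either name changes the one array

same_object = newposition is position
same_id = id(newposition) == id(position)
print(position, newposition, same_object, same_id)
```

Both names print the same values because there is only one array behind them.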

Why do position and newposition give the same output and update together in the next loop?

Because they are the same object. I'm not (only) saying they are equal, I'm saying that newposition is position, i.e. you currently have (newposition is position) is True.

Just define newposition independently of position. For example:

# [...]
for game in range(nr_of_games):
    # Initialize the player at the start position and store the current position in position
    position    = np.array([0,19])
    newposition = np.empty((2,), dtype=int)  # integer dtype so the values can be used as indices
    # [...]
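
One caveat worth checking: the loop in the question ends with position = newposition, which would rebind position to the preallocated buffer after the first step and make the two names aliases again. A sketch, assuming that line is kept, that copies the values instead of rebinding; the moves list is hypothetical and np.empty_like is just one way to get a matching integer buffer:

```python
import numpy as np

position = np.array([0, 19])
newposition = np.empty_like(position)          # separate buffer, same shape and dtype

moves = [np.array([1, 0]), np.array([0, 1])]   # hypothetical moves, for illustration only
history = []
for move in moves:
    newposition[0] = position[0] + move[0]
    newposition[1] = position[1] + move[1]
    history.append((position.copy(), newposition.copy()))
    position[:] = newposition                  # copy the values; the names stay distinct

still_distinct = newposition is not position
```

With position[:] = newposition the old and new positions differ inside every iteration, not only the first one.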

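Also, you may have good reasons for doing it this way, but keep in mind that if move and position have the same shape and convey the "same information", you could also do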

# [...]
    # [...]
        # [...]
        # newposition[0] = position[0] + move[0]
        # newposition[1] = position[1] + move[1]
        newposition = position + move
        # [...]

and drop the np.empty preallocation of newposition.
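
A short sketch of why this variant avoids the aliasing, assuming move is a NumPy array with the same shape as position (the value used here is made up):

```python
import numpy as np

position = np.array([0, 19])
move = np.array([1, 0])              # hypothetical move, same shape as position

newposition = position + move        # the addition allocates a fresh array
distinct = newposition is not position
position_unchanged = position.tolist() == [0, 19]

position = newposition               # rebinding is harmless here...
newposition = position + move        # ...because the next addition builds a new array again
```

Since every step creates a new array for newposition, the position = newposition line at the end of the loop no longer ties the two names together for the following iteration.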

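That happens because you are trying to copy one list into another with the = operator. Used with lists, = assigns the reference held by the right-hand variable to the left-hand one, so both names point to the same memory.

To actually copy a list, use the list.copy() method. NumPy arrays, as used in the question, offer the equivalent ndarray.copy() method.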
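
For illustration, a minimal sketch of both copy methods with made-up values:

```python
import numpy as np

a = [0, 19]
b = a.copy()          # shallow copy of the list: a new list object
b[0] = 5              # does not affect a

p = np.array([0, 19])
q = p.copy()          # ndarray.copy() returns an independent array
q[0] = 5              # does not affect p
```

After these writes a and p still hold their original values, while b and q hold the modified ones.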