Reinforcement Learning: Solving the Taxi Problem with Q-Learning
''' Solving the Taxi Problem with Q-Learning '''
"""
智能体必须在一个位置上接上乘客并在另一个位置放下乘客。
成功放下乘客,那么智能体将会得到奖励+20分,且每经过一
个时间步得到-1分。如果智能体错误搭载和放下,则会得到
-10分。因此,智能体的目标就是学习在最短时间内在正确的
位置搭载和放下乘客,且不会搭载非法乘客。
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
其中字母(R, G, Y, B)分别表示4个不同位置。
"""
import random
import gym
env = gym.make('Taxi-v3')  # named 'Taxi-v1' in older gym releases
env.render()  # print the taxi environment
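# As a sanity check, we can inspect the sizes of the spaces; for the
# standard Taxi layout there are 500 discrete states and 6 actions
# (south, north, east, west, pickup, dropoff).
print(env.observation_space.n)  # 500
print(env.action_space.n)       # 6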
# Now, we initialize the Q table as a dictionary that stores, for each
# state-action pair, the value of performing action a in state s.
q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0
# We define a function called update_q_table which will update the Q values
# according to the Q-learning update rule.
# In the function below, we take the maximum Q value over all actions in the
# next state and store it in a variable called qa, then we update the Q value
# of the previous state-action pair using the update rule.
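# For reference, the update applied below is the standard one-step
# Q-learning rule:
#     Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# where s' is the next state, r is the immediate reward, alpha is the
# learning rate, and gamma is the discount factor.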
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
# Then, we define a function that implements the epsilon-greedy policy. With the epsilon-greedy
# policy, we either explore a random action with probability epsilon or select the best-known
# action with probability 1-epsilon.
def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])
# Now we initialize necessary variables
# alpha - TD learning rate
# gamma - discount factor
# epsilon - epsilon value in epsilon greedy policy
alpha = 0.4
gamma = 0.999
epsilon = 0.017
# Now, let us perform Q-learning!
for i in range(8000):
    r = 0
    prev_state = env.reset()
    while True:
        # Rendering every step slows training; comment this out for speed
        env.render()
        # In each state, we select an action with the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # then we perform the action, move to the next state, and receive the reward
        nextstate, reward, done, _ = env.step(action)
        # Next we update the Q value using our update_q_table function,
        # which applies the Q-learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Finally we update the previous state to be the next state
        prev_state = nextstate
        # Accumulate the rewards obtained in this episode
        r += reward
        # We break out of the loop once we reach the episode's terminal state
        if done:
            break
    print("total reward: ", r)
env.close()