Reinforcement Learning: Solving the Taxi Problem with Q-Learning
''' Solving the Taxi Problem with Q-Learning '''
"""
智能体必须在一个位置上接上乘客并在另一个位置放下乘客。
成功放下乘客,那么智能体将会得到奖励+20分,且每经过一
个时间步得到-1分。如果智能体错误搭载和放下,则会得到
-10分。因此,智能体的目标就是学习在最短时间内在正确的
位置搭载和放下乘客,且不会搭载非法乘客。
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
其中字母(R, G, Y, B)分别表示4个不同位置。
"""
import random
import gym
env = gym.make('Taxi-v3')  # named 'Taxi-v1' in older gym releases
env.render()  # print the taxi environment
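# As a sanity check, we can inspect the sizes of the spaces; for the
# standard Taxi layout there are 500 discrete states and 6 actions
# (south, north, east, west, pickup, dropoff).
print(env.observation_space.n)  # 500
print(env.action_space.n)       # 6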
# Now, we initialize the Q table as a dictionary that stores, for each
# state-action pair, the value of performing action a in state s.
q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0
# We define a function called update_q_table which will update the Q values
# according to the Q-learning update rule.
# In the function below, we take the maximum Q value over all actions in the
# next state and store it in a variable called qa, then we update the Q value
# of the previous state-action pair using the update rule.
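# For reference, the update applied below is the standard one-step
# Q-learning rule:
#     Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# where s' is the next state, r is the immediate reward, alpha is the
# learning rate, and gamma is the discount factor.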
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
# Then, we define a function that implements the epsilon-greedy policy. With the epsilon-greedy
# policy, we either explore a random action with probability epsilon or select the best-known
# action with probability 1-epsilon.
def epsilon_greedy_policy(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])
# Now we initialize necessary variables
# alpha - TD learning rate
# gamma - discount factor
# epsilon - epsilon value in epsilon greedy policy
alpha = 0.4
gamma = 0.999
epsilon = 0.017
# Now, let us perform Q-learning!
for i in range(8000):
    r = 0
    prev_state = env.reset()
    while True:
        # Rendering every step slows training; comment this out for speed
        env.render()
        # In each state, we select an action with the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        # then we perform the action, move to the next state, and receive the reward
        nextstate, reward, done, _ = env.step(action)
        # Next we update the Q value using our update_q_table function,
        # which applies the Q-learning update rule
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        # Finally we update the previous state to be the next state
        prev_state = nextstate
        # Accumulate the rewards obtained in this episode
        r += reward
        # We break out of the loop once we reach the episode's terminal state
        if done:
            break
    print("total reward: ", r)
env.close()