Quick Introduction

What is Q-Learning?

The idea of Q-learning is to have "Q values" for every action that you can possibly take in a given state. Over time, we want to update those Q values in such a way that running through a chain of actions produces a good result. This is done by rewarding some kind of agent as it goes through the dedicated environment, and the idea is to reward the agent for the long-term goal rather than for any immediate, short-term gain.
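
As a toy illustration (the states, action names and numbers here are made up, not from any real environment), you can picture the Q values as a lookup from a state to one number per action:

# a made-up illustration of q-values: one number per action for each state
q_values = {
    "state_A": {"left": -1.3, "do_nothing": -0.8, "right": 0.4},
    "state_B": {"left": 0.1, "do_nothing": -1.1, "right": -0.6},
}
# in "state_A" the agent would pick the action with the largest q-value
best_action = max(q_values["state_A"], key=q_values["state_A"].get)
print(best_action)  # prints: right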

Q-learning specifically is called "model-free" learning. The idea is that the q-learning code we create is applicable to any environment: it is not environment specific, provided the environment is simple enough to be handled this way.

The AI doesn't really care what the environment is, but it is important for you to understand the environment so that you can understand how q-learning works.

Here we will be using the MountainCar environment. The goal of this environment is to get a small car up a hill by pushing it left and right to build up enough momentum to reach the top.

import numpy as np
# import the gym library which has ready made different environments
import gym
# create the environment which is called MountainCar
env = gym.make("MountainCar-v0")
# need to always reset the environment before every start
env.reset()

This environment has 3 available actions: 0 - push the car left, 1 - do nothing, 2 - push the car right. Normally you can pull this info from the gym environment and check what observations and actions are available using the code below:

print(env.observation_space.high) # all the high values of the observations possible
print(env.observation_space.low) # all the low values of the observations possible
print(env.action_space.n) # this will tell us how many actions we can take

First tests of our agent and environment

We don't know how the environment will react to various actions, so we will start off simple and tell the agent to push the car right at every step.

Every time we step, we get a new state from the environment. The "state" is what we are sensing from the environment, so basically what the agent experiences in the environment. In this case, the state is two values: position and velocity. But for the model, it doesn't really matter what these values actually are, as long as they are meaningful.

import numpy as np
# import the gym library which has ready made different environments
import gym
# create the environment which is called MountainCar
env = gym.make("MountainCar-v0")
# need to always reset the environment before every start
env.reset()
# we attempt to iterate through the environment
done = False
while not done:
    action = 2 # push the car right
    new_state, reward, done, _ = env.step(action) # get new state
    env.render() # show the state

env.close()

Running the above code, we can see that the agent controlling the environment keeps pushing the car to the right with no success. So, we want to find a better way to do this.

What we want to do is create a q-table: a large table with an entry for every combination of position and velocity. By looking up this table, we can find which action at a given position and velocity has the largest q-value (the value that tells us this is the best choice in that state).

The q-table is initialized with random values and the agent explores by choosing and trying out different actions from the table. The agent then updates the q-values based on the results it gets from trying a given combination.

import numpy as np
# import the gym library which has ready made different environments
import gym
# create the environment which is called MountainCar
env = gym.make("MountainCar-v0")
# need to always reset the environment before every start
env.reset()
# we attempt to iterate through the environment
done = False
while not done:
    action = 2 # push the car right
    new_state, reward, done, _ = env.step(action) # get new state
    print(new_state)
    env.render() # show the state

env.close()

[-0.45919555  0.00052383]
[-0.45815173  0.00104381]
[-0.45659563  0.0015561 ]
[-0.45453867  0.00205696]
[-0.45199597  0.0025427 ]
[-0.44898617  0.0030098 ]
[-0.4455313   0.00345487]
[-0.44165662  0.0038747 ]
[-0.4373903   0.00426631]
[-0.43276337  0.00462693]
[-0.42780933  0.00495406]
[-0.42256382  0.00524549]
[-0.41706455  0.00549927]
[-0.41135076  0.0057138 ]
[-0.40546298  0.00588776]
[-0.39944282  0.00602018]
[-0.39333242  0.00611039]
[-0.38717437  0.00615807]
[-0.38101116  0.00616321]
[-0.37488502  0.00612612]
[-0.36883762  0.0060474 ]
[-0.36290967  0.00592795]
[-0.35714075  0.00576892]
[-0.35156903  0.00557173]
[-0.346231    0.00533802]
...
[-0.32020026 -0.00275491]
[-0.32338774 -0.00318748]
[-0.32698816 -0.00360039]
[-0.3309791  -0.00399095]
[-0.3353356  -0.00435653]

In the above code, we print out the state after each step. We get A LOT of them because we get the position and velocity of the car at every step as it moves. The two values are printed with about 8 digits after the decimal place. Trying to store every single combination of two numbers at that precision is A LOT of combinations. This would mean our q-table of combinations would be huge; the table would probably not even fit in our computer's memory, and the amount of time to try every combination would take forever.
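
As a rough back-of-envelope estimate (just to show the scale): if each of the two values can take somewhere on the order of 10^7 to 10^8 distinct settings at this precision, the raw table would need on the order of 10^15 state combinations, each with 3 q-values; at 8 bytes per value that is tens of petabytes, far beyond any normal computer's memory.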

Preparing the size of our Q-table

Simplifying the number of possible combinations

Now we want to try to create our q-table. Because there are too many possible combinations, we want to convert our continuous values into discrete values and organize them into buckets (the number of buckets will need to be tweaked as we continue to experiment, but at least it keeps the table a manageable size). The bucket count usually is not hardcoded because it would change for each environment. For now, we will set a size so that the code is at least functional at the beginning.

The bucket layout will be defined as DISCRETE_OS_SIZE, a list containing 20 for each observation, where the number of observations is 2 for this environment (giving [20, 20]). We do this because different environments can have more observations, so this makes the code a little more flexible.

From the printed observation space we know that the position observation has a range from -1.2 to 0.6 and the velocity observation has a range from -0.07 to 0.07. We now want to take each of those ranges and split it into 20 chunks, buckets, or discrete values. To do this we take the difference between the highest value and the lowest value and divide it by our number of buckets, which is 20.
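
Concretely, that works out to (0.6 - (-1.2)) / 20 = 0.09 for the position buckets and (0.07 - (-0.07)) / 20 = 0.007 for the velocity buckets, which is exactly what the code below prints.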

import numpy as np
# import the gym library which has ready made different environments
import gym
# create the environment which is called MountainCar
env = gym.make("MountainCar-v0")
# need to always reset the environment before every start
env.reset()

print(env.observation_space.high) # all the high values of the observations possible
print(env.observation_space.low) # all the low values of the observations possible
print(env.action_space.n) # this will tell us how many actions we can take

# size of our bucket (for now), 
# size = 20 * length of observation space which is 2 giving us [20, 20]
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high) 
# here we want to divide the value ranges for position and velocity into 20 buckets
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

print(discrete_os_win_size)

# we attempt to iterate through the environment
done = False
while not done:
    action = 2 # push the car right
    new_state, reward, done, _ = env.step(action) # get new state
    #print(new_state)
    env.render() # show the state

env.close()
[0.6  0.07]
[-1.2  -0.07]
3
[0.09  0.007]

After creating our window sizes and dividing each range into 20 buckets, our position buckets are 0.09 wide and our velocity buckets are 0.007 wide. In different situations, you would sometimes have to tweak the number of buckets (the DISCRETE_OS_SIZE value, per observation) to find the values needed for your agent to work properly. Normally you would try to write scripts to make these values more dynamic and run even a few hundred episodes to tweak them.
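
As a rough sketch of how these window sizes could be used (the helper name get_discrete_state is my own choice here, not something defined so far), a continuous observation can be converted into bucket indices like this; it assumes env and discrete_os_win_size from the code above are still defined:

# a sketch: convert a continuous (position, velocity) observation into bucket indices
def get_discrete_state(state):
    # shift the values so they start at 0, then divide by the bucket widths
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    # cast to integers so the result can be used to index into a q-table
    return tuple(discrete_state.astype(int))

print(get_discrete_state(np.array([-0.5, 0.0])))  # a car near the bottom with zero velocity -> (7, 10)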

Creating the Q-table

Now we can start creating our q-table. Below is an example of what a q-table looks like. The values in the table will be updated as the agent tries out different actions. So in the beginning, the agent will be doing a lot of random actions until it starts getting some kind of reward. As the agent gets more rewards, those rewards are used to update the q-values so that it learns which sets of actions are good ones to use. Because of this, a high q-value will be assigned to the actions that gained a reward.

NOTE: The reward will always be -1 until the car reaches the flag. Once it reaches the flag and completes the task, it will get a reward of 0. That reward is then propagated back through the q-values of the actions that were taken to reach the flag, so the agent learns which actions were good to take.

[Image: example of a q-table]

Here we create our q-table and initialize it with random values. The size of the q-table is 20 x 20 x 3, so you can think of it as every possible combination of the discretized environment observations (position and velocity) x the 3 actions we can take (left, do nothing, right). Inside the table, each combination will have 3 random q-values assigned between -2 and 0. But, over time, through the agent's exploration and exploitation, these values should slowly get tweaked and optimized (based on the Q-learning function).
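
For reference, the "Q-learning function" mentioned here is the standard tabular update rule; below is a small, self-contained sketch of it (the LEARNING_RATE and DISCOUNT numbers are placeholders I chose, and would need tuning for a real run):

# a sketch of the standard tabular q-learning update rule (not the full training loop)
LEARNING_RATE = 0.1  # placeholder: how strongly new information overrides the old q-value
DISCOUNT = 0.95      # placeholder: how much future rewards count compared to immediate ones

def updated_q(current_q, max_future_q, reward):
    # blend the old q-value with the observed reward plus the discounted best future q-value
    return (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

# example: the old q-value of -1.0 moves slightly towards the target of reward + discounted future value
print(updated_q(current_q=-1.0, max_future_q=-0.5, reward=-1))  # prints: -1.0475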

import gym
import numpy as np

# create the environment which is called MountainCar
env = gym.make("MountainCar-v0")
# need to always reset the environment before every start
env.reset()

print(env.observation_space.high) # all the high values of the observations possible
print(env.observation_space.low) # all the low values of the observations possible
print(env.action_space.n) # this will tell us how many actions we can take

# size of our bucket (for now), 
# size = 20 * length of observation space which is 2 giving us [20, 20]
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high) 
# here we want to divide the value ranges for position and velocity into 20 buckets
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

print(discrete_os_win_size)

# initialize the q-table of a size of 20 x 20 x 3 (every combination of position and velocity)
# assigning random values, which will also need to be tweaked
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n])) 
print(q_table.shape)

# we attempt to iterate through the environment
done = False
while not done:
    action = 2 # push the car right
    new_state, reward, done, _ = env.step(action) # get new state
    #print(new_state)
    env.render() # show the state

env.close()
[0.6  0.07]
[-1.2  -0.07]
3
[0.09  0.007]
(20, 20, 3)
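
To give a feel for where this is heading (the real training loop comes in the next part), here is a rough sketch of how the q-table could be consulted to choose actions instead of always pushing right. It reuses the hypothetical get_discrete_state helper sketched earlier, and since the q-values are still random, the car will not actually learn anything yet:

# a sketch only: choose each action by looking up the largest q-value for the current state
# assumes q_table, env and the get_discrete_state helper sketched earlier are defined
# (if the env above was already closed, it would need to be re-created with gym.make first)
discrete_state = get_discrete_state(env.reset())
done = False
while not done:
    action = np.argmax(q_table[discrete_state])  # action with the largest q-value in this state
    new_state, reward, done, _ = env.step(action)
    discrete_state = get_discrete_state(new_state)
    env.render()

env.close()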

Thanks for reading! In the next part we will continue building on our q-table with a more detailed analysis.
