October 2018

Volume 33 Number 10

Artificially Intelligent - Introduction to Reinforcement Learning

By Frank La Vigne | October 2018

In previous articles, I’ve mentioned both supervised learning and unsupervised learning algorithms. Beyond these two methods of machine learning lies another type: Reinforcement Learning (RL). Formally defined, RL is a computational approach to goal-oriented learning through interaction with an environment under ideal learning conditions.

Like other aspects of AI, many of the algorithms and approaches actively used today trace their origins back to the 1980s (bit.ly/2NZP177). With the advent of inexpensive storage and on-demand compute power, reinforcement learning techniques have re-emerged.

Arguably the most famous use of RL to date is the AlphaGo Zero program from DeepMind Technologies Ltd. Unlike AlphaGo, which learned from records of both amateur and professional human players, AlphaGo Zero had no access to human-based training datasets. Instead, the system learned entirely from playing games against itself. After several weeks, AlphaGo Zero achieved master-level status and, after 40 days, became the best Go player in the world. Keep in mind that it did all this with no human intervention. More information about the project is at bit.ly/2K9NpW8. RL has also been used to train algorithms to play classic arcade games and achieve the highest possible scores with little or no human participation.

In this article, I’ll discuss the foundations of RL and build upon this in future articles.

Core Components of an RL Model

RL focuses on training a model using a reward system, in a manner very similar to how animals and humans learn from experience—by trial and error. Because RL mimics human cognitive behavioral systems, there’s quite a lot of overlap with cognitive psychology. RL algorithms also draw on a fair bit of game-theory mathematics and strategy.

The first two elements in any RL model are the environment and the agent. Using the example of a game of chess, the environment is the game board and the agent is the player. The state of the environment is the placement of the pieces on the chessboard. The agent observes the state of the environment and applies an action to the environment to receive a reward (see Figure 1). In chess, this action could be moving a pawn forward or capturing an opponent’s piece. Any move automatically changes the state of the environment, and the agent may take any of the valid moves available to the player.

Figure 1 Essential Components to a Reinforcement Learning System

The agent receives a reward from the environment based on the action it took. The reward is a numerical value analogous to the score in a video game. Agents choose among the available options based on the anticipated reward, and they are programmed to maximize that reward. For example, moving a piece forward may yield a reward of 1, while capturing an enemy piece may yield a reward of 5. Given these choices, the algorithm will choose capturing an enemy piece over simply moving forward every time. Such a reward structure will yield an aggressive agent, as it will always choose to capture an opponent’s piece, even at the expense of losing the game. Clearly, this is not the optimal outcome. Therefore, the accuracy of the environment and the design of the reward structure are critical to the success of a model trained via RL.
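To make the point concrete, here is a minimal sketch of that greedy choice; the action names and reward values are illustrative assumptions, not part of an actual chess engine:

# Hypothetical immediate rewards for two available moves
rewards = {"advance_pawn": 1, "capture_piece": 5}

# An agent maximizing immediate reward always picks the highest-valued action
best_action = max(rewards, key=rewards.get)
print(best_action)  # capture_piece, every time, regardless of the long-term cost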

A Smaller Problem Space

Building a model that captures the complexity of chess to train an agent falls far outside the scope of this column. For simplicity, I will reduce the environment down to a 4x4 grid. The agent will control the movement of a character in this grid. Some cells of the grid are walkable, while others cause the agent to fall into the water, killing the character and resetting the simulation. The agent has a goal cell to reach and will be rewarded with a score of 1 for finding a safe path to it. Moving around the board returns a reward value of 0. For this example, I’ll use the Frozen Lake environment from OpenAI Gym (bit.ly/2v1csWA).

To get started, open a Python 3 Jupyter notebook on your preferred platform. I covered how to get started with Jupyter notebooks in an earlier column (msdn.com/magazine/mt829269). Create an empty cell, then enter and execute the following code to import the appropriate libraries and set up the environment in which the agent will train:

import gym
import numpy as np
env = gym.make('FrozenLake-v0')
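If you want to see the grid itself, the environment can render a text map of the lake. This is an optional step that isn’t required for anything that follows; the S, F, H and G characters mark the start, frozen (walkable), hole and goal cells, respectively:

env.reset()   # place the agent back at the start cell
env.render()  # prints the 4x4 map and highlights the agent's current position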

Now, create a new cell, enter the following code, and then execute:

print("observation_space")
print(env.observation_space)
print("action_space")
print(env.action_space)

The result should indicate that there are 16 discrete values in the observation space, just what you’d expect in a 4x4 grid. It also shows that there are four discrete values in the action space, meaning that the agent can move up, down, left or right.
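Concretely, the printed output should resemble the following:

observation_space
Discrete(16)
action_space
Discrete(4)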

A Random Approach

Now that the environment is set up, it’s time to start interacting with it. Add the following code to a new cell below the current one:

env.reset()
done = False
while not done:
  action = env.action_space.sample()
  observation, reward, done, info = env.step(action)
  print(action, observation, reward, done, info)

The output should look like the following:

2 4 0.0 False {'prob': 0.3333333333333333}
3 4 0.0 False {'prob': 0.3333333333333333}
2 8 0.0 False {'prob': 0.3333333333333333}
0 4 0.0 False {'prob': 0.3333333333333333}
1 5 0.0 True {'prob': 0.3333333333333333}

This algorithm randomly explores the space by picking a direction from the action space at random. The step function accepts the action as a parameter and returns the observation, reward, done and info values. Observation, in this case, represents the location on the grid. Reward is the amount of reward the step generated. Done is a Boolean value indicating whether the simulation ended. In this example, it means that the player has fallen through the ice and the simulation needs to restart. The final return value, info, is primarily used for diagnostic purposes. Because zero reward points were earned, the agent didn’t reach the goal.

Of course, it’s highly unlikely that the agent will just randomly come across the solution. Perhaps trying multiple times will yield a successful outcome. The following code, which should be entered and executed, will run the simulation 100 times:

print("starting")
for i in range(100):
    env.reset()
    done = False
    while done == False:
        action = env.action_space.sample()
      observation, reward, done, info = env.step(action)
      if(reward > 0):
        print("success")
        print(reward)
print ("done")

This, too, is unlikely to yield a successful outcome. Alter the code so that the simulation runs 1,000 times and run it again. With 1,000 runs, there should be around 10 successful runs, give or take. Statistically speaking, this means that a random approach to the problem works about 1 percent of the time. Clearly, the random approach alone will not suffice.
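If you’d rather measure the success rate directly than count the printed successes by eye, a minimal sketch along these lines tallies the wins over 1,000 random episodes (the variable names here are my own):

successes = 0
episodes = 1000
for i in range(episodes):
    env.reset()
    done = False
    while not done:
        observation, reward, done, info = env.step(env.action_space.sample())
    if reward > 0:
        successes += 1
print("success rate:", successes / episodes)  # typically somewhere around 0.01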

Introducing Q-Learning

As you’ve seen, the random “try and fail” approach yields terrible results. Or does it? In higher-order animals, memory plays a crucial role in learning. A try-and-fail approach will be far more successful if the results from previous attempts are captured. The previous code starts anew every iteration, effectively relying on chance for success.

One particularly useful approach is Q-Learning. The Q-Learning method learns a strategy that guides the agent on which action to take under which circumstances. In its simplest configuration, the Q-Learning algorithm may be implemented as a table, called a Q-Table. In the Frozen Lake environment, there are 16 possible states, representing a location on the 4x4 grid, and four possible actions, one for each direction. This results in a 16x4 table of Q-values. Each row of the table has four columns; the highest value in a row represents the best action to take from that state. In other words, the table contains a row for every possible state, with columns indicating which direction has the highest likelihood of success.
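To picture the data structure before running the training code, here’s a minimal sketch of the empty Q-Table for this environment and the greedy lookup it supports; Figure 2 shows how the real values are learned:

# One row per state (16), one column per action (4), all initialized to zero
Qtable = np.zeros([env.observation_space.n, env.action_space.n])
print(Qtable.shape)  # (16, 4)

# For a given state, the best-known action is the column with the highest value
state = 0
best_action = np.argmax(Qtable[state, :])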

Enter the code in Figure 2 into a blank cell and execute (it should take a moment to run).

Figure 2 Implementing the Q-Table

# Initialize the Q-Table with one row per state and one column per action
Qtable = np.zeros([env.observation_space.n, env.action_space.n])
lr = .8              # learning rate
gamma = .95          # discount factor for future rewards
num_episodes = 2000
rList = []           # total reward earned in each episode
for i in range(num_episodes):
  s = env.reset()    # start a new episode and get the initial state
  rAll = 0
  d = False
  j = 0
  while j < 99:      # cap each episode at 99 steps
    j += 1
    # Choose the action with the highest Q-value, plus noise that shrinks over time
    a = np.argmax(Qtable[s,:] + np.random.randn(1, env.action_space.n)*(1./(i+1)))
    s1, r, d, _ = env.step(a)
    # Update the Q-value for this state/action pair (the Bellman Equation)
    Qtable[s,a] = Qtable[s,a] + lr*(r + gamma*np.max(Qtable[s1,:]) - Qtable[s,a])
    rAll += r
    s = s1
    if d:
      break
  rList.append(rAll)

To create a Q-Table, first create a table with 16 rows and four columns. Recall that there will be one row for every possible state and one column for every possible action. The Q-Table is initialized with a value of zero in every cell, a task made easy using the zeros method. As the agent explores the environment and observes the results, the Q-Table gets updated. At each step of the simulation, the code selects the action (column) with the highest value for the given state (row), observes the reward that action earns and then updates the table accordingly. The more often the code runs the simulation, the more experience is stored in the Q-Table. Eventually, the values converge toward their optimal values, resulting in a more accurate model. The code in Figure 2 runs the simulation 2,000 times.

While much of the code is fairly straightforward and matches the description just provided, you may find the following line from the code snippet somewhat puzzling:

Qtable[s,a] = Qtable[s,a] + lr*(r + gamma*np.max(Qtable[s1,:]) - Qtable[s,a])

This code implements the Bellman Equation and stores the results in the Q-Table. A full discussion of the Bellman Equation falls outside of the scope of this column. For now, understand that the Bellman Equation represents the rewards associated with a sequence of actions. A complete walk-through of this equation may be found on YouTube at bit.ly/2n1eEc5.
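Mapping the standard Q-Learning update onto the variable names used in the code, the line can be read as:

Q(s, a) ← Q(s, a) + lr × [ r + gamma × max Q(s1, a′) − Q(s, a) ]

In plain terms, the old estimate for taking action a in state s is nudged toward the reward just received plus the discounted value of the best action available from the next state s1. The learning rate lr controls the size of the nudge, and gamma controls how heavily future rewards are weighted.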

Enter the following code in a new cell and execute it:

print( "score: " +  str(sum(rList)/num_episodes) )

The resulting score is an average of the rewards earned over 2,000 iterations and roughly equates to accuracy. This score should fall within a range of 0.42 to 0.54, meaning that the algorithm successfully navigates its way through the environment somewhere between 42 percent and 54 percent of the time. Not perfect, but certainly better than the 1 percent success rate produced by random guessing with no memory of past actions.

Examining the Q-Table More Closely

Finally, it may help to take a closer look at the Q-Table. Enter “Qtable” into an empty cell and execute it. The output will be a 16x4 table of numbers between zero and one. Notice that some rows are entirely filled with zeros. Enter “Qtable[1,:]” into an empty cell and execute the code. The output will look like this:

array([0.00081276, 0.00239483, 0.00018525, 0.15080315])

This result means that the agent learned that the fourth action (going up) is the most likely path to a reward when the environment is in state one. Rows filled entirely with zeros indicate that the algorithm never updated those values, leaving them at their defaults. That happens for the states where the episode ends, the cells where the agent either fell into a hole or reached the goal, because the agent never takes another action from those states.
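As one last sanity check, you can translate each row into a human-readable move. In the Frozen Lake environment, the four action indices correspond to left, down, right and up, in that order, so a short sketch like the following prints the best learned move for every state (all-zero rows simply report the first action by default):

actions = ["Left", "Down", "Right", "Up"]  # Frozen Lake's action indices 0-3
for state in range(env.observation_space.n):
    best = np.argmax(Qtable[state, :])
    print(state, actions[best], Qtable[state, best])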

Wrapping Up

In this article, I explored the key components of RL and two methods to explore an environment. The agent learned how to navigate an ice field with no human guidance. Everything the agent did, it learned on its own. In both scenarios, the agent explored its environment randomly. In the second example, the agent used a Q-Table, or a look-up table, that provided guidance on what move to make based on the current state of the environment. This gave the agent a sort of memory where it could learn from past decisions. As you saw, the result was a significant boost in success rate.

RL is among the most intriguing spaces in artificial intelligence right now. As AlphaGo Zero has demonstrated, human input is no longer an essential component for machine learning. While the potential applications for this technology are many, there are significant challenges to overcome before using it in any real-world project.

Special thanks to Arthur Juliani and his blog post, “Simple Reinforcement Learning with Tensorflow Part 0: Q-Learning with Tables and Neural Networks” (bit.ly/2OxySXQ), which provided an example of the Bellman equation implemented in Python.


Frank La Vigne works at Microsoft as an AI Technology Solutions Professional, where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

Thanks to the following Microsoft technical expert for reviewing this article: Andy Leonard

