Reinforcement Learning with TensorFlow

The Q-learning approach to reinforcement learning

Q-learning attempts to learn the value Q(s,a) of taking a specific action, a, when the agent is in a particular state, s. Consider a table in which each row corresponds to a state and each column corresponds to an action. This is called a Q-table. Learning the values in this table tells us which action is best for the agent in a given state.
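As a small illustration of the table layout, the sketch below stores a Q-table as a list of rows (states) whose entries are per-action values, and looks up the best action for a state. The table contents and the helper name best_action are made up for this example.

```python
# Hypothetical toy problem: 3 states (rows) and 2 actions (columns).
example_q_table = [
    [0.0, 0.5],   # state 0: action 1 currently has the higher value
    [0.8, 0.2],   # state 1: action 0 currently has the higher value
    [0.1, 0.1],   # state 2: tie; max() returns the first action
]

def best_action(q_table, state):
    """Return the index of the action with the highest Q-value in `state`."""
    row = q_table[state]
    return max(range(len(row)), key=lambda action: row[action])

best_actions = [best_action(example_q_table, s) for s in range(len(example_q_table))]
print(best_actions)
```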

Steps involved in Q-learning:

  1. Initialize the table of Q(s,a) with uniform values (say, all zeros).

  2. Observe the current state, s.

  3. Choose an action, a, using epsilon-greedy or another action-selection policy, and take the action.

  4. As a result, a reward, r, is received and a new state, s', is perceived.

  5. Update the Q value of the (s,a) pair in the table by using the following Bellman equation:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

where $\gamma$ is the discounting factor.

  6. Then, set the new state as the current state and repeat the process until the agent reaches the terminal state, which completes one episode.

  7. Run multiple episodes to train the agent.
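
The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the five-state chain environment, the constants, and the function names (step, choose_action, train_q_table) are all hypothetical and chosen only for illustration. The update sets Q(s,a) to the reward plus the discounted maximum Q-value of the new state, as described in the steps, which is adequate here because the toy environment is deterministic.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical layout)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (assumed value)
EPSILON = 0.2         # exploration rate for epsilon-greedy (assumed value)

def step(state, action):
    """Toy deterministic environment: returns (reward, next_state, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return reward, next_state, done

def choose_action(q_table, state, rng):
    """Epsilon-greedy selection, breaking ties between equal values at random."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q_table[state])
    return rng.choice([a for a in ACTIONS if q_table[state][a] == best])

def train_q_table(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s,a) to zeros for every state-action pair.
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):                  # run multiple episodes
        state = 0                              # observe the starting state
        for _ in range(max_steps):             # cap the episode length
            action = choose_action(q_table, state, rng)
            reward, next_state, done = step(state, action)
            # Q(s,a) = r + gamma * max over a' of Q(s',a')
            q_table[state][action] = reward + GAMMA * max(q_table[next_state])
            state = next_state                 # the new state becomes current
            if done:
                break
    return q_table

learned_q = train_q_table()
greedy_policy = [max(ACTIONS, key=lambda a: learned_q[s][a])
                 for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this toy chain the only reward sits at the right end, so the learned greedy policy should move right from every non-terminal state. In environments with random transitions or rewards, implementations commonly blend the old and new estimates with a learning rate instead of overwriting the entry.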

To simplify, the Q-value for a given state, s, and action, a, is updated to the sum of the current reward, r, and the maximum Q-value over all actions in the new state, s', discounted by $\gamma$. The discount factor makes rewards in the future count for less than rewards in the present. For example, a reward of 100 today is worth more than a reward of 100 in the future; equivalently, a reward of 100 in the future is worth less than 100 today. Therefore, we discount future rewards. Repeating this update process continuously results in the Q-table values converging to accurate measures of the expected future reward for a given action in a given state.
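
To make the effect of discounting concrete, the short sketch below weights a reward received k steps in the future by the discount factor raised to the power k. The function name discounted_return and the discount factor of 0.9 are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total

# The same reward of 100, received now versus two steps in the future.
value_now = discounted_return([100], 0.9)
value_later = discounted_return([0, 0, 100], 0.9)
print(value_now, value_later)
```

With a discount factor below 1, the delayed reward contributes less than the immediate one, which is the sense in which 100 in the future is worth less than 100 today.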

When the state and action spaces grow, maintaining a Q-table becomes difficult. In real-world problems, state spaces are often extremely large or even continuous. Thus, we need another approach that can produce Q(s,a) without a Q-table. One solution is to replace the Q-table with a function. This function takes the state as input in the form of a vector and outputs a vector of Q-values, one for each action in the given state. This function approximator can be represented by a neural network that predicts the Q-values. We can then add more layers and fit a deep neural network for better prediction of Q-values when the state and action spaces become large, which is impractical with a Q-table. This gives rise to the Q-network, and if a deeper neural network, such as a convolutional neural network, is used, the result is a deep Q-network (DQN).
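
The idea of a function that maps a state vector to a vector of Q-values can be sketched with a tiny one-hidden-layer network written in plain Python. This is only a forward pass with random, untrained weights; the layer sizes, the tanh activation, the example state, and the helper names (init_layer, dense, q_network) are assumptions for illustration, and a real Q-network would be built and trained with a library such as TensorFlow.

```python
import math
import random

def init_layer(n_inputs, n_outputs, rng):
    """Random weights (n_outputs x n_inputs) and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases

def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def q_network(state_vector, params):
    """Map a state vector to a vector of Q-values, one per action."""
    (w1, b1), (w2, b2) = params
    hidden = [math.tanh(h) for h in dense(state_vector, w1, b1)]
    return dense(hidden, w2, b2)

net_rng = random.Random(0)
STATE_DIM, HIDDEN_DIM, N_ACTIONS = 4, 8, 2     # hypothetical sizes
params = (init_layer(STATE_DIM, HIDDEN_DIM, net_rng),
          init_layer(HIDDEN_DIM, N_ACTIONS, net_rng))

state_vector = [0.1, -0.2, 0.05, 0.3]          # made-up state
q_values = q_network(state_vector, params)
greedy_action = max(range(N_ACTIONS), key=lambda a: q_values[a])
print(q_values, greedy_action)
```

Unlike a table lookup, the same parameters produce Q-values for any state vector of the right length, including states never seen before, which is what makes this approach usable when the state space is too large to enumerate.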

More details on Q-learning and deep Q-networks will be covered in Chapter 5, Q-Learning and Deep Q-Networks.