
Deriving the Bellman equation for value and Q functions
Now let us see how to derive the Bellman equations for the value and Q functions.
You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.
First, we define $\mathcal{P}_{ss'}^{a}$ as the transition probability of moving from state $s$ to state $s'$ while performing an action $a$:

$$\mathcal{P}_{ss'}^{a} = \Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right) \tag{1}$$
We define $\mathcal{R}_{ss'}^{a}$ as the expected reward received by moving from state $s$ to state $s'$ while performing an action $a$:

$$\mathcal{R}_{ss'}^{a} = \mathbb{E}\left[r_{t+1} \mid s_t = s, s_{t+1} = s', a_t = a\right] \tag{2}$$
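To make these two quantities concrete, here is a minimal sketch of how they might be stored for a toy MDP with two states and two actions; the array names `P` and `R`, the layout `P[a, s, s']`, and all the numbers are made up purely for illustration.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions, purely for illustration.
# P[a, s, s'] = probability of landing in s' when taking action a in state s.
# R[a, s, s'] = expected reward received for that transition.
P = np.array([
    [[0.9, 0.1],   # action 0 taken in state 0
     [0.2, 0.8]],  # action 0 taken in state 1
    [[0.5, 0.5],   # action 1 taken in state 0
     [0.0, 1.0]],  # action 1 taken in state 1
])
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.5, 0.5],
     [0.0, 1.5]],
])

# Each row of P must be a valid probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```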
We know that the value function can be represented as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] \tag{3}$$

Expanding the return $R_t$ as the discounted sum of future rewards, this becomes:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right] \tag{4}$$
We can rewrite our value function by taking the first reward out of the sum:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] \tag{5}$$
The expectation in the value function specifies the expected return we get when we are in state $s$ and pick our actions according to the policy $\pi$. So, we can rewrite the expectation of the first reward explicitly, by summing over all possible actions and next states, as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right) \mathbb{E}\left[r_{t+1} \mid s_t = s, s_{t+1} = s', a_t = a\right]$$
On the RHS, we replace the transition probability with $\mathcal{P}_{ss'}^{a}$ from equation (1) as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\, \mathbb{E}\left[r_{t+1} \mid s_t = s, s_{t+1} = s', a_t = a\right]$$
Similarly, we replace the conditional expectation of $r_{t+1}$ with $\mathcal{R}_{ss'}^{a}$ from equation (2) as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\, \mathcal{R}_{ss'}^{a}$$
The expectation of the remaining discounted rewards expands in exactly the same way, except that, by the Markov property, it depends only on the next state $s'$. So, our final expectation equation becomes:

$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\left[\mathcal{R}_{ss'}^{a} + \gamma\, \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right] \tag{6}$$
Now we substitute our expectation (6) into the value function (5) as follows:

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\left[\mathcal{R}_{ss'}^{a} + \gamma\, \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right]$$
The remaining expectation, $\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]$, is just the value function of the next state $s'$ (equation (4) with $t$ shifted to $t+1$), so we can substitute $V^{\pi}(s')$ for it, and our final value function looks like the following:

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\left[\mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s')\right] \tag{7}$$
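Equation (7) translates directly into code: since $V^{\pi}$ appears on both sides, one common way to solve it is to apply the right-hand side repeatedly as an update until the values stop changing (iterative policy evaluation). The sketch below is only an illustration of that update; it assumes the hypothetical `P` and `R` arrays from the earlier sketch, plus a made-up policy table `pi[s, a]` giving the probability of taking action `a` in state `s`.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation:
    repeatedly apply V(s) <- sum_a pi[s, a] * sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # One backup term: expected immediate reward plus the
                # discounted value of the successor state, weighted by the policy.
                V_new[s] += pi[s, a] * np.sum(P[a, s] * (R[a, s] + gamma * V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example usage with the toy arrays from the earlier sketch and a uniform random policy:
# pi = np.full((2, 2), 0.5)   # pi[s, a]
# V = evaluate_policy(P, R, pi)
```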
In a very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:

$$Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{ss'}^{a}\left[\mathcal{R}_{ss'}^{a} + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a')\right] \tag{8}$$
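Because $\sum_{a'} \pi(s',a')\, Q^{\pi}(s',a') = V^{\pi}(s')$, equation (8) can also be read as $Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{ss'}^{a}\left[\mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s')\right]$. Here is a minimal sketch of that computation, again assuming the hypothetical `P`, `R`, and `evaluate_policy` from the previous examples.

```python
import numpy as np

def q_from_v(P, R, V, gamma=0.9):
    """Compute Q(s, a) = sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V[s']),
    i.e. the Bellman Q equation with V(s') standing in for sum_a' pi(s', a') Q(s', a')."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            # Expected immediate reward plus discounted value of the next state.
            Q[s, a] = np.sum(P[a, s] * (R[a, s] + gamma * V))
    return Q

# Example usage (continuing the hypothetical toy MDP):
# Q = q_from_v(P, R, evaluate_policy(P, R, pi))
```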
Now that we have Bellman equations for both the value and Q functions, we will see how to find the optimal policies.