update blogs and about
jxx123 committed Jul 14, 2024
1 parent 7888d62 commit cb80475
Showing 2 changed files with 209 additions and 1 deletion.
199 changes: 199 additions & 0 deletions _posts/2024-07-12-rl-notes.md
@@ -0,0 +1,199 @@
---
title: Reinforcement Learning Review Notes
tags: machine_learning, reinforcement_learning
date: 2024-07-12
categories: [Notes, MachineLearning]
math: true
mermaid: true
---

It has been about four years since I read through [Sutton's Reinforcement Learning textbook](http://incompleteideas.net/book/RLbook2020.pdf). It is time to review it, especially in this post-ChatGPT era, and see whether there are any "温故而知新" (revisiting the old to learn something new) moments.

# $k$-arm Bandit Problem

Choose an action among $k$ options; you then receive a numerical reward sampled from a stationary probability distribution that depends on the chosen action. The goal is to maximize the total reward over a certain time period.

- There are no state transitions at all; every step presents the same decision problem.

$\epsilon$-greedy: perform the greedy action (the action with the highest estimated value) most of the time, but select a random action with probability $\epsilon$.
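
A minimal sketch of an $\epsilon$-greedy agent on a $k$-armed bandit, using incremental sample-average value estimates; the reward distributions and all constants below are made up for illustration:

```python
import numpy as np

def epsilon_greedy_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, k)  # each arm's (unknown) stationary reward mean
    q = np.zeros(k)                       # estimated action values
    n = np.zeros(k)                       # pull counts per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))      # explore: random action
        else:
            a = int(np.argmax(q))         # exploit: greedy action
        r = rng.normal(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]         # incremental sample-average update
        total_reward += r
    return q, total_reward
```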

# Markov Decision Process (MDP)

```mermaid
flowchart
Environment -- State, Reward --> Agent
Agent -- Action --> Environment
```

Define State $S_t$, Action $A_t$, Reward $R_t$, then a _trajectory_ is

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, ...$$

A model for this MDP could be a probability distribution of next state $s'$ and reward $r$, given current state $s$ and action $a$:

$$p(s', r|s, a)$$
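
For a small tabular MDP, the model can be written down explicitly. A sketch (the states, actions, rewards, and probabilities below are purely illustrative) stores it as a dictionary; the later sketches in these notes reuse this format:

```python
# model[(s, a)] = list of (next_state, reward, probability) triples,
# i.e. an explicit tabular p(s', r | s, a). All values are illustrative.
model = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

# Sanity check: the outcome probabilities for each (s, a) sum to 1.
for (s, a), outcomes in model.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
```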

# Bellman Equation

We want to maximize the total future reward at time $t$; we call this total future reward the Return $G_t$.

$$G_t = R_{t+1} + R_{t+2} + ... $$

$G_t$ can go to infinity for a continuing task. To make it converge, let's introduce a discount factor $0 < \gamma < 1$.

$$G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}$$
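
As a quick sanity check that the discounted return stays finite: if every reward is $1$ and $\gamma = 0.9$, then

$$G_t = \sum_{k=0}^{\infty} 0.9^k = \frac{1}{1 - 0.9} = 10$$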

The value function of a state $s$ under a policy $\pi$:

$$v_\pi(s) = E_\pi[G_t|S_t=s]$$

$$ = E_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s]$$

$$ = E_{s' \sim P, a \sim \pi}[r(s, a) + \gamma v_\pi(s')]$$

$$ = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a) [r + \gamma v_\pi(s')]$$

The above is called the Bellman Equation.

Consider $q_\pi(s, a)$ as the expected return if the agent takes action $a$ in state $s$ and then follows policy $\pi$ in the subsequent steps.

We can also write down the Bellman Equation for $q_\pi(s, a)$.

$$q_\pi(s, a) = E_\pi [G_t | S_t = s, A_t = a]$$

$$ = E_{s' \sim P}[r(s, a) + \gamma E_{a' \sim \pi}[q_\pi(s', a')]]$$

$$ = \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

$$ = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a' | s') q_\pi(s', a')]$$

Optimal value function and Q function:

$$
\begin{aligned}
v^*(s) & = \max_\pi v_\pi(s) \\
& = \max_a q_{\pi^*}(s, a) \\
& = \max_a \sum_{s', r} p(s', r|s, a)[r + \gamma v^*(s')]
\end{aligned}
$$

$$q^*(s, a) = \max_\pi q_\pi(s, a)$$

$$ = E[R_{t+1} + \gamma v^*(S_{t+1}) | S_t = s, A_t = a]$$

$$ = E_{s' \sim P} [r(s, a) + \gamma \max_{a'} q^*(s', a')]$$

# Dynamic Programming

DP refers to a collection of algorithms that can be used to compute optimal policies given a **perfect** model of the environment.

## Policy Evaluation (Prediction)

Evaluate the state-value function $v_\pi$ for a given policy $\pi$, i.e. how good the policy $\pi$ is at each state.

$$
\begin{aligned}
v_\pi(s) & = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s] \\
& = \sum_a \pi(a | s) \sum_{s', r}p(s', r | s, a) [r + \gamma v_\pi(s')]
\end{aligned}
$$

Note that $\pi(a\|s)$ and $p(s', r \| s, a)$ are known.

We can start with a random guess of $v$ and keep updating it using the Bellman Equation as an update rule.

$$v_{k + 1}(s) = \sum_a \pi(a | s) \sum_{s', r}p(s', r | s, a) [r + \gamma v_k(s')]$$

$v_k$ can be proved to converge to $v_\pi$ by the fixed-point theorem.

Then we can use an iterative policy evaluation process to estimate $v$ by doing a _sweep_ of all the states (a code sketch follows the procedure below):

> For all the $s$ in the state space,
> start with a random guess of $v(s)$,
> keep updating $v$ using the above equation until it converges.
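
A minimal sketch of this procedure, assuming the illustrative tabular `model` dictionary from the earlier sketch (mapping `(s, a)` to `(next_state, reward, probability)` triples) and a stochastic policy stored as `policy[s][a]` probabilities; all names are my own:

```python
def policy_evaluation(states, actions, model, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation: apply the Bellman expectation backup
    until the value function changes by less than `theta` in a full sweep."""
    v = {s: 0.0 for s in states}          # arbitrary initial guess
    while True:
        delta = 0.0
        for s in states:                  # one sweep over all states
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2])
                                   for s2, r, p in model[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                  # in-place (Gauss-Seidel style) update
        if delta < theta:
            return v
```
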
## Policy Improvement

$$q_\pi(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]$$

Consider a new policy $\pi'$ that, at each state $s$, selects an action such that $q_\pi(s, \pi'(s)) \ge v_\pi(s)$. Then we can prove $v_{\pi'}(s) \ge q_\pi(s, \pi'(s)) \ge v_\pi(s)$.

Hence $\pi'$ is at least as good a policy as $\pi$.

It is a straightforward choice to go greedy, and we are guaranteed to get a new policy that is not worse than the previous one.

$$
\begin{aligned}
\pi'(s) & = \arg \max_a q_\pi(s, a) \\
& = \arg \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]
\end{aligned}
$$

If the policy stops improving, i.e. $v_{\pi'}(s) = v_\pi(s)$, then

$$\pi'(s) = \arg \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_{\pi'}(s')]$$

Then Bellman optimality is achieved, and $\pi'(s) = \pi^*(s)$.
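
A sketch of the greedy improvement step under the same illustrative `model` format, turning a value function into a deterministic greedy policy via a one-step lookahead:

```python
def greedy_policy(states, actions, model, v, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to v."""
    policy = {}
    for s in states:
        # One-step lookahead: q(s, a) = sum over (s', r) of p * (r + gamma * v(s')).
        q = {a: sum(p * (r + gamma * v[s2]) for s2, r, p in model[(s, a)])
             for a in actions}
        policy[s] = max(q, key=q.get)     # pick the action with the largest q(s, a)
    return policy
```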

## Policy Iteration
Starting with a random policy $\pi_0$, evaluate its value $v_{\pi_0}$, then improve the policy to $\pi_1$ by a greedy update, then evaluate again, and repeat until the policy no longer improves.

$$\pi_0 \rightarrow v_{\pi_0} \rightarrow \pi_1 \rightarrow v_{\pi_1} \rightarrow ... \rightarrow \pi^* \rightarrow v^*$$
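
Putting the two pieces together gives the policy-iteration loop. This sketch reuses the hypothetical `policy_evaluation` and `greedy_policy` helpers from the earlier sketches, wrapping the deterministic policy into the `policy[s][a]` probability format the evaluator expects:

```python
def policy_iteration(states, actions, model, gamma=0.9):
    pi = {s: actions[0] for s in states}  # arbitrary initial deterministic policy
    while True:
        # Express the deterministic policy as action probabilities for evaluation.
        stochastic = {s: {a: 1.0 if a == pi[s] else 0.0 for a in actions}
                      for s in states}
        v = policy_evaluation(states, actions, model, stochastic, gamma)
        new_pi = greedy_policy(states, actions, model, v, gamma)
        if new_pi == pi:                  # policy is stable: Bellman optimality reached
            return pi, v
        pi = new_pi
```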

## Value Iteration
In fact, we don't need to run policy evaluation to convergence: truncating it to a single sweep per improvement step still converges to $v^*$. This is called _value iteration_.

By combining the evaluation and improvement steps, we get

$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r|s, a) [r + \gamma v_k(s')]$$

Note that this update equation is the same as the Bellman optimality equation for the value function.
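
A corresponding sketch of value iteration under the same illustrative tabular `model` format; the only change from policy evaluation is that the expectation over the policy is replaced by a max over actions:

```python
def value_iteration(states, actions, model, gamma=0.9, theta=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of the one-step lookahead.
            new_v = max(sum(p * (r + gamma * v[s2]) for s2, r, p in model[(s, a)])
                        for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```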

## Async DP
Doing a sweep over all the states can be computationally intractable; some games have on the order of $10^{20}$ states.

We don't need to update the values of all states at each step. For example, we could update just one state, $s_k$, on each step $k$, using the value iteration equation above. We still converge to $v^*$ as long as **all states** occur in the sequence $\{s_k\}$ infinitely many times.
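
A sketch of one asynchronous variant: instead of sweeping all states, pick one random state per step and apply the same Bellman optimality backup (same illustrative `model` format as above):

```python
import random

def async_value_iteration(states, actions, model, gamma=0.9, num_updates=100_000):
    v = {s: 0.0 for s in states}
    for _ in range(num_updates):
        s = random.choice(states)  # every state keeps being selected with nonzero probability
        v[s] = max(sum(p * (r + gamma * v[s2]) for s2, r, p in model[(s, a)])
                   for a in actions)
    return v
```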

We can select the states to which we apply updates. **This is an interesting research area.**

We can run iterative DP at the same time that an agent is actually experiencing the MDP.

## Generalized Policy Iteration (GPI)

```mermaid
flowchart
Policy -- Evaluation --> Value
Value -- Improvement, greedy(V) --> Policy
```
Let the policy-evaluation and policy-improvement processes interact. Almost all reinforcement learning methods are well described as GPI.

In fact, these two processes compete and cooperate with each other. A greedy policy improvement makes the current value estimate inaccurate for the new policy, while making the value estimate consistent with the policy means the policy is no longer greedy with respect to it. Eventually they reach a single joint solution: the optimal value function and an optimal policy.

# Monte Carlo Methods

Now we do not assume complete knowledge of the environment, but we can learn from experience, i.e. sampled sequences of states, actions, and rewards from actual or simulated interaction with an environment.

## MC Prediction

Predict the value function of a policy $\pi$:
- Generate a large number of episodes by following $\pi$.
- Find the first visit (or every visit) to state $s$ in each episode, then compute and save the return from that point.
- Estimate $v_\pi(s)$ as the average of the saved returns (a sketch follows this list).
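
A minimal sketch of first-visit MC prediction, assuming each episode is given as a list of `(state, action, reward)` tuples where the reward is the one received after taking that action; the episode format and names are illustrative:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.9):
    """Estimate v_pi(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = []
        for _, _, r in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        first_visited = set()
        for (s, _, _), g_t in zip(episode, returns):
            if s not in first_visited:    # only the first visit in this episode counts
                first_visited.add(s)
                returns_sum[s] += g_t
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```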

## MC estimation of action-value function

We are actually more interested in estimating $q_\pi(s, a)$: since the model is unknown to us, knowing $v^*$ alone is not sufficient to derive the policy.

The problem is that many state-action pairs may never be visited. How do we **maintain exploration**?
- One way is to specify that episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. We call this assumption **exploring starts**.

## On-Policy vs. Off-Policy
On-Policy: evaluate or improve the same policy that is used to make decisions. To maintain exploration, the policy is made $\epsilon$-greedy rather than fully greedy.

Off-Policy: the policy we evaluate and improve is different from the one that generates the data. The policy being learned is the _target policy_ $\pi$, and the policy used to generate behavior is the _behavior policy_ $b$. This allows learning from data generated by a conventional non-learning controller or a human expert.

We require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. This is called the assumption of **coverage**.

### Importance Sampling
11 changes: 10 additions & 1 deletion _tabs/about.md
@@ -6,6 +6,15 @@ order: 4

My name is Jinyu Xie. I am a senior software engineer working on Google Search ranking algorithms, and a Ph.D. in Control Systems and Artificial Intelligence.

Here is my blog site to keep notes in case I forgot.
This is my blog site to keep my learning notes.

<!-- [My Google Scholar](https://scholar.google.com/citations?user=m-vkma0AAAAJ&hl=en&oi=ao). -->


I love building tools. Whenever I find that no existing tool fits my use case nicely, I start building my own. Here are some of the tools I have built:

- [loopquest.ai](https://www.loopquest.ai/), an MLOps web app for embodied AI evaluation. I use it heavily for my offline reinforcement learning workflows, and I host it on the cloud in case others would like to use it as well.
- [simglucose](https://github.com/jxx123/simglucose/tree/master), a Type-1 Diabetes simulator with an OpenAI Gym interface. It seems to be the most popular open-source Type-1 Diabetes simulator; see the [Google Scholar search results](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=simglucose&btnG=).
- [fireTS](https://github.com/jxx123/fireTS), a multivariate time series prediction library. There were no Python packages as usable as [the MATLAB one](https://www.mathworks.com/help/ident/ref/arx.html) for system identification and time series prediction, so I built one with similar APIs and took it a step further by plugging in different machine learning packages, e.g. `sklearn` and `pytorch`, as the core estimators.

