13-14. Exploration and Exploitation

Basics: Exploration vs Exploitation¶

Example: Mao Card Game

以 Mao 这个纸牌游戏为例：

事先主持人想好规则，但是不告知玩家
玩家出牌，如果符合规则，就可以出；如果不符合规则，那么就出不了，且接受惩罚（再摸一张牌）
先出完的获胜

那么，可以认为玩家本身身处一个 POMDP 之中，而且 training 和 testing 是 simultaneous 的。从而这就是一个 online POMDP 最优化：

如果 explore，那就 possibly stuck at sub-optimal action
如果 exploit，那就可能会遭受很多损失（i.e. 犯很多次规，摸很多张牌），但是同时发现 better actions

Often, in reality (large, infinite MDPs with continuous spaces), exploration is HARD.

HARD means it's not very possible to find the optimal exploration strategy

下面图表展示了理论上解出 optimal strategy 的难度：

下图是具体找出的方法：

MAB (Multi-Armed Bandits)¶

Definition¶

There are $N$ bandits, whose rewards are sampled from their own distributions, which are

fixed during the entire round (unlike adversarial bandits)
unknown beforehand, but subject to some known distribution

And here is the formal definition of "bandit":

\[ \begin{aligned} &\text{assume } r(a_i) \sim p_{\theta_i}(r_i) \newline &\quad \text{e.g. } p_{\theta_i}(r_i=1)=\theta_i \text{ and } p_{\theta_i}(r_i = 0) = 1 - \theta_i \newline &\theta_i \sim p(\theta), \text{but otherwise unknown} \newline \end{aligned} \]

实际上，这是一个 POMDP 问题。下面是最简单的 multi-armed bandit 模型对应的 POMDP 模型：

我们只有一个（连续）状态 $\mathbf s = [\theta_1, \theta_2, \dots, \theta_n]$
根据 $(a_0, o_1, a_1, \dots, o_{i})$，做出动作 $a_i$ 之后，根据 $\theta_i$ 抽样一个随机 reward $r_i$
根据 $a_i, r_i, \mathbf s$，我们得到 $o_{i+1}$

POMDP 图示

广义的 POMDP 如下图。

但是，对于我们的情况，稍有不同：

我们事先就知道 $s_0 = s_1 = \dots$
我们不知道 $o_0$
我们通过 $A_0$、$R_0$ 和 $s_1 (=s_0)$ 两个参数决定 $o_1$

我们可以通过 POMDP 来暴力解，但是杀鸡焉用牛刀，我们可以通过更加简单、高效的算法，得出理论上渐进最优的结果。

Regret and Its Analysis¶

Info

我们目前讨论的是 multi-armed bandit，参数 $\mathbf \theta$ 在从 $p(\theta)$ 采样之后，就是确定的。

因此，对于任意 $\mathbf \theta$，必然存在最优的动作：$a^\star$。

假设最好的动作是 $\alpha^\star$，那么 regret 定义为：

$$ \text{Reg}(T) = T \mathrm E[r(a^\star)] - \sum_{t=1}^T r(a_t) $$ 下面，介绍几种渐进最优的算法。

Optimistic Exploration¶

注意：这里的 regret 的大 O 里面，不包含 regret 的 bound 以及 bandits 的数量

Probability Matching / Posterior Sampling¶

流程：

在时间 $t$ 做动作之前时，我们通过之前几轮的信息，求出 $\theta_{1, \dots, n}$ 的 likelihood
- 即 $\hat p_t(\theta_1, \dots, \theta_n | p(\mathbf \theta), a_1, r_1, a_2, r_2, \dots, a_{t-1}, r_{t-1})$
在 $\hat p_t$ 中进行采样：$\theta_1^t, \theta_2^t, \dots, \theta_n^t \sim \hat p_t$
假装 $\theta_1^t, \theta_2^t, \dots, \theta_n^t$ 就是正确的模型，然后依照该模型求出”最优“的 $a_t$
$t = t+1$

Info

Posterior sampling 又称为 Thompson sampling

理论上难以分析
实践中很好用
广义上，posterior sampling 包含 a whole class of algorithms, which are very commonly studied both in bandits and deep RL

Information Gain¶

在 stochastic multi-armed bandits 的 setting 之下：

$$ \operatorname{IG}(z,y|a) = \mathbf E_y[\mathcal H(\hat p(z)) - \mathcal H(\hat p(z) | y) | a] $$ - 说人话就是：我们在当前 likelihood $\hat p_t$ 的情况下，做出动作 $a$，那么我们的 certainty 增加了多少？

其中一个算法如下所示：

解释： - $y = r_a$: 做出动作 $a$ 之后，从某个分布中获得奖励 $r_a$ - $z = \theta_a$: 做动作 $a$，只会改变 $\theta_a$ 的 likelihood，而不会对其它有影响，因此我们这里让 $z$ 为 $\theta_a$ - $\Delta(a)$: 使用当前的 $\theta_{1, \dots, n}$，估计出 $r(a^\star)$ 和 $r(a)$，然后进行比较

总结¶

Methods¶

注解 (General Principle of These Methods):

首先，我们在通过某些方式，引入一个 uncertainty
然后，我们在执行动作前，需要把这个 uncertainty 以某种方式加入到 $\mathop{\arg \max}$ 中

Why Do We Choose MAB?¶

Bandits are easier to analyze and understand
Can derive foundations for exploration methods
Then apply these methods to more complex MDPs

To Be Covered ...¶

Not covered here: - Contextual bandits (bandits with state, essentially 1-step MDPs) - Optimal exploration in small MDPs - Bayesian model-based reinforcement learning (similar to information gain) - Probably approximately correct (PAC)

Exploration in Deep RL¶

我们希望将 MAB 的方法，迁移到 Deep RL 上面去。

Pseudo-Counts and Bonus Functions¶

对于高维乃至连续的场景，使用场景计数

是 impractical 甚至 impossible 的
而且从直觉上来说也不对
- 难道一个场景，只要变了一个像素，就是另外一个场景了吗？

因此，我们用生成式模型来拟合概率分布。

Brief Introduction and Taxonomy of Generative Models

假设 $\theta$ 是生成式模型的参数：

\[ \theta^\ast = \mathop{\arg\max}_\theta \prod_{i=1}^N p_\theta(s_i) = \mathop{\arg\max}_\theta \sum_{i=1}^N \log p_\theta(s_i) \]

Taxonomy (图片右下角的文字与本课程无关)：

算法 pipeline 如下图：

估计 $\hat N, \hat n$ (Pseudo-Counts)

然后，至于 $\hat N, \hat n$ 如何算出来，我们这样做（具体算式见下图）：

利用 $p_\theta(s)$（由 $\mathcal D$ 拟合而来）和 $p_{\theta '}(s)$（由 $\mathcal D \cup \mathbf s_i$ 拟合而来）这两个概率的值
以及 $\hat N(s), \hat n$ 和 $p_\theta(s), p_{\theta'}(s)$ 之间的关系
联立方程，即可求出任意 $s$ 对应的 $\hat N(s)$
（然后，我们就可以求出 $r_i^+ = r_i + \mathcal B(\hat N(s))$，从而决定我们的动作）

使用什么 $\mathcal B$ (Bonus Functions)

使用什么生成模型

我们要使用可以直接得到 $p(s)$ 的模型，而不是只能从中采样的模型。详见上面的 taxonomy。

More Novelty-Seeking Exploration¶

Implementation: Counting with Hashes¶

使用 auto encoder 压缩，然后将压缩的向量通过 sgn 进行离散化。

Implementation: Implicit Density Modeling with Exemplar Models¶

\[ p_\theta(\mathbf s) = \frac {1 - D_{\mathbf s}(\mathbf s)} {D_{\mathbf s}(\mathbf s)} \]

我们使用非常简单的一个 classifier。目标是将数据分成两类：$\mathbf s$ 和非 $\mathbf s$。

直观上讲，如果 $\mathbf s$ 和其它数据 $\mathcal D$ 类似（i.e. 之前看过类似的），那么就难以分辨，从而 $D_{\mathbf s}(\mathbf s)$ 会比较小，从而概率比较大；反之亦然。

因此，直观上，就可以直接视为某种“概率”

另外，上面的 $p_\theta(\mathbf s)$ 的公式，也可以通过下面的思路，来 heuristically derived.

TODO¶

Posterior Sampling in Deep RL¶

由于 $Q$ learning 是 off-policy，因此使用任何一个 $Q$ 来实际上收集策略，都没有问题。

Implementation: Bootstrap¶

和 model-based RL 一样，我们使用 bootstrap 的方式来训练——使用 resample with replacement 的方式，得到 $N$ 个两两独立的数据集，然后一共训练 $N$ 个模型。之后，我们每一次 sample 的时候，都从这 $N$ 个模型中随机抽取一个。

为了减少模型参数，我们可以 trade-off 一下：采用下面的 shared-network and multi heads 的架构。

"heads" basically means that all layers are shared except for the last layer, and there are N copies of last layers, which correspond to the N networks, respectively.

Note

实际上，我们根本不需要 resample with replacement。即便我们用同样的数据来训练，由于 SGD 优化器的随机性太强，一开始这几个 head 之间几乎就是两两独立。

Question: Why do we use this¶

相比 $\epsilon$-greedy 而言：后者有时候的实际行为，就是某种程度上的随机游走。以 cliff-walking 为例，这样的随机游走，会导致最终无法收敛到最优策略上（因为在悬崖边，很容易掉下去，因此还不如绕远一点）。

但是，上面的 bootstrap 方法，虽然每个 Q 函数都不一样，but they are all reasonable and internally consistent (and tend to work well by the end, unlike random strategy).

也许我们会走不同的悬崖边缘路线，但是绝不会走下悬崖

相比 bonus 方法：

优点在于不用 fine-tuning
缺点在于实际效果比不上精选的 bonus 函数

In practice，这不常用——如果对 exploration 有要求的话，就用 bonus 就好。但是，总之这是研究热门。

Information Gain¶

First things first, 我们要确定是什么的 information gain：

reward $r(\mathbf{s, a})$: 对于稀疏奖励不好用
state density $p(\mathbf s)$: makes sense though strange
dynamics $p(\mathbf{s' | s, a})$: good proxy for learning the MDP, though still heuristic

State Density: Prediction Gain¶

\[ \log p_{\theta'} (\mathbf s) - \log p_\theta (\mathbf s) \]

更新前后的参数分别为 $\theta, \theta'$。如果差别大，那么就说明获取的信息多，从而 gain 更多。

具体如何计算，还是需要看论文。

Dynamics: Variational Inference¶

令 $\xi_t = \{s_1, a_1, \dots, s_t\}$。那么，在时间 $t$ 的时候，gain 就是：

\[ H(\Theta | \xi_t, a_t) - H(\Theta|S_{t+1}, \xi_t, a_t) = I(S_{t+1}; \Theta| \xi_t, a_t) = \mathbb E_{s_{t+1} \sim \mathcal P(\cdot | \xi_t, a_t)} [D_{\text{KL}} [p(\theta|\xi_t, a_t, s_{t+1}) \| p(\theta | \xi_t)]] \]

从而，the trade-off between exploitation and exploration can now be realized explicitly as follows:

\[ r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta D_{\text{KL}}[p(\theta|\xi_t, a_t, s_{t+1}) \| p(\theta | \xi_t)] \]

缺点就是，$p(\theta|\xi_t, a_t, s_{t+1})$ 没法直接算出来，只能用贝叶斯定理嵌套一层：

\[ p(\theta|\xi_t, a_t, s_{t+1}) = \frac {p(\theta|\xi_t) p(s_{t+1} | \xi_t, a_t; \theta)} {p(s_{t+1}|\xi_t, a_t)} \]

where $p(\theta | \xi_t, a_t) = p(\theta | \xi_t)$, as actions don't affect one's belief of environment.

其中，分母就是分子的积分：$p(s_{t+1}|\xi_t, a_t) = \int_{\Theta} p(\theta|\xi_t) p(s_{t+1} | \xi_t, a_t; \theta) \mathrm d\theta$，这是 computationally intractable 的。

因此，我们决定使用 variational inference，也就是使用 $q(\theta; \phi)$ 去拟合 $p(\theta | \mathcal D)$。

本质上就是用 BNN 替换了之前固定的 NN
$\theta$ 中的每一个参数，在 $q$ 中都是独立的，然后 $\theta_i \sim \mathcal N(\phi^\mu_i, \phi^\sigma_i)$

然后，因为我们希望最小化 $D_\text{KL}[q(\theta; \phi) \| p(\theta|\mathcal D)]$，因此就可以最大化：

\[ \begin{aligned} L[q(\theta; \phi), \mathcal D] &= \mathbb E_{\theta \sim p(\cdot | \phi)} [\log p(\mathcal D | \theta)] - D_\text{KL}[q(\theta; \phi) \| p(\theta)] \newline \end{aligned} \]

然后 $\log p(\mathcal D) = \mathbb E_{\theta \sim p(\cdot | \phi)} [\log p(\mathcal D | \theta)]$，因此直接在 $\theta$ 和 $\mathcal D$ 上进行抽样即可。后面的 $D_\text{KL}[q(\theta; \phi) \| p(\theta)]$，因为 $p(\theta)$ 是可以求出的（而 $p(\theta|\mathcal D)$ 是没法直接求出的。

算法如下，整体流程还是清晰的：

在内循环内，使用变分估计的 $q$ 来求出 $r'$
在外循环中，使用 $L[q(\theta; \phi), \mathcal D]$ 来拟合 $q$，同时优化 $\pi_\alpha$

Exploration with model errors¶

我们实际上只需要 $D_\text{KL}(q(\theta|\phi') \| q(\theta|\phi))$，根本不一定需要 information gain 这个东西。