Preserving and combining knowledge in robotic lifelong reinforcement learning

Paper

Introduction

Robotic lifelong learning

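For deep-learning-based approaches, the core difficulty is balancing the stability and plasticity of the neural network; the typical failure in this setting is catastrophic forgetting. Regularization, structural modularization, and experience replay can mitigate it, but these techniques have mostly been applied in conventional machine learning.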

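In deep reinforcement learning, a common approach is multi-task reinforcement learning (MTRL). In MTRL the agent has simultaneous access to multiple tasks, which sidesteps the forgetting problem inherent to neural networks. It has its own drawbacks, though: it depends on a predefined set of tasks and generalizes poorly to zero-shot samples.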

Inspired by the Dirichlet process mixture model (DPMM) and combined with the memorized variational Bayes inference method (memoVB), the approach achieves simultaneous inference and asynchronous knowledge preservation at the upstream level.

Method

The agent can continuously gain knowledge from a stream of tasks, each of which is fed only once.

The upstream module contains:

pretrained language embedding
task encoder
DPMM
generative module

Training process: offline

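1. An LLM combined with speech recognition produces pre-encoded language embeddings. This step speeds up training by eliminating computationally intensive real-time encoding.
2. The task state observation (end-effector position, object positions, and goal positions) is combined with the language embeddings and fed to the task inference encoder.
3. The resulting inference result is fitted to the knowledge space using the DPMM. Inferred results from the same task are clustered and stored in the same DPMM component. If the sample is new, a new component is created to store it. Knowledge preservation:
   1. Update the DPMM parameters using the $z_{i}$ produced by the task encoder.
   2. Fix the DPMM parameters and update the task encoder:
      1. Reconstruction loss: between $x_{i}$ and its reconstruction $x_{i}^{*}$
      2. KL divergence: between the task encoder distribution and the DPMM knowledge component
4. The generative module reconstructs the language embeddings and predicts the dynamics function of the current task, so that the parameter updates of upstream and downstream can be decoupled.
   1. Taking $z_{i}$ as input, reconstruct the language embedding tokens and compute a reconstruction loss against the original language embedding tokens.
   2. Use the currently observed state, the action, and $z_{t}$ to predict the next state, $p_{θ}(s_{t+1} \mid s_{t}, a_{t}, z_{t})$, and compute a reconstruction loss against the result of actually executing (simulating) the step.
5. Downstream, SAC is used as the policy learning module: the critics compute $Q(s_{t}, a_{t}, z_{t})$ and the actor provides the action $a_{t}$.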
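The two loss terms used when updating the task encoder (reconstruction loss plus KL divergence to a DPMM component) can be sketched as below. This is only an illustration under assumptions the note does not state: the encoder output and the DPMM component are both treated as diagonal Gaussians, the reconstruction loss is taken to be a mean squared error, and the names `mse`, `kl_diag_gaussians`, `encoder_loss`, and the weight `kl_weight` are hypothetical.

```python
import math

def mse(x, x_rec):
    """Mean squared error between an input x_i and its reconstruction x_i^*."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) ) for diagonal Gaussians.

    q plays the role of the task encoder distribution and p the DPMM
    component it is assigned to (both Gaussian by assumption).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

def encoder_loss(x, x_rec, mu_q, std_q, mu_p, std_p, kl_weight=1.0):
    """Reconstruction term plus weighted KL term; kl_weight is a hypothetical knob."""
    return mse(x, x_rec) + kl_weight * kl_diag_gaussians(mu_q, std_q, mu_p, std_p)

# Toy usage: a perfect reconstruction whose encoder mean is one unit
# away from the component mean, so only the KL term contributes.
loss = encoder_loss(
    x=[1.0, 2.0], x_rec=[1.0, 2.0],
    mu_q=[0.0], std_q=[1.0],
    mu_p=[1.0], std_p=[1.0],
)
```

While the DPMM parameters are held fixed, only the encoder side (`mu_q`, `std_q`, and the reconstruction) would change between gradient steps; the component parameters `mu_p`, `std_p` act as constants.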

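During deployment (inference), the online encoding method is used. Sim2Real and Real2Sim modules are used; together they cover safety control checks, coordinate transformation between the simulated and the real world, hand-eye calibration, and camera offset settings.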

Non-parametric knowledge space

Dirichlet process mixture model

Let $G$ be a random probability measure, $H$ a base probability distribution over the parameter space $Θ$, and $α$ the concentration parameter.

$G$ is assumed to be drawn from a Dirichlet process, written $G \sim DP(α, H)$. The stick-breaking construction is used to sample from $DP(α, H)$.
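A truncated version of the stick-breaking construction can be sketched as follows. Each step breaks off a fraction $β_{k} \sim Beta(1, α)$ of the remaining stick, and the piece broken off becomes the weight of an atom drawn from $H$. The truncation level `num_sticks`, the symbol $β_{k}$, and the standard normal used for $H$ are assumptions made for illustration only.

```python
import random

def stick_breaking(alpha, num_sticks, rng):
    """Return truncated stick-breaking weights and the unbroken remainder."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_sticks):
        beta_k = rng.betavariate(1.0, alpha)  # beta_k ~ Beta(1, alpha)
        weights.append(beta_k * remaining)    # weight of the k-th atom
        remaining *= 1.0 - beta_k
    return weights, remaining

def sample_G(alpha, num_sticks, base_sampler, rng):
    """Approximate draw G ~ DP(alpha, H) as a list of atoms with weights."""
    weights, leftover = stick_breaking(alpha, num_sticks, rng)
    atoms = [base_sampler(rng) for _ in range(num_sticks)]  # theta_k ~ H
    return atoms, weights, leftover

rng = random.Random(0)
atoms, weights, leftover = sample_G(
    alpha=2.0,
    num_sticks=50,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # H assumed standard normal
    rng=rng,
)
```

A smaller $α$ concentrates most of the mass on the first few atoms, while a larger $α$ spreads it over many; `leftover` is the mass ignored by the truncation.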

The DPMM is used to capture an infinite mixture of clusters from the observations $x = x_{1:N}$. The number of DPMM components is not fixed in advance; it is determined in an online fashion.
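To make "not fixed in advance" concrete, here is a deliberately simplified sketch in which a new component is created whenever a point is too far from every existing one. This is a hard-assignment, distance-threshold heuristic in the spirit of DP-means, not the memoVB inference used in the paper; `fit_online` and `new_cluster_dist` are hypothetical names.

```python
import math

def fit_online(points, new_cluster_dist):
    """Assign points one at a time, opening a new component when needed."""
    centers = []  # running mean of each component
    counts = []   # number of points assigned to each component
    labels = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if dists and min(dists) <= new_cluster_dist:
            k = dists.index(min(dists))
            counts[k] += 1
            # incremental mean update of the matched component
            centers[k] = [c + (xi - c) / counts[k] for c, xi in zip(centers[k], x)]
        else:
            k = len(centers)  # unseen region: create a new component
            centers.append(list(x))
            counts.append(1)
        labels.append(k)
    return labels, centers

points = [(0.0, 0.1), (0.2, -0.1), (5.0, 5.1), (0.1, 0.0), (5.2, 4.9), (-4.0, 4.0)]
labels, centers = fit_online(points, new_cluster_dist=1.5)
```

In this toy run the number of components grows from zero to three as the stream is consumed, without being specified beforehand.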

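In the DPMM, each data point is modeled as $x_{i} \sim F(θ_{i})$, where $θ_{i}$ is a latent variable drawn independently from the prior $G$. Because $G$ allows values of $θ_{i}$ to repeat, it introduces discreteness and clustering properties: data points drawn with the same $θ_{i}$ naturally form a cluster.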

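During fitting, each data point $x_{i}$ is associated with an assignment variable $v_{i}$ that allocates it to a cluster. $v_{i}$ is sampled as $v_{i} \sim Cat(π)$, where the mixing proportions $π$ can equivalently be expressed as a draw from the Griffiths-Engen-McCloskey (GEM) distribution, $π \sim GEM(α)$.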
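The generative reading of the assignment variable can be sketched as follows: draw $v_{i} \sim Cat(π)$, then draw $x_{i}$ from the component that $v_{i}$ selects. The fixed three-component `pi`, the one-dimensional Gaussian used for $F$, and the names `sample_assignments` and `generate` are assumptions for illustration; in the DPMM itself $π$ would come from $GEM(α)$ rather than being hand-picked.

```python
import random
from collections import defaultdict

def sample_assignments(pi, n, rng):
    """Draw v_i ~ Cat(pi) for i = 1..n."""
    return rng.choices(range(len(pi)), weights=pi, k=n)

def generate(pi, thetas, n, rng, noise_std=0.1):
    """Draw assignments, then x_i ~ F(theta_{v_i}) with F assumed Gaussian."""
    v = sample_assignments(pi, n, rng)
    x = [rng.gauss(thetas[k], noise_std) for k in v]
    return v, x

rng = random.Random(0)
pi = [0.6, 0.3, 0.1]        # hypothetical mixing proportions
thetas = [-5.0, 0.0, 5.0]   # hypothetical component parameters
v, x = generate(pi, thetas, 200, rng)

# Points that share the same v_i (hence the same theta) form one cluster.
clusters = defaultdict(list)
for v_i, x_i in zip(v, x):
    clusters[v_i].append(x_i)
```

Fitting runs this process in reverse: given only `x`, infer which `v_i` (and which component parameters) most plausibly generated each point.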
