What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?

Published: CoRL 2024

Paper

Motivation:

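When evaluated under visual distribution shift, models designed specifically for manipulation and control do not perform better than standard visually pre-trained models.
Emergent segmentation ability in ViT (Vision Transformer) models is a strong predictor of generalization.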

Environments, Evaluation Protocol, and Pre-trained Models

A policy is learned on top of a frozen pre-trained visual encoder. Lighting, textures, objects, and similar factors are then changed, and the policy is tested zero-shot.

Environment

Two test environments are used: FrankaKitchen and Meta-World.

Distribution Shift

Shifts in texture and lighting, plus added distractor objects.
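
To make the three shift types concrete, here is a rough image-space sketch. The notes do not say how the shifts are implemented, so this only mimics them on a raw frame; the function names, image size, and parameters are all hypothetical.

```python
import numpy as np

def shift_lighting(img, gain=0.5):
    """Scale pixel intensities to mimic a change in scene brightness."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def shift_texture(img, mask, texture):
    """Replace the pixels under a boolean mask (e.g. a table surface) with a new texture."""
    out = img.copy()
    out[mask] = texture[mask]
    return out

def add_distractor(img, top, left, size=8, color=(255, 0, 0)):
    """Paint a solid square into the frame as a stand-in for a distractor object."""
    out = img.copy()
    out[top:top + size, left:left + size] = color
    return out

# Synthetic 64x64 RGB frame standing in for a camera observation (assumption).
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Pretend the lower half of the frame is the surface whose texture changes.
table_mask = np.zeros((64, 64), dtype=bool)
table_mask[32:, :] = True
new_texture = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

darker = shift_lighting(frame, gain=0.5)
retextured = shift_texture(frame, table_mask, new_texture)
cluttered = add_distractor(frame, top=4, left=4)
```

In a zero-shot test, the policy would be trained only on frames like `frame` and then evaluated on the shifted variants without any further training.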

Policy Training

Imitation learning is used: the policy is trained to minimize the MSE between the policy action and the expert action.
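
A minimal sketch of this objective, assuming a linear policy head trained by gradient descent on features from a frozen encoder. The encoder here is a fixed random projection used purely as a stand-in, and the dimensions, learning rate, and synthetic demonstrations are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen pre-trained encoder: a fixed random projection.
W_enc = rng.normal(size=(32, 16)) / np.sqrt(32)

def encode(obs):
    """Frozen encoder: W_enc is never updated during policy training."""
    return np.tanh(obs @ W_enc)

# Synthetic demonstrations: observations and 4-D expert actions (assumed sizes).
obs = rng.normal(size=(256, 32))
feats = encode(obs)
W_expert = rng.normal(size=(16, 4))
expert_actions = feats @ W_expert

# Trainable policy head: a linear map from frozen features to actions.
W_pi = np.zeros((16, 4))

def mse(pred, target):
    """Mean squared error over all action dimensions and samples."""
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(feats @ W_pi, expert_actions)

lr = 0.5
for _ in range(500):
    pred = feats @ W_pi
    # Gradient of the MSE with respect to the policy head weights only.
    grad = 2.0 * feats.T @ (pred - expert_actions) / pred.size
    W_pi -= lr * grad

final_loss = mse(feats @ W_pi, expert_actions)
```

Only `W_pi` receives gradient updates; keeping the encoder fixed is what lets the comparison isolate the quality of the pre-trained representation.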

Models Pre-Trained for Manipulation

R3M and VIP are shown to outperform the baseline (ImageNet).

Other aspects of the data can have a larger effect than its scale.

Supervised ImageNet Models

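Features learned on ImageNet, even when frozen, are competitive with ground-truth state information across a variety of simulated control tasks.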

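Supervised training on Stylized ImageNet achieves a higher success rate within the training distribution than self-supervised training on ImageNet with a masked autoencoding loss.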