Policy Architectures
We provide three vision-language policy architectures for encoding spatial-temporal information in robot learning for LLDM (lifelong learning in decision making).


BCRNNPolicy (ResNet-LSTM)
(See Robomimic)
Visual information is encoded with a ResNet-style backbone, and the temporal information is then summarized by an LSTM. The sentence embedding of the task description is injected into the network through FiLM layers.
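A minimal PyTorch sketch of this design is shown below. It is illustrative rather than the exact implementation: the ResNet-18 backbone, the single FiLM layer applied to the last feature map, and all names (`FiLM`, `BCRNNPolicySketch`, `lang_dim`, `hidden_dim`, `action_dim`) are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FiLM(nn.Module):
    """Predicts a per-channel scale and shift from the language embedding."""
    def __init__(self, lang_dim, num_channels):
        super().__init__()
        self.proj = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, feat, lang_emb):
        gamma, beta = self.proj(lang_emb).chunk(2, dim=-1)
        # Broadcast the conditioning over the spatial dimensions.
        return gamma[..., None, None] * feat + beta[..., None, None]

class BCRNNPolicySketch(nn.Module):
    def __init__(self, lang_dim=768, hidden_dim=512, action_dim=7):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Keep everything up to (but excluding) the final pooling and fc head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.film = FiLM(lang_dim, 512)  # ResNet-18's last stage has 512 channels
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(512, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, images, lang_emb):
        # images: (B, T, 3, H, W); lang_emb: (B, lang_dim)
        B, T = images.shape[:2]
        feat = self.backbone(images.flatten(0, 1))        # (B*T, 512, h, w)
        feat = self.film(feat, lang_emb.repeat_interleave(T, dim=0))
        feat = self.pool(feat).flatten(1).view(B, T, -1)  # (B, T, 512)
        out, _ = self.lstm(feat)                          # summarize time with the LSTM
        return self.head(out)                             # (B, T, action_dim)
```

For example, `BCRNNPolicySketch()(torch.randn(2, 10, 3, 128, 128), torch.randn(2, 768))` yields a `(2, 10, 7)` tensor of per-step actions.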
BCTransformerPolicy (ResNet-Transformer)
(See VIOLA)
Visual information is encoded with a ResNet-style backbone; the encoded per-step visual representations are then used as tokens by a temporal transformer. The sentence embedding of the task description is injected into the network through FiLM layers.
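The sketch below illustrates the idea under similar assumptions: each frame's FiLM-conditioned ResNet features are pooled into a single token, and a causally masked temporal transformer runs over the token sequence. Names such as `BCTransformerPolicySketch`, `d_model`, and `max_len` are hypothetical, and the actual model may feed several visual tokens per timestep.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BCTransformerPolicySketch(nn.Module):
    def __init__(self, lang_dim=768, d_model=512, action_dim=7, max_len=64):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        self.film_proj = nn.Linear(lang_dim, 2 * 512)  # per-channel scale and shift
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.token_proj = nn.Linear(512, d_model)      # one visual token per timestep
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, images, lang_emb):
        # images: (B, T, 3, H, W); lang_emb: (B, lang_dim)
        B, T = images.shape[:2]
        feat = self.backbone(images.flatten(0, 1))     # (B*T, 512, h, w)
        gamma, beta = self.film_proj(
            lang_emb.repeat_interleave(T, dim=0)).chunk(2, dim=-1)
        feat = gamma[..., None, None] * feat + beta[..., None, None]  # FiLM
        tokens = self.token_proj(self.pool(feat).flatten(1)).view(B, T, -1)
        tokens = tokens + self.pos_emb[:, :T]
        # Causal mask: each timestep attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        out = self.temporal(tokens, mask=mask)         # (B, T, d_model)
        return self.head(out)                          # (B, T, action_dim)
```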
BCViLTPolicy (ViT-Transformer)
(See ViLT)
Visual information is encoded with a ViT-style architecture in which each image is split into patches; the temporal information is then summarized by a second (temporal) transformer. The sentence embedding of the task description is fed to the spatial ViT as an extra input token.
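A rough sketch under the same caveats: images are patchified with a strided convolution, the projected language embedding is prepended as one extra token to the spatial transformer, and that token's output is taken as the per-frame summary passed to the temporal transformer. All names (`BCViLTPolicySketch`, `patch`, `d_model`) are illustrative, and the choice of the language token's output as the frame summary is an assumption.

```python
import torch
import torch.nn as nn

class BCViLTPolicySketch(nn.Module):
    def __init__(self, img_size=128, patch=16, d_model=256,
                 lang_dim=768, action_dim=7, max_len=64):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patchify with a strided convolution, the standard ViT embedding.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.lang_proj = nn.Linear(lang_dim, d_model)  # language as one extra token
        self.spatial_pos = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        spatial_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=4)
        temporal_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=2)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, images, lang_emb):
        # images: (B, T, 3, img_size, img_size); lang_emb: (B, lang_dim)
        B, T = images.shape[:2]
        patches = self.patchify(images.flatten(0, 1))  # (B*T, d, h, w)
        patches = patches.flatten(2).transpose(1, 2)   # (B*T, N, d)
        lang_tok = self.lang_proj(lang_emb).repeat_interleave(T, dim=0).unsqueeze(1)
        tokens = torch.cat([lang_tok, patches], dim=1) + self.spatial_pos
        # Take the language token's output as the per-frame summary (assumption).
        frame = self.spatial(tokens)[:, 0].view(B, T, -1)  # (B, T, d)
        out = self.temporal(frame + self.temporal_pos[:, :T])
        return self.head(out)                          # (B, T, action_dim)
```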