# Policy Architectures

We provide three vision-language policy networks for encoding spatial-temporal information in robot learning for LLDM.
| Three Vision-Language Policy Networks | How Sentence Embedding is Injected |
| --- | --- |
| BCRNNPolicy (ResNet-LSTM) | FiLM layer |
| BCTransformerPolicy (ResNet-Transformer) | FiLM layer |
| BCViLTPolicy (ViT-Transformer) | Treated as a token in the spatial ViT |
### BCRNNPolicy (ResNet-LSTM) (see [Robomimic](https://arxiv.org/abs/2108.03298))

Visual observations are encoded with a ResNet-based encoder, and the temporal information is summarized by an LSTM. The sentence embedding of the task description is injected into the network via [FiLM](https://arxiv.org/pdf/1709.07871.pdf) layers.

### BCTransformerPolicy (ResNet-Transformer) (see [VIOLA](https://arxiv.org/abs/2210.11339))

Visual observations are encoded with a ResNet-based encoder, and the temporal information is modeled by a temporal transformer that takes the encoded visual representations as tokens. The sentence embedding of the task description is injected into the network via [FiLM](https://arxiv.org/pdf/1709.07871.pdf) layers.

### BCViLTPolicy (ViT-Transformer) (see [ViLT](https://arxiv.org/abs/2102.03334))

Visual observations are encoded with a ViT-based encoder that patchifies the input images, and the temporal information is summarized by another transformer. The sentence embedding of the task description is fed to the spatial ViT as an additional token.
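The sketch below illustrates the two language-injection mechanisms described above: FiLM conditioning of a ResNet feature map (used by BCRNNPolicy and BCTransformerPolicy) and prepending the sentence embedding as an extra token to the ViT patch sequence (used by BCViLTPolicy). This is a minimal, illustrative example; the class and function names, shapes, and embedding dimensions are assumptions for demonstration and not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: conditions visual features on a
    sentence embedding by predicting a per-channel scale and shift."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, visual_feat: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) feature map from the ResNet encoder
        # lang_emb:    (B, lang_dim) sentence embedding of the task description
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]  # (B, C, 1, 1)
        beta = beta[:, :, None, None]
        return (1.0 + gamma) * visual_feat + beta


def prepend_language_token(patch_tokens: torch.Tensor,
                           lang_emb: torch.Tensor,
                           proj: nn.Linear) -> torch.Tensor:
    """BCViLTPolicy-style injection: project the sentence embedding to the
    token dimension and prepend it to the image patch tokens."""
    # patch_tokens: (B, N, D) patchified image tokens
    # lang_emb:     (B, lang_dim) sentence embedding
    lang_token = proj(lang_emb).unsqueeze(1)              # (B, 1, D)
    return torch.cat([lang_token, patch_tokens], dim=1)   # (B, N + 1, D)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)   # ResNet feature map
    tokens = torch.randn(2, 196, 256)   # ViT patch tokens
    emb = torch.randn(2, 768)           # e.g. a BERT-style sentence embedding

    film = FiLM(lang_dim=768, num_channels=64)
    print(film(feat, emb).shape)        # torch.Size([2, 64, 32, 32])

    proj = nn.Linear(768, 256)
    print(prepend_language_token(tokens, emb, proj).shape)  # torch.Size([2, 197, 256])
```

In the FiLM path the language only modulates existing visual channels, whereas in the token path the language participates directly in spatial self-attention alongside the image patches.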