📝 Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning 🔭
"DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder, and a lightweight network is introduced as the temporal encoder." [gal30b+] 🤖 #CV
⚙️ https://github.com/alibaba-mmai-research/DiST
🔗 https://arxiv.org/abs/2309.07911v1 #arxiv
https://creative.ai/system/media_attachments/files/111/076/386/412/782/112/original/3a25e41a0cbc3f53.jpg
https://creative.ai/system/media_attachments/files/111/076/386/475/712/105/original/5a1be656006a8e2b.jpg
https://creative.ai/system/media_attachments/files/111/076/386/536/187/709/original/bc88dc381d8271f1.jpg
https://creative.ai/system/media_attachments/files/111/076/386/612/444/428/original/ca973b82f0ab6b28.jpg
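
A toy sketch of the dual-encoder layout described in the quote above, in plain Python. It is not taken from the DiST repository: the function names (`spatial_encoder`, `temporal_encoder`, `encode_video`), the hand-written features, and the fusion by concatenation are all assumptions made for illustration. In the paper both branches are neural networks, with a pre-trained foundation model as the spatial branch.

```python
# Toy illustration of a dual-encoder layout; NOT the DiST implementation.
# All names and feature choices below are hypothetical.
from typing import List

Frame = List[float]   # one flattened frame
Video = List[Frame]   # frames in temporal order


def spatial_encoder(frame: Frame) -> List[float]:
    """Stand-in for a pre-trained image model: sees one frame at a time."""
    mean = sum(frame) / len(frame)
    peak = max(frame)
    return [mean, peak]


def temporal_encoder(video: Video) -> List[float]:
    """Stand-in for a lightweight temporal network: looks across frames."""
    # Mean absolute change between consecutive frames, a crude motion cue.
    diffs = [
        sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
        for prev, cur in zip(video, video[1:])
    ]
    return [sum(diffs) / len(diffs)] if diffs else [0.0]


def encode_video(video: Video) -> List[float]:
    """Run both branches and fuse them (concatenation is an assumption)."""
    per_frame = [spatial_encoder(f) for f in video]
    # Average each spatial feature over time.
    spatial = [sum(col) / len(col) for col in zip(*per_frame)]
    return spatial + temporal_encoder(video)
```

The point of the split is that the spatial branch never needs to look at more than one frame, so it can stay as it was pre-trained, while only the small temporal branch has to learn anything about motion.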