Add 3 image-text multimodal embedding models (a usage sketch follows the list):
* **CLIP**
    * page: [*image-text-embedding/clip*](https://towhee.io/image-text-embedding/clip)
    * paper: [*Learning Transferable Visual Models From Natural Language Supervision*](https://arxiv.org/pdf/2103.00020.pdf)
* **BLIP**
    * page: [*image-text-embedding/blip*](https://towhee.io/image-text-embedding/blip)
    * paper: [*BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation*](https://arxiv.org/pdf/2201.12086.pdf)
* **LightningDOT**
    * page: [*image-text-embedding/lightningdot*](https://towhee.io/image-text-embedding/lightningdot)
    * paper: [*LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval*](https://arxiv.org/pdf/2103.08784.pdf)
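All three operators produce image and text vectors in a shared space, so either modality can be embedded for cross-modal retrieval. Below is a minimal sketch with CLIP, assuming Towhee's DataCollection API; the image path, the caption, and the `clip_vit_b32` checkpoint name are illustrative placeholders, so check the operator page for the exact `model_name` values your version supports.

```python
import towhee

# Embed an image with the CLIP operator. 'clip_vit_b32' is an assumed
# checkpoint name -- see the operator page for the models it actually ships.
towhee.glob('./teddy.jpg') \
      .image_decode() \
      .image_text_embedding.clip(model_name='clip_vit_b32', modality='image') \
      .show()

# Embed a caption with the same operator; image and text vectors land in
# the same space, which is what enables cross-modal retrieval.
towhee.dc(['A teddy bear on a skateboard in Times Square.']) \
      .image_text_embedding.clip(model_name='clip_vit_b32', modality='text') \
      .show()
```

BLIP and LightningDOT should be callable the same way (`image_text_embedding.blip`, `image_text_embedding.lightningdot`), differing mainly in the `model_name` values they accept.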
Add 6 video understanding/action-classification models, all served by a single PyTorchVideo operator (a usage sketch follows the list):
* **I3D** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset*](https://arxiv.org/pdf/1705.07750.pdf)
* **C2D** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*Non-local Neural Networks*](https://arxiv.org/pdf/1711.07971.pdf)
* **Slow** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*SlowFast Networks for Video Recognition*](https://arxiv.org/pdf/1812.03982.pdf)
* **SlowFast** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*SlowFast Networks for Video Recognition*](https://arxiv.org/pdf/1812.03982.pdf)
* **X3D** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*X3D: Expanding Architectures for Efficient Video Recognition*](https://arxiv.org/pdf/2004.04730.pdf)
* **MViT** (from PyTorchVideo)
    * page: [*action-classification/pytorchvideo*](https://towhee.io/action-classification/pytorchvideo)
    * paper: [*Multiscale Vision Transformers*](https://arxiv.org/pdf/2104.11227.pdf)
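All six backbones are exposed through the single `action-classification/pytorchvideo` operator and selected via `model_name`. A minimal sketch, assuming the same DataCollection API as above; the video path and the checkpoint names in the comment are illustrative assumptions, so consult the operator page for the names your Towhee version accepts.

```python
import towhee

# Decode a clip with FFmpeg and classify the action. All six models go
# through the same operator; only model_name changes (e.g. 'i3d_r50',
# 'c2d_r50', 'slow_r50', 'slowfast_r50', 'x3d_m', 'mvit_base_16x4').
# The video path and checkpoint names here are illustrative assumptions.
towhee.glob('./archery.mp4') \
      .video_decode.ffmpeg() \
      .action_classification.pytorchvideo(model_name='x3d_m') \
      .show()
```

Keeping one operator for the whole family means swapping architectures is a one-argument change, which makes it easy to benchmark the six models against each other on the same pipeline.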