# UForm v1: Multimodal Chat in 1.5 Billion Parameters
The UForm family of tiny multimodal transformer models just got bigger! In addition to the existing CLIP-like embedding models, we now have a generative model useful for image captioning, visual question answering, and multimodal chats. All of that in just 1.5 billion parameters, small enough to fit even on mobile devices 🎉
Repository: https://github.com/unum-cloud/uform
Generative model: https://huggingface.co/unum-cloud/uform-gen
Chat model: https://huggingface.co/unum-cloud/uform-gen-chat
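To give a feel for the API, here is a minimal captioning sketch. It follows the usage shown on the `unum-cloud/uform-gen` model card; treat the exact class names (`VLMForCausalLM`, `VLMProcessor`), the `[cap]`/`[vqa]` prompt prefixes, and the special token IDs as assumptions taken from that card rather than a stable, guaranteed interface.

```python
# A minimal sketch of image captioning with uform-gen,
# assuming the `uform` package (pip install uform) exposes the
# classes shown on the Hugging Face model card.
import torch
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# "[cap]" requests a caption; "[vqa]" would prefix a question instead.
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("example.jpg")  # hypothetical local file

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=128,
        eos_token_id=32001,  # end-of-answer token per the model card
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Strip the prompt tokens and decode only the newly generated caption.
prompt_len = inputs["input_ids"].shape[1]
caption = processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)[0]
print(caption)
```

Swapping the prompt for a `[vqa]`-prefixed question turns the same pipeline into visual question answering.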
## Evaluation Metrics
![UForm-Gen preview](https://github.com/ashvardanian/usearch-images/blob/main/assets/uform-gen-preview.jpg?raw=true)
Being the smallest model of its kind, `unum-cloud/uform-gen` is hard to compare fairly to others. The next smallest are LLaVA and InstructBLIP, roughly 5x larger at 7 billion parameters. LLaVA performs noticeably better on VQAv2 (78.5 vs. 66.5 for UForm-Gen), while on captioning the CLIPScore and RefCLIPScore numbers remain relatively close across all models.
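For readers unfamiliar with the captioning metrics: CLIPScore (Hessel et al., 2021) rates a candidate caption $C$ against the image $I$ itself via CLIP embeddings, needing no ground-truth references, while RefCLIPScore takes the harmonic mean of CLIPScore and the best cosine similarity to the reference captions $R$:

$$
\text{CLIPScore}(I, C) = 2.5 \cdot \max\big(\cos(E_I, E_C),\ 0\big)
$$

$$
\text{RefCLIPScore}(I, C, R) = \text{HM}\Big(\text{CLIPScore}(I, C),\ \max_{r \in R} \max\big(\cos(E_C, E_r),\ 0\big)\Big)
$$

where $E_x$ denotes the CLIP embedding of an image or text and $\text{HM}$ the harmonic mean.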
| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
| :---------------------------------- | ---: | -------------: | --------: | -----------: |