## Getting started

### 1. Find a multimodal model

GGUF models with vision capabilities are uploaded to Hugging Face alongside an `mmproj` file. For instance, [unsloth/gemma-3-4b-it-GGUF](https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/tree/main) includes one.

### 2. Download the model to `user_data/models`

As an example, download https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_S.gguf?download=true to your `text-generation-webui/user_data/models` folder.

### 3. Download the associated mmproj file to `user_data/mmproj`

Then download https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true to your `text-generation-webui/user_data/mmproj` folder. Rename it to `mmproj-gemma-3-4b-it-F16.gguf` so it is easy to recognize later.

### 4. Load the model

1. Launch the web UI.
2. Navigate to the Model tab.
3. Select the GGUF model in the Model dropdown.
4. Select the mmproj file in the Multimodal (vision) menu.
5. Click "Load".

### 5. Send a message with an image

Attach your image by clicking the 📎 icon and send your message. The model will reply with an answer informed by the image contents.

## Multimodal with ExLlamaV3

Multimodal also works with the ExLlamaV3 loader (the non-HF one). No additional files are necessary; just load a multimodal EXL3 model and send an image.

Examples of models that you can use:

- https://huggingface.co/turboderp/gemma-3-27b-it-exl3
- https://huggingface.co/turboderp/Mistral-Small-3.1-24B-Instruct-2503-exl3

## Multimodal API examples

Ready-to-use examples can be found on the page below:

[Multimodal/vision (llama.cpp and ExLlamaV3)](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#multimodalvision-llamacpp-and-exllamav3)
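
For a quick test from the terminal, the sketch below sends a base64-encoded image to the local OpenAI-compatible endpoint. This is a minimal example under a few assumptions: the web UI was started with the `--api` flag on the default port 5000, a multimodal model and its mmproj file are already loaded, and the server accepts OpenAI-style `image_url` content parts as shown in the examples linked above. The host, port, file name, and prompt are placeholders; adjust them for your setup.

```python
# Minimal sketch: send an image to the local OpenAI-compatible chat endpoint.
# Assumes the web UI is running with --api on the default port 5000 and a
# multimodal model + mmproj file are already loaded.
import base64
import requests

# Encode a local image as base64 so it can be embedded in the request.
with open("example.jpg", "rb") as f:  # hypothetical image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    "max_tokens": 512,
}

response = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json=payload,
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```

The same request body works with any OpenAI-compatible client library; only the base URL needs to point at the local server.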