MiniCPM-V & o Cookbook

MiniCPM-V 4.6 - llama.cpp

Note

MiniCPM-V 4.6 support has been merged into the official llama.cpp (PR #22529) and is included starting from release b9049.

Compared to v4.5, v4.6 reworks the vision tower (the resampler is replaced by a new merger structure for better ViT efficiency), and GGUF conversion is now folded into the standard convert_hf_to_gguf.py flow; no model-specific surgery script is needed.

1. Build llama.cpp

Clone the upstream repository (requires a commit at or after release b9049):

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
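
To confirm that a checkout is recent enough, its build tag can be compared against b9049. The following is a minimal sketch, not part of llama.cpp: the helper name build_is_supported is hypothetical, and it assumes release tags follow the b<number> pattern (for example as printed by git describe --tags --abbrev=0 inside the clone).

```shell
# Hypothetical helper: succeeds when a llama.cpp release tag such as "b9049"
# is at or after the first release that includes MiniCPM-V 4.6 support.
MIN_BUILD=9049

build_is_supported() {
    tag=$1
    num=${tag#b}                      # strip the leading "b"
    case $num in
        ''|*[!0-9]*) return 2 ;;      # not a plain build number
    esac
    [ "$num" -ge "$MIN_BUILD" ]
}

# Example (run inside the cloned repository):
#   build_is_supported "$(git describe --tags --abbrev=0)" || echo "please update llama.cpp"
```

The function only compares numbers; it does not fetch or check out anything, so it is safe to run before deciding whether to update the clone.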

Build with CMake (see build docs for details):

CPU / Metal:

cmake -B build
cmake --build build --config Release

CUDA:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
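
If the same build script has to work on machines with and without CUDA, the backend flag can be chosen at configure time. This is a rough sketch under one assumption: that the presence of nvcc on PATH is a good enough signal for a usable CUDA toolkit. The helper name backend_flags is hypothetical.

```shell
# Hypothetical helper: print the extra CMake flag for the CUDA backend when
# the CUDA compiler (nvcc) is on PATH; print nothing for the CPU / Metal build.
backend_flags() {
    if command -v nvcc >/dev/null 2>&1; then
        printf '%s\n' "-DGGML_CUDA=ON"
    fi
}

# Example (run from the llama.cpp checkout):
#   cmake -B build $(backend_flags)
#   cmake --build build --config Release
```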

2. GGUF files

Option 1: Download official GGUF files

Download the language model file (e.g., MiniCPM-V-4.6-Q4_K_M.gguf) and the vision projector (mmproj-MiniCPM-V-4.6-F16.gguf) from:

The Thinking variant is published separately at:

Option 2: Convert from PyTorch model

Download the PyTorch checkpoint:

Run the standard convert_hf_to_gguf.py from the llama.cpp repo:

# 1) Convert the language model + vision merger to GGUF
python ./convert_hf_to_gguf.py /path/to/MiniCPM-V-4.6 \
    --outfile /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-F16.gguf \
    --outtype f16

# 2) Convert the vision projector (mmproj)
python ./convert_hf_to_gguf.py /path/to/MiniCPM-V-4.6 \
    --mmproj \
    --outfile /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf

convert_hf_to_gguf.py autodetects MiniCPMV4_6ForConditionalGeneration from config.json and emits both the LM and the vision tower.

v4.6 no longer needs the two-step process of legacy-models/minicpmv-surgery.py followed by minicpmv-convert-image-encoder-to-gguf.py. If you're following an older v4.5 guide, skip those scripts.
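
The conversion above yields F16 files only, while some inference commands in section 3 use a Q4_K_M file. One way to produce it from the F16 GGUF is llama.cpp's llama-quantize tool. The sketch below assumes the binary was built to build/bin/llama-quantize and that the F16 file name ends in -F16.gguf as in the commands above; the helper names and the LLAMA_QUANTIZE override are hypothetical.

```shell
# Hypothetical helper: derive the quantized file name from the F16 file name,
# e.g. .../MiniCPM-V-4.6-F16.gguf + Q4_K_M -> .../MiniCPM-V-4.6-Q4_K_M.gguf
quant_outfile() {
    printf '%s\n' "${1%-F16.gguf}-$2.gguf"
}

# Hypothetical wrapper around llama-quantize (arguments: input, output, type).
# LLAMA_QUANTIZE is an assumed override for the binary location.
quantize_model() {
    "${LLAMA_QUANTIZE:-./build/bin/llama-quantize}" "$1" "$(quant_outfile "$1" "$2")" "$2"
}

# Example (run from the llama.cpp checkout):
#   quantize_model /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-F16.gguf Q4_K_M
```

Only the language model is quantized here; the commands in this guide keep the mmproj file at F16.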

3. Model Inference

Important

Recent llama.cpp builds (after PR #20606) default --reasoning to auto, which takes the thinking setting from the chat template. The v4.6 Instruct checkpoint does not emit <think> blocks, but its template enables thinking by default, so auto produces broken Instruct output. Always pass --reasoning off explicitly on Instruct inference commands. For the Thinking checkpoint, leave the default or pass --reasoning on.

cd build/bin/

# F16 weights
./llama-mtmd-cli \
    -m  /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-F16.gguf \
    --mmproj /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf \
    -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
    --reasoning off \
    --image xx.jpg -p "What is in the image?"

# 4-bit quantized weights (Q4_K_M)
./llama-mtmd-cli \
    -m  /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-Q4_K_M.gguf \
    --mmproj /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf \
    -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
    --reasoning off \
    --image xx.jpg -p "What is in the image?"

# Interactive mode
./llama-mtmd-cli \
    -m  /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-Q4_K_M.gguf \
    --mmproj /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf \
    -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
    --reasoning off \
    --image xx.jpg -i
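
To run the same prompt over several images, the single-image command can be wrapped in a loop. The following is a sketch for the Instruct checkpoint only, reusing the sampling flags from the commands above; the function name describe_images and the LLAMA_MTMD_CLI override are hypothetical, and it assumes the images are .jpg files in one directory.

```shell
# Hypothetical batch helper: run the Instruct model once per .jpg in a directory.
# LLAMA_MTMD_CLI is an assumed override for the binary location.
describe_images() {
    model=$1; mmproj=$2; dir=$3
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue     # directory without .jpg files
        printf '== %s ==\n' "$img"
        "${LLAMA_MTMD_CLI:-./llama-mtmd-cli}" \
            -m "$model" --mmproj "$mmproj" \
            -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
            --reasoning off \
            --image "$img" -p "What is in the image?"
    done
}

# Example (run from build/bin/):
#   describe_images /path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-Q4_K_M.gguf \
#       /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf ./images
```

Each iteration starts a new process and reloads the model, which is simple but slow for large batches.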

If you're running the Thinking checkpoint, you can control the reasoning budget with --jinja and --reasoning-budget (--reasoning defaults to auto, which enables thinking for that checkpoint automatically):

# Allow unlimited thinking (Thinking model)
./llama-mtmd-cli \
    -m  /path/to/MiniCPM-V-4.6-Thinking/MiniCPM-V-4.6-Thinking-Q4_K_M.gguf \
    --mmproj /path/to/MiniCPM-V-4.6-Thinking/mmproj-MiniCPM-V-4.6-Thinking-F16.gguf \
    -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
    --image xx.jpg --jinja --reasoning-budget -1 -p "What is in the image?"

# Skip the leading <think> block on a Thinking model
./llama-mtmd-cli \
    -m  /path/to/MiniCPM-V-4.6-Thinking/MiniCPM-V-4.6-Thinking-Q4_K_M.gguf \
    --mmproj /path/to/MiniCPM-V-4.6-Thinking/mmproj-MiniCPM-V-4.6-Thinking-F16.gguf \
    -c 8192 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 \
    --image xx.jpg --jinja --reasoning-budget 0 -p "What is in the image?"

The Instruct checkpoint has no <think> block, so --reasoning-budget is a no-op there.
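
Scripts that serve both checkpoints need to pass --reasoning off for Instruct and leave the default for Thinking. A minimal sketch of that choice follows; the helper name reasoning_flags is hypothetical, and it assumes Thinking checkpoints carry "Thinking" in the GGUF file name, as in the example paths of this guide.

```shell
# Hypothetical helper: choose the reasoning flags from the GGUF file name.
# Only the file name is inspected, not the directory it lives in.
reasoning_flags() {
    case $(basename "$1") in
        *Thinking*) ;;                          # keep the default (auto)
        *) printf '%s\n' "--reasoning off" ;;   # Instruct: disable thinking
    esac
}

# Example (run from build/bin/):
#   model=/path/to/MiniCPM-V-4.6/MiniCPM-V-4.6-Q4_K_M.gguf
#   ./llama-mtmd-cli -m "$model" \
#       --mmproj /path/to/MiniCPM-V-4.6/mmproj-MiniCPM-V-4.6-F16.gguf \
#       $(reasoning_flags "$model") --image xx.jpg -p "What is in the image?"
```

The command substitution is left unquoted on purpose so that "--reasoning off" splits into two arguments and an empty result adds none.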

Argument Reference:

| Argument | Description |
| --- | --- |
| -m, --model | Path to the language model |
| --mmproj | Path to the vision projector |
| --image | Path to the input image |
| -p, --prompt | The prompt |
| -c, --ctx-size | Maximum context size |
| -rea, --reasoning [on\|off\|auto] | Enable / disable thinking. Default auto (read from chat template). Must be off for Instruct; can stay default or be on for Thinking |
| --reasoning-budget | Maximum tokens used for reasoning (-1 unlimited, 0 immediate end); only meaningful for the Thinking checkpoint |
| --jinja | Use the model's Jinja chat template (enabled by default in recent llama.cpp) |