MiniCPM-V MiniCPM-V & o Cookbook

MiniCPM-V 4.6 - SGLang Documentation

Note

MiniCPM-V 4.6 is supported on the official SGLang main branch since PR #24998 (merged 2026-05-12). No fork is required β€” install directly from upstream, making sure your checkout is at or after that commit.

MiniCPM-V 4.6 is registered in transformers>=5.7.0 as a standalone architecture (MiniCPMV4_6ForConditionalGeneration); the SGLang adapter follows that layout.

MiniCPM-V 4.6 ships as two checkpoints:

1. Installing SGLang

Install SGLang from upstream main

Until a tagged SGLang release ships with #24998, install from upstream main (watch SGLang Releases; once a release includes that PR, pip install -U "sglang[all]" is enough):

git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

transformers>=5.7.0 is installed automatically β€” this in turn requires a recent PyTorch (β‰₯ 2.6 at the time of writing). Verify the resolved versions match what FlashInfer needs before installing FlashInfer below:

python -c "import torch, transformers; print('torch', torch.__version__, '| cuda', torch.version.cuda, '| transformers', transformers.__version__)"

Important

FlashInfer wheels are pinned to a specific (torch, cuda) combo. Pick the wheel index that matches the torch + CUDA you just verified above β€” don't blindly copy a cu121/torch2.4 URL, that will silently downgrade torch and break the SGLang / transformers install.

The general index lives at https://flashinfer.ai/whl/. Pick the directory matching your environment, for example:

Your torch / CUDA Index URL
torch 2.6 + CUDA 12.4 https://flashinfer.ai/whl/cu124/torch2.6/
torch 2.6 + CUDA 12.6 https://flashinfer.ai/whl/cu126/torch2.6/
torch 2.7 + CUDA 12.8 https://flashinfer.ai/whl/cu128/torch2.7/

Then install via either:

# Method 1 β€” pip from the right index (slow / blocked in CN)
pip install flashinfer-python -i <index URL from table above>

# Method 2 β€” download the matching wheel manually
#   1) Open the index URL in a browser, find a wheel that matches your
#      python version (cp310 / cp311 / ...) and platform (linux_x86_64 / win_amd64)
#   2) pip install <downloaded-wheel.whl>

For everything else (Docker images, CPU-only fallback, etc.) see the official SGLang installation docs.

2. Launching the Inference Server

By default the server downloads weights from the HuggingFace Hub:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16

Or specify a local path:

python -m sglang.launch_server --model-path /your/local/MiniCPM-V-4.6 --port 30000 --trust-remote-code --dtype bfloat16

To serve the Thinking variant, swap the model id:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6-Thinking --port 30000 --trust-remote-code --dtype bfloat16

3. Calling the Service

If image_url is not reachable from your machine, replace it with a local path / base64 data URL.

v4.6 uses the Qwen3.5 vocabulary β€” pass stop_token_ids = [248044, 248046] if you observe the model continuing past the answer.

For more invocation patterns, see the SGLang documentation.