MiniCPM-V & o Cookbook

MiniCPM-V 4.6 - SGLang Documentation

Note

SGLang upstream support for MiniCPM-V 4.6 is currently being merged. Until the PR lands in an official release, please install SGLang from the OpenBMB SGLang fork below.

MiniCPM-V 4.6 is registered in transformers>=5.7.0 as a standalone architecture (MiniCPMV4_6ForConditionalGeneration); the SGLang adapter follows that layout.

MiniCPM-V 4.6 ships as two checkpoints:

- openbmb/MiniCPM-V-4.6 (the standard model)
- openbmb/MiniCPM-V-4.6-Thinking (the Thinking variant)

1. Installing SGLang

Install SGLang from the PR / fork branch:

# clone the OpenBMB-maintained SGLang branch with v4.6 support
git clone -b Support-MiniCPM-V-4.6 https://github.com/tc-mb/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

transformers>=5.7.0 will be installed automatically.
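
To confirm the editable install picked up the fork and the required transformers version, a quick sanity check (assumes the Python environment from the steps above):

# both packages should import, and transformers should report >= 5.7.0
import sglang
import transformers

print("sglang:", sglang.__version__)
print("transformers:", transformers.__version__)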

SGLang uses FlashInfer as its attention kernel backend; install it in either of two ways.

Method 1: pip install from the FlashInfer wheel index:

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

Method 2: install from a downloaded wheel file:

pip install flashinfer-0.1.6+cu121torch2.4-cp310-cp310-linux_x86_64.whl

For more details, see the official SGLang installation docs.

2. Launching the Inference Server

By default the server downloads weights from the HuggingFace Hub:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6 --port 30000 --trust-remote-code

Or specify a local path:

python -m sglang.launch_server --model-path /your/local/MiniCPM-V-4.6 --port 30000 --trust-remote-code

To serve the Thinking variant, swap the model id:

python -m sglang.launch_server --model-path openbmb/MiniCPM-V-4.6-Thinking --port 30000 --trust-remote-code
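
Once the server is up, you can verify that it is serving the model. A quick check (a sketch, assuming the default localhost:30000 from the commands above; /v1/models is the OpenAI-compatible model listing route):

import requests

# list the models the server is serving; the output should include the MiniCPM-V-4.6 path
print(requests.get("http://localhost:30000/v1/models").json())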

3. Calling the Service
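
A minimal chat-completions call through the OpenAI-compatible endpoint that SGLang exposes (a sketch: the image URL is a placeholder, and the api_key value is unused by a local server):

from openai import OpenAI

# point the standard OpenAI client at the local SGLang server
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[
        {
            "role": "user",
            "content": [
                # placeholder image URL; see the note below if it is unreachable
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)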

If the image_url in the example is not reachable from your machine, replace it with a local path or a base64 data URL.

v4.6 uses the Qwen3.5 vocabulary; pass stop_token_ids = [248044, 248046] if you observe the model continuing past the answer.
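
For example, continuing with the client above (a sketch: extra_body forwards non-standard fields to the server, and assumes your SGLang build accepts stop_token_ids on the chat-completions route; the native /generate endpoint's sampling_params also takes it):

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4.6",
    messages=[{"role": "user", "content": "Answer in one word: what color is the sky?"}],
    extra_body={"stop_token_ids": [248044, 248046]},  # v4.6 stop token ids from the note above
)
print(response.choices[0].message.content)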

For more invocation patterns, see the SGLang documentation.