MiniCPM-V 4.6 vLLM Deployment Guide
MiniCPM-V 4.6 ships as two separate checkpoints:
| Variant | HuggingFace ID | ModelScope ID |
|---|---|---|
| Instruct | openbmb/MiniCPM-V-4.6 | OpenBMB/MiniCPM-V-4.6 |
| Thinking | openbmb/MiniCPM-V-4.6-Thinking | OpenBMB/MiniCPM-V-4.6-Thinking |
Unlike v4.5, which switched modes via `enable_thinking`, v4.6 ships Thinking and Instruct as independent checkpoints; pick the one that matches your use case.
1. Environment Setup
1.1 Install vLLM (from PR branch, recommended for now)
> **Note:** The vLLM upstream PR (#41254) is currently under review. Until it lands in an official release, please install from the PR branch.
```bash
# Create a clean conda environment
conda create -n vllm-v46 python=3.10 -y
conda activate vllm-v46

# Clone the PR branch and install
git clone -b Support-MiniCPM-V-4.6 https://github.com/tc-mb/vllm.git vllm-v46
cd vllm-v46
MAX_JOBS=6 VLLM_USE_PRECOMPILED=1 pip install --editable . -v --progress-bar=on
```
For video inference, install the video module:
```bash
pip install "vllm[video]"
```
This branch already requires `transformers>=5.7.0`, in which MiniCPM-V 4.6 has been merged as a standalone architecture (`MiniCPMV4_6ForConditionalGeneration`).
Once the PR is merged, you'll be able to install directly from PyPI (`pip install vllm`). The cookbook will be updated with the supported version.
You can verify the installation with:
python -c "import vllm, transformers; print('vllm', vllm.__version__, '| transformers', transformers.__version__)"
2. API Service Deployment
2.1 Launch API Service
```bash
vllm serve <model_path> \
    --dtype auto \
    --max-model-len 8192 \
    --api-key token-abc123 \
    --gpu_memory_utilization 0.9 \
    --trust-remote-code \
    --max-num-batched-tokens 8192
```
Parameter Description:
- `<model_path>`: local path to MiniCPM-V-4.6, or the HuggingFace ID `openbmb/MiniCPM-V-4.6` / `openbmb/MiniCPM-V-4.6-Thinking`
- `--api-key`: API access key
- `--max-model-len`: maximum context length. v4.6 supports up to 256K, but start small to fit GPU memory
- `--gpu_memory_utilization`: fraction of each GPU's memory that vLLM may allocate
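Once the server logs that it is ready, you can smoke-test the endpoint by listing the served models. A minimal sketch, assuming the default port 8000 and the API key from the launch command above:

```python
# Smoke test: list the models the server exposes via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")
print([m.id for m in client.models.list().data])
```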
2.2 Image Inference
```python
from openai import OpenAI
import base64

openai_api_key = "token-abc123"  # must match the value passed to `vllm serve`
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

with open('./assets/airplane.jpeg', 'rb') as file:
    image = "data:image/jpeg;base64," + base64.b64encode(file.read()).decode('utf-8')

chat_response = client.chat.completions.create(
    model="<model_path>",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe this image"},
            {"type": "image_url", "image_url": {"url": image}},
        ],
    }],
    extra_body={
        # v4.6 uses the Qwen3.5 backbone with a new vocab; note the different stop token IDs
        "stop_token_ids": [248044, 248046]
    }
)
print("Chat response:", chat_response)
print("Chat response content:", chat_response.choices[0].message.content)
```
2.3 Video Inference
```python
from openai import OpenAI
import base64

openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

with open('./videos/video.mp4', 'rb') as video_file:
    video_base64 = base64.b64encode(video_file.read()).decode('utf-8')

chat_response = client.chat.completions.create(
    model="<model_path>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe this video"},
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_base64}"}},
            ],
        },
    ],
    extra_body={"stop_token_ids": [248044, 248046]}
)
print(chat_response.choices[0].message.content)
```
v4.6 keeps the video pipeline simple: the image processor is invoked frame by frame, and the per-video frame budget is controlled by `mm_processor_kwargs` / `video_processor.max_slice_nums`. No special video backend is required.
To raise or lower the per-video frame slice budget, pass `mm_processor_kwargs`:
```python
extra_body = {
    "stop_token_ids": [248044, 248046],
    "mm_processor_kwargs": {"max_slice_nums": 2},
}
```
2.4 Thinking Mode
If you serve the `openbmb/MiniCPM-V-4.6-Thinking` checkpoint, the chat template injects a `<think>` block by default, and the assistant returns its reasoning followed by a closing `</think>` tag and then the final answer. You can override this per request via `chat_template_kwargs`:
```python
extra_body = {
    "stop_token_ids": [248044, 248046],
    # disable the leading <think> block on a Thinking model
    "chat_template_kwargs": {"enable_thinking": False},
}
```
The Instruct checkpoint never emits `<think>` blocks; pick the right checkpoint instead of toggling at request time.
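When thinking is left enabled, the reasoning trace can be split from the final answer client-side. A minimal sketch, assuming the `reasoning</think>answer` layout described above:

```python
# Separate the reasoning trace from the final answer on a Thinking checkpoint.
raw = chat_response.choices[0].message.content
reasoning, sep, answer = raw.partition("</think>")
if sep:  # a </think> delimiter was found
    print("Reasoning:", reasoning.strip())
    print("Answer:", answer.strip())
else:  # no thinking block, e.g. enable_thinking=False
    print("Answer:", raw.strip())
```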
2.5 Multi-turn Conversation
Launch Parameter Configuration
For multi-image / multi-video conversations, raise the `--limit-mm-per-prompt` budget at launch:
```bash
# allow up to 3 videos per request
vllm serve <model_path> --dtype auto --max-model-len 16384 --api-key token-abc123 \
    --gpu_memory_utilization 0.9 --trust-remote-code \
    --limit-mm-per-prompt '{"video": 3}'

# mixed images + video
vllm serve <model_path> --dtype auto --max-model-len 16384 --api-key token-abc123 \
    --gpu_memory_utilization 0.9 --trust-remote-code \
    --limit-mm-per-prompt '{"image": 5, "video": 2}'
```
Multi-turn Conversation Example
```python
from openai import OpenAI
import base64
import mimetypes
import os

openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

# Conversation history, replayed in full on every request
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def file_to_base64(file_path):
    with open(file_path, 'rb') as f:
        return base64.b64encode(f.read()).decode('utf-8')

def get_mime_type(file_path):
    mime, _ = mimetypes.guess_type(file_path)
    return mime or 'application/octet-stream'

def build_file_content(file_path):
    # Wrap a local file as an image_url / video_url content part
    mime_type = get_mime_type(file_path)
    base64_data = file_to_base64(file_path)
    url = f"data:{mime_type};base64,{base64_data}"
    if mime_type.startswith("image/"):
        return {"type": "image_url", "image_url": {"url": url}}
    elif mime_type.startswith("video/"):
        return {"type": "video_url", "video_url": {"url": url}}
    else:
        print(f"Unsupported file type: {mime_type}")
        return None

while True:
    user_text = input("Please enter your question (type 'exit' to quit): ")
    if user_text.strip().lower() == "exit":
        break
    content = [{"type": "text", "text": user_text}]
    upload_file = input("Upload a file? (y/n): ").strip().lower() == 'y'
    if upload_file:
        file_path = input("Please enter file path: ").strip()
        if os.path.exists(file_path):
            file_content = build_file_content(file_path)
            if file_content:
                content.append(file_content)
        else:
            print("File path does not exist, skipping file upload.")
    messages.append({"role": "user", "content": content})
    chat_response = client.chat.completions.create(
        model="<model_path>",
        messages=messages,
        extra_body={"stop_token_ids": [248044, 248046]},
    )
    ai_message = chat_response.choices[0].message
    print("MiniCPM-V 4.6:", ai_message.content)
    messages.append({"role": "assistant", "content": ai_message.content})
```
3. Offline Inference
```python
from transformers import AutoProcessor
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_NAME = "<model_path>"
# Or directly:
# MODEL_NAME = "openbmb/MiniCPM-V-4.6"

image = Image.open("./assets/airplane.jpeg").convert("RGB")
processor = AutoProcessor.from_pretrained(MODEL_NAME)

llm = LLM(
    model=MODEL_NAME,
    max_model_len=8192,
    trust_remote_code=True,
    disable_mm_preprocessor_cache=True,
    limit_mm_per_prompt={"image": 5},
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Please describe the content of this image"},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
        # for multi-image: "image": [image1, image2]
    },
}

sampling_params = SamplingParams(
    stop_token_ids=[248044, 248046],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
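`llm.generate` also accepts a list of inputs, so several requests can be batched into one call; a sketch reusing the `prompt` and `image` built above:

```python
# Batched offline inference: pass a list of inputs to a single generate() call.
batch = [
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    {"prompt": prompt, "multi_modal_data": {"image": image}},
]
outputs = llm.generate(batch, sampling_params=sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```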
Notes
- Model Path: replace `<model_path>` with your local path or one of `openbmb/MiniCPM-V-4.6` / `openbmb/MiniCPM-V-4.6-Thinking`.
- Stop Tokens: v4.6 uses the Qwen3.5 vocabulary; the correct `stop_token_ids` are `[248044, 248046]` (v4.5 used `[1, 151645]`).
- API Key: ensure the key passed to `vllm serve` matches the one used by the client.
- Memory: tune `--gpu_memory_utilization`, `--max-model-len`, and `--max-num-batched-tokens` for your hardware. v4.6 supports up to 256K context, but you typically don't need it.
- Multimodal Limits: set `--limit-mm-per-prompt` for multi-image / multi-video sessions.