MiniCPM-V & o Cookbook

BNB

:::{Note}
Support: MiniCPM-V 4.6 / MiniCPM-V 4.5 / MiniCPM-V 4.0 / MiniCPM-V 2.6 / MiniCPM-V 2.5
:::

1. Download the Model

Download the MiniCPM-V-4.6 model from Hugging Face (or openbmb/MiniCPM-V-4.6-Thinking for the thinking variant) and save it to a local directory.
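
One way to fetch the weights is with huggingface_hub; a minimal sketch, assuming the repo id from the model name (adjust `repo_id` and `local_dir` to your setup):

```python
from huggingface_hub import snapshot_download

# Download the full model repository (weights, config, processor files).
snapshot_download(
    repo_id="openbmb/MiniCPM-V-4.6",   # assumed repo id; use the -Thinking repo for that variant
    local_dir="/model/MiniCPM-V-4.6",  # matches model_path in the script below
)
```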

2. Quantization Script

The following script loads the original model, quantizes it to 4-bit using bitsandbytes, runs a sanity-check inference, and saves the quantized model.

Requires `transformers>=5.7.0` (where MiniCPM-V 4.6 is registered as a standalone architecture), plus bitsandbytes and accelerate for 4-bit loading with a device map, and Pillow for image handling, e.g. `pip install "transformers>=5.7.0" bitsandbytes accelerate pillow`.

```python
import os
import time
import torch
from transformers import (
    AutoProcessor,
    AutoModelForImageTextToText,
    BitsAndBytesConfig,
)
from PIL import Image

assert torch.cuda.is_available(), "CUDA is not available, but this code requires a GPU."

device = "cuda"
model_path = "/model/MiniCPM-V-4.6"  # Original model path
save_path = "./model/MiniCPM-V-4.6-int4"  # Path to save quantized model
image_path = "./assets/airplane.jpeg"

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize linear layers to 4-bit at load time
    load_in_8bit=False,
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in fp16 after dequantization
    bnb_4bit_quant_storage=torch.uint8,
    bnb_4bit_quant_type="nf4",             # NormalFloat4; generally better than "fp4" for LLM weights
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants to save memory
    llm_int8_enable_fp32_cpu_offload=False,
    llm_int8_has_fp16_weight=False,
    llm_int8_skip_modules=["lm_head"],     # keep the LM head in full precision
    llm_int8_threshold=6.0,
)

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    device_map=device,
    quantization_config=quantization_config,
)
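
# Optional sanity check: confirm that linear layers were replaced by
# bitsandbytes 4-bit modules and report the quantized weight footprint.
# The exact module count depends on the architecture.
from bitsandbytes.nn import Linear4bit
num_4bit = sum(isinstance(m, Linear4bit) for m in model.modules())
print(f"{num_4bit} Linear4bit modules, "
      f"{model.get_memory_footprint() / 1e9:.2f} GB weight footprint")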

image = Image.open(image_path).convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "What is in this picture?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

start = time.time()
out_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.decode(
    out_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)

print("Output after quantization:", response)
print("Inference time after quantization:", time.time() - start)

# Save the quantized model and tokenizer/processor
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path, safe_serialization=True)
processor.save_pretrained(save_path)
```

3. Expected Output

After quantization you should see something like:

```
Output after quantization: The image depicts an Airbus A380-800 aircraft in mid-flight against a clear blue sky. The airplane is predominantly white with a distinctive blue tail fin featuring a red and white logo. ...
Inference time after quantization: ...
```

On a single A100 or RTX 4090, inference takes roughly 9 s; the exact time varies with hardware and the number of generated tokens.

The quantized model is saved under `save_path` and can be reloaded with `AutoModelForImageTextToText.from_pretrained(save_path)` for further inference or fine-tuning.
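
A minimal reload sketch, assuming the paths from the script above; the 4-bit settings are stored in the saved config, so no BitsAndBytesConfig needs to be passed again:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

save_path = "./model/MiniCPM-V-4.6-int4"
processor = AutoProcessor.from_pretrained(save_path)
model = AutoModelForImageTextToText.from_pretrained(
    save_path,
    device_map="cuda",  # quantization settings are read from the saved config
)
```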


To customize the model path, image, or save path, edit the variables at the top of the script.