MiniCPM-V & o Cookbook

Official Fine-tuning Scripts

We provide official scripts for fine-tuning pretrained MiniCPM-V / MiniCPM-o models on downstream tasks, using transformers Trainer and DeepSpeed.

Supported models: MiniCPM-V-4, MiniCPM-o-2_6, MiniCPM-V-2_6, MiniCPM-Llama3-V-2_5, MiniCPM-V-2.

1. Data Preparation

Each data sample should be a dictionary containing the image path and multi-turn conversation content. For example:

{
  "image": "path/to/image.jpg",
  "conversations": [
    {"role": "user", "content": "Describe this image."},
    {"role": "assistant", "content": "This is a cat."}
  ]
}
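
As a minimal sketch of how such samples could be checked before training, the snippet below validates the two fields described above and serializes the samples to JSON. The helper `validate_sample` is hypothetical and not part of the official scripts. It also assumes that turns alternate strictly between "user" and "assistant", starting with "user", and that the data file holds a top-level list of samples.

```python
import json


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if none).

    Hypothetical helper: assumes turns alternate user/assistant, user first.
    """
    problems = []
    image = sample.get("image")
    if not isinstance(image, (str, dict)) or not image:
        problems.append("'image' must be a non-empty path string or dict")
    conversations = sample.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        problems.append("'conversations' must be a non-empty list")
        return problems
    for i, turn in enumerate(conversations):
        expected = "user" if i % 2 == 0 else "assistant"
        if turn.get("role") != expected:
            problems.append(f"turn {i}: expected role {expected!r}, got {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            problems.append(f"turn {i}: 'content' must be a string")
    return problems


samples = [
    {
        "image": "path/to/image.jpg",
        "conversations": [
            {"role": "user", "content": "Describe this image."},
            {"role": "assistant", "content": "This is a cat."},
        ],
    }
]

# Report every problem, then serialize the samples as a JSON list.
for index, sample in enumerate(samples):
    for problem in validate_sample(sample):
        print(f"sample {index}: {problem}")

serialized = json.dumps(samples, indent=2, ensure_ascii=False)
print(serialized)
```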

Multi-image Input Format

To support multiple images, the image field should be a dictionary, with keys like <image_00>, <image_01>, and values as image paths:

{
  "image": {
    "<image_00>": "path/to/image1.jpg",
    "<image_01>": "path/to/image2.jpg"
  },
  "conversations": [
    {"role": "user", "content": "Compare <image_00> and <image_01>."},
    {"role": "assistant", "content": "The first image is a cat, the second is a dog."}
  ]
}
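
One easy mistake in the multi-image format is a placeholder that appears in the conversation text but has no entry in the image dictionary, or the reverse. The sketch below checks for both cases. `check_placeholders` is a hypothetical helper, and the pattern assumes placeholders always look like `<image_NN>` with digits after the underscore.

```python
import re

# Assumed placeholder shape: "<image_" followed by digits and ">".
PLACEHOLDER = re.compile(r"<image_\d+>")


def check_placeholders(sample):
    """Return (undeclared, unused) placeholder lists for a multi-image sample."""
    declared = set(sample["image"])
    referenced = set()
    for turn in sample["conversations"]:
        referenced.update(PLACEHOLDER.findall(turn["content"]))
    undeclared = sorted(referenced - declared)  # used in text, missing from "image"
    unused = sorted(declared - referenced)      # listed in "image", never mentioned
    return undeclared, unused


sample = {
    "image": {
        "<image_00>": "path/to/image1.jpg",
        "<image_01>": "path/to/image2.jpg",
    },
    "conversations": [
        {"role": "user", "content": "Compare <image_00> and <image_01>."},
        {"role": "assistant", "content": "The first image is a cat, the second is a dog."},
    ],
}

undeclared, unused = check_placeholders(sample)
print("undeclared:", undeclared, "unused:", unused)
```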

Dataset Loading and Preprocessing

Dataset loading and preprocessing rely mainly on the SupervisedDataset class.

Main Parameters

Common Issues and Notes

2. Full-parameter Fine-tuning

Full-parameter fine-tuning updates all parameters of the LLM throughout training. Specify the correct MODEL path, DATA path, and LLM_TYPE in the shell script.

You can find and review the training script here: finetune_ds.sh

MODEL="openbmb/MiniCPM-o-2_6" # or "openbmb/MiniCPM-V-4", "openbmb/MiniCPM-V-2_6", "openbmb/MiniCPM-Llama3-V-2_5", "openbmb/MiniCPM-V-2"
DATA="path/to/training_data" # JSON file
EVAL_DATA="path/to/test_data" # JSON file
# Set LLM_TYPE to match the model:
#   openbmb/MiniCPM-V-2                            -> minicpm
#   openbmb/MiniCPM-Llama3-V-2_5                   -> llama3
#   openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6 -> qwen
#   openbmb/MiniCPM-V-4                            -> llama
LLM_TYPE="qwen"

To launch your training, run the following script:

sh finetune_ds.sh
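
The LLM_TYPE rules from the script comments can also be written as a small lookup table, which makes a mismatch between MODEL and LLM_TYPE easy to catch before launching a run. This is only an illustrative sketch: `LLM_TYPE_BY_MODEL` and `llm_type_for` are hypothetical names, not part of the official scripts, and the mapping simply restates the comments above.

```python
# Mapping restated from the comments in the training script.
LLM_TYPE_BY_MODEL = {
    "openbmb/MiniCPM-V-2": "minicpm",
    "openbmb/MiniCPM-Llama3-V-2_5": "llama3",
    "openbmb/MiniCPM-V-2_6": "qwen",
    "openbmb/MiniCPM-o-2_6": "qwen",
    "openbmb/MiniCPM-V-4": "llama",
}


def llm_type_for(model_id):
    """Return the LLM_TYPE value for a supported model ID (hypothetical helper)."""
    try:
        return LLM_TYPE_BY_MODEL[model_id]
    except KeyError:
        supported = ", ".join(sorted(LLM_TYPE_BY_MODEL))
        raise ValueError(f"unknown model {model_id!r}; supported: {supported}") from None


for model_id in LLM_TYPE_BY_MODEL:
    print(f'MODEL="{model_id}" LLM_TYPE="{llm_type_for(model_id)}"')
```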

3. LoRA Fine-tuning

LoRA enables lightweight tuning by updating only a small subset of parameters. We provide a LoRA implementation based on peft.

You can find and review the training script here: finetune_lora.sh

MODEL="openbmb/MiniCPM-o-2_6" # or "openbmb/MiniCPM-V-4", "openbmb/MiniCPM-V-2_6", "openbmb/MiniCPM-Llama3-V-2_5", "openbmb/MiniCPM-V-2"
DATA="path/to/training_data" # JSON file
EVAL_DATA="path/to/test_data" # JSON file
# Set LLM_TYPE to match the model:
#   openbmb/MiniCPM-V-2                            -> minicpm
#   openbmb/MiniCPM-Llama3-V-2_5                   -> llama3
#   openbmb/MiniCPM-o-2_6 or openbmb/MiniCPM-V-2_6 -> qwen
#   openbmb/MiniCPM-V-4                            -> llama
LLM_TYPE="qwen"

To launch your training, run the following script:

sh finetune_lora.sh
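
To give a sense of how small the "small subset of parameters" in LoRA can be, the sketch below counts the parameters LoRA adds to a single linear layer. For a weight matrix of shape d_out x d_in, a rank-r adapter adds two matrices of shapes r x d_in and d_out x r, so r * (d_in + d_out) parameters in total. The layer size and rank used here are hypothetical illustration values, not the settings used by finetune_lora.sh.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by a rank-`rank` LoRA adapter on one d_out x d_in weight."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 64

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, rank)
fraction = lora_params / full_params

print(f"full weight: {full_params:,} parameters")
print(f"LoRA adapter: {lora_params:,} parameters")
print(f"fraction trained: {fraction:.4f}")
```

With these values the adapter holds 524,288 parameters against 16,777,216 in the full weight matrix, which is why LoRA checkpoints are much smaller than the base model.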

4. Loading a Fine-tuned Model

After LoRA training, you can load the model with the path to the adapter. We recommend using an absolute path for the pretrained model: LoRA saves only the adapter, and the path recorded in the adapter configuration JSON file is used to locate the pretrained model to load.

from peft import PeftModel
from transformers import AutoModel

# Base model the adapter was trained on.
model_type = "openbmb/MiniCPM-o-2_6"  # or "openbmb/MiniCPM-V-4", "openbmb/MiniCPM-V-2_6", "openbmb/MiniCPM-Llama3-V-2_5", "openbmb/MiniCPM-V-2"
# Directory containing the saved LoRA adapter.
path_to_adapter = "path_to_your_fine_tuned_checkpoint"

# Load the pretrained base model first.
model = AutoModel.from_pretrained(
    model_type,
    trust_remote_code=True,
)

# Attach the adapter weights, then switch to inference mode on the GPU.
lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True,
).eval().cuda()
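
Because the adapter configuration records where the pretrained model lives, it can help to inspect that value before loading. The sketch below reads the `base_model_name_or_path` field from `adapter_config.json`, the file peft writes next to the adapter weights, and reports whether it is an absolute path. `read_base_model_path` is a hypothetical helper, and the temporary directory stands in for a real checkpoint.

```python
import json
import os
import tempfile


def read_base_model_path(adapter_dir):
    """Return the base-model reference stored in a peft adapter directory."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("base_model_name_or_path")


# Demonstration with a throwaway adapter directory (hypothetical contents).
with tempfile.TemporaryDirectory() as adapter_dir:
    fake_base = os.path.join(adapter_dir, "base_model")  # absolute, since the temp dir is
    with open(os.path.join(adapter_dir, "adapter_config.json"), "w", encoding="utf-8") as f:
        json.dump({"base_model_name_or_path": fake_base}, f)
    base_model_path = read_base_model_path(adapter_dir)

is_absolute = os.path.isabs(base_model_path)
print("base model:", base_model_path)
print("absolute path:", is_absolute)
```

A Hub ID such as "openbmb/MiniCPM-o-2_6" is not an absolute path either, so a negative result mainly matters when the value is meant to be a local directory.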

5. Memory Usage Statistics

The following table shows the memory usage when fine-tuning on NVIDIA A100 (80 GiB) GPUs with DeepSpeed ZeRO-3, gradient checkpointing, and CPU offloading (max length 2048, batch size 1).

| Fine-tuning Method | 2 GPUs   | 4 GPUs   | 8 GPUs   |
|--------------------|----------|----------|----------|
| LoRA               | 14.4 GiB | 13.6 GiB | 13.1 GiB |
| Full-parameter     | 16.0 GiB | 15.8 GiB | 15.6 GiB |

Refer to the DeepSpeed ZeRO stages to further reduce memory cost.
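
For orientation, a ZeRO-3 setup with CPU offloading, as used for the measurements above, could be described by a DeepSpeed configuration along the following lines. This is a minimal sketch built as a Python dictionary and printed as JSON. It is not necessarily the configuration file shipped with the official scripts, and the "auto" values assume the Hugging Face Trainer integration fills them in from its own arguments.

```python
import json

# Sketch of a ZeRO stage 3 configuration with optimizer and parameter offload to CPU.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    # "auto" defers these values to the Hugging Face Trainer arguments (assumption).
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Lower ZeRO stages or disabling offload generally trade higher GPU memory use for faster training, so the stage is worth choosing based on the available hardware.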