## OCR (MiniCPM-V 4.6)
### Initialize model
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "openbmb/MiniCPM-V-4.6"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).eval().cuda()
```
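The processor consumes OpenAI-style chat messages whose `content` is a list of image and text parts. A minimal sketch of a helper that builds such a message (the helper name is illustrative, not part of the library):

```python
def build_ocr_message(image, question="What is the text in the picture?"):
    """Build a single-turn chat message mixing an image and a text prompt.

    `image` is typically a PIL.Image; the structure mirrors the
    `messages` list in the usage example.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]
```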
### Usage example
```python
ocr_image = Image.open("./assets/ocr.png").convert("RGB")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": ocr_image},
        {"type": "text", "text": "What is the text in the picture?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
answer = processor.decode(
    out_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(answer)
```
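The slice `out_ids[0][inputs["input_ids"].shape[-1]:]` works because `generate` returns the prompt tokens followed by the newly generated ones; dropping the first `input_ids.shape[-1]` positions leaves only the model's answer. The same idea on plain Python lists (the token ids below are made up for illustration):

```python
# Illustrative token ids, not real tokenizer output.
prompt_ids = [151644, 872, 198]               # stands in for inputs["input_ids"][0]
full_output = prompt_ids + [9906, 0, 151645]  # stands in for out_ids[0]

# Drop the prompt prefix, keeping only the newly generated tokens.
new_tokens = full_output[len(prompt_ids):]
# new_tokens == [9906, 0, 151645]
```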
### Sample image

For Chinese-heavy or dense text pages, the Thinking checkpoint (`openbmb/MiniCPM-V-4.6-Thinking`) often produces more faithful transcriptions because it inspects the layout before answering. The Instruct checkpoint is faster and usually sufficient for short, well-cropped text.
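If a pipeline routes between the two checkpoints depending on the input, a small selector keeps that choice in one place (a sketch; how to detect a "dense" page is left to the caller):

```python
def pick_checkpoint(dense_or_chinese_heavy: bool) -> str:
    """Return the checkpoint id to load: the Thinking variant for
    dense or Chinese-heavy pages, the base model otherwise."""
    base = "openbmb/MiniCPM-V-4.6"
    return f"{base}-Thinking" if dense_or_chinese_heavy else base
```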