OCR Overview

This page explains how to use the /v1/chat/completions route to perform OCR (text recognition), image description, or more generally visual analysis from images (PNG, JPG, etc.).

This approach relies on Vision / OCR models, including advanced capabilities known as DeepSearch OCR.

Typical use cases include:

Document OCR (invoices, contracts, scans)

Reading printed or handwritten text

Document conversion to Markdown

Search and localization of elements in an image

Visual document analysis for an AI / RAG pipeline

General principle

The image is sent:
- either via a public URL
- or encoded in base64 (Data URL)
The user message combines:
- a text instruction
- one or more images
Depending on the prompt used, the model can operate in:
- simple OCR
- structured document OCR (Markdown)
- OCR with localization (DeepSearch / grounding)
- descriptive vision

Advanced OCR modes (DeepSearch OCR)

The Clovis gateway supports advanced OCR prompts, referred to here as DeepSearch OCR, enabling additional model capabilities.

These modes rely on the use of special tokens directly in the prompt.

Supported special tokens

<|grounding|>

Enables grounding mode (visual anchoring). When present:

the model attempts to link its response to the visual structure of the document
it favors outputs that are more faithful to the layout
it is used for:
- structured OCR
- Markdown conversion
- element localization

<|ref|> ... <|/ref|>

Allows you to explicitly specify text to search within the image. Mainly used with <|grounding|> for targeted localization use cases.

Predefined OCR modes

Markdown – structured document OCR

<image>\n<|grounding|>Convert the document to markdown.

Text extraction
Structure reconstruction (titles, lists, tables)
Ideal for scanned documents

Free OCR – raw OCR

<image>\nFree OCR.

Raw text extraction
No structuring
No localization

Locate – targeted search

<image>\n<|grounding|>Locate <|ref|>text<|/ref|> in the image.

Targeted search for a word or expression
Useful for:
- UI highlighting
- presence validation
- document interaction

Describe – descriptive vision

<image>\nDescribe this image in detail.

Global image description
Not strictly OCR-focused

Custom – free prompt

<image>\n[Custom prompt]

Fully customizable prompt
Can optionally use:
- <|grounding|>
- <|ref|>

Compatibility

⚠️ Tokens <|grounding|> and <|ref|> are model-specific.

Quick summary

Mode	Grounding	Usage
Markdown	✔️	Structured document OCR
Free OCR	❌	Raw text
Locate	✔️	Localization
Describe	❌	Vision
Custom	optional	Advanced use cases

General principle​

Advanced OCR modes (DeepSearch OCR)​

Supported special tokens​

Predefined OCR modes​

Markdown – structured document OCR​

Free OCR – raw OCR​

Locate – targeted search​

Describe – descriptive vision​

Custom – free prompt​

Compatibility​

Quick summary​