Multimodal

Send images alongside text for vision-capable models.

Overview

Multimodal models accept both text and image inputs in a single request, enabling image analysis, document understanding, chart reading, and visual Q&A.

Models with image input include GPT-4o, Claude Sonnet/Opus, and Gemini.

Sending images

Pass images as part of the messages array using the image_url content type:

TypeScript
```typescript
const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/photo.jpg' },
        },
      ],
    },
  ],
});
```

Base64 images

Send images as base64-encoded data URIs when the image isn't publicly accessible:

TypeScript
```typescript
import fs from 'node:fs';

const base64Image = fs.readFileSync('photo.jpg', 'base64');

const completion = await client.chat.completions.create({
  model: 'claude-sonnet-4',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image' },
        {
          type: 'image_url',
          image_url: {
            url: `data:image/jpeg;base64,${base64Image}`,
          },
        },
      ],
    },
  ],
});
```

Supported formats: JPEG, PNG, GIF, and WebP. The maximum image size varies by model (typically around 20 MB).
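The format and size checks above can be wrapped in a small helper that infers the MIME type from the file extension and builds the data URI. This is a sketch, not part of any SDK: the `imageToDataUri` name is hypothetical, and the 20 MB cap is the typical limit mentioned above, so check your model's documentation for the exact value.

```typescript
import fs from 'node:fs';
import path from 'node:path';

// Extension-to-MIME mapping for the formats listed above.
const MIME_TYPES: Record<string, string> = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.gif': 'image/gif',
  '.webp': 'image/webp',
};

// Assumed per-image cap; actual limits vary by model.
const MAX_BYTES = 20 * 1024 * 1024;

function imageToDataUri(filePath: string): string {
  const mime = MIME_TYPES[path.extname(filePath).toLowerCase()];
  if (!mime) {
    throw new Error(`Unsupported image format: ${filePath}`);
  }
  const buffer = fs.readFileSync(filePath);
  if (buffer.byteLength > MAX_BYTES) {
    throw new Error(`Image exceeds ${MAX_BYTES} bytes: ${filePath}`);
  }
  return `data:${mime};base64,${buffer.toString('base64')}`;
}
```

The returned string can be passed directly as the `url` value in an `image_url` content part, as in the base64 example above.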

Supported models

Check a model's inputModalities field to see whether it supports image input. Models whose input modalities include "image" accept multimodal requests.

Filter for multimodal models on the Models page using the "Input Modalities" filter.
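The same check can be done programmatically. As a sketch: the `inputModalities` field name comes from the text above, but the model-list shape and `supportsImageInput` helper here are hypothetical examples, not the actual API response.

```typescript
// Minimal model record with only the field this check needs.
interface ModelInfo {
  id: string;
  inputModalities: string[];
}

// Keep only models whose input modalities include "image".
function supportsImageInput(models: ModelInfo[]): ModelInfo[] {
  return models.filter((m) => m.inputModalities.includes('image'));
}

// Example data (hypothetical entries, not a live model list).
const models: ModelInfo[] = [
  { id: 'gpt-4o', inputModalities: ['text', 'image'] },
  { id: 'text-only-model', inputModalities: ['text'] },
];

const visionModels = supportsImageInput(models);
console.log(visionModels.map((m) => m.id)); // only the image-capable model remains
```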