Multi-Modal Prompting: Working with Images, Code, and Text in GPT-4 and Claude
The Multi-Modal Shift
Modern LLMs understand more than text. GPT-4V, Claude 3.5, and Gemini Pro Vision can process images, diagrams, screenshots, and documents alongside text instructions. This fundamentally changes what prompts can do — you're no longer limited to describing things in words; you can show the model what you mean.
Multi-modal prompting isn't just "upload an image and ask a question." The real power comes from combining modalities strategically to solve problems that neither text nor vision alone can handle.
Vision + Text: Core Patterns
Pattern 1: Screenshot-to-Code
Convert UI screenshots or mockups directly into working code:
System: You are a senior frontend developer. Convert the provided UI screenshot
into clean, semantic HTML and CSS.
Rules:
1. Use modern CSS (flexbox/grid, custom properties, clamp())
2. Match colours, spacing, and typography as closely as possible
3. Make it responsive (mobile-first)
4. Use semantic HTML5 elements
5. Add appropriate ARIA attributes for accessibility
[Attach: screenshot.png]
Generate the complete HTML and CSS for this design.
This pattern works best when you provide specific constraints about the tech stack and coding style. Without constraints, models tend to produce generic Bootstrap-style code.
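In practice, screenshot-to-code means sending the image and system prompt together as structured content parts. The sketch below builds a request body in the OpenAI chat-completions vision shape; the model name, prompt wording, and function name are illustrative assumptions, not a fixed API contract.

```typescript
// Sketch: build a vision request body in the OpenAI chat-completions shape.
// Model name and prompt wording are assumptions; adapt to your stack.
const SYSTEM_PROMPT =
  "You are a senior frontend developer. Convert the provided UI screenshot " +
  "into clean, semantic, responsive HTML and CSS with ARIA attributes.";

function buildScreenshotToCodePrompt(pngBase64: string) {
  return {
    model: "gpt-4o", // any vision-capable model
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      {
        role: "user",
        // Vision requests use an array of content parts, not a plain string
        content: [
          { type: "text", text: "Generate the complete HTML and CSS for this design." },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${pngBase64}`, detail: "high" },
          },
        ],
      },
    ],
  };
}

const body = buildScreenshotToCodePrompt("iVBORw0KGgo=");
```

Note the data URL: most APIs accept either a base64-encoded image inline or a hosted image URL.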
Pattern 2: Diagram Analysis
Extract structured information from diagrams, flowcharts, and architecture drawings:
System: Analyse the attached architecture diagram and extract the following:
1. List all services/components shown
2. Map the data flow between components (source → destination)
3. Identify potential single points of failure
4. List all external dependencies (databases, APIs, third-party services)
5. Output the architecture as a Mermaid diagram in code
[Attach: architecture-diagram.png]
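Because step 5 asks for a Mermaid diagram inside a code fence, the response needs light post-processing before you can render or save it. A minimal sketch, assuming the model fenced its output as instructed:

```typescript
// Sketch: pull the Mermaid code block out of the model's reply.
// Returns null if the model ignored the fencing instruction.
function extractMermaid(response: string): string | null {
  const match = response.match(/```mermaid\s*([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

const reply = "Here is the architecture:\n```mermaid\ngraph LR\n  A[API] --> B[(DB)]\n```\nDone.";
const diagram = extractMermaid(reply);
```

Returning null instead of throwing lets you retry the prompt when the model skips the fence.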
Pattern 3: Document Processing
Extract data from forms, receipts, invoices, and scanned documents:
System: Extract all data from this invoice image into a structured JSON format.
Required fields:
- invoice_number, date, due_date
- vendor (name, address, tax_id)
- line_items (description, quantity, unit_price, total)
- subtotal, tax_amount, tax_rate, total
- payment_terms
If a field is not visible or unclear, set it to null.
[Attach: invoice.pdf/png]
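Downstream code should not trust the model to emit every field. One way to honour the "set it to null" rule is to normalise the parsed JSON against the required field list; a sketch, with the field list mirroring the prompt above:

```typescript
// Sketch: normalise the model's invoice JSON so downstream code can rely on
// every required field being present, with null for anything the model omitted.
const REQUIRED_FIELDS = [
  "invoice_number", "date", "due_date", "vendor", "line_items",
  "subtotal", "tax_amount", "tax_rate", "total", "payment_terms",
];

function normaliseInvoice(raw: Record<string, unknown>): Record<string, unknown> {
  const invoice: Record<string, unknown> = {};
  for (const field of REQUIRED_FIELDS) {
    invoice[field] = raw[field] ?? null; // missing or undefined -> null
  }
  return invoice;
}

const parsed = normaliseInvoice({ invoice_number: "INV-042", total: 129.5 });
```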
Image + Code: Advanced Patterns
Visual Regression Testing
Use vision models to compare UI states and identify visual regressions:
System: You are a QA engineer performing visual regression testing.
I'm providing two screenshots:
1. BASELINE: The expected/approved UI state
2. CURRENT: The current build's UI state
Compare these images and report:
1. Any visual differences (layout shifts, colour changes, missing elements, font changes)
2. Severity of each difference (BREAKING | MINOR | COSMETIC)
3. Likely CSS property that changed
4. Whether this looks intentional or accidental
[Attach: baseline.png, current.png]
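When sending two images, interleave text labels with the image parts so the model can tell which screenshot is which. A sketch of the user message, using the OpenAI-style content-part shape (function and label names are illustrative):

```typescript
// Sketch: interleave text labels with the two screenshots so the model knows
// which image is the baseline and which is the current build.
function buildRegressionMessage(baselineB64: string, currentB64: string) {
  const image = (b64: string) => ({
    type: "image_url",
    image_url: { url: `data:image/png;base64,${b64}` },
  });
  return {
    role: "user",
    content: [
      { type: "text", text: "BASELINE (expected/approved UI state):" },
      image(baselineB64),
      { type: "text", text: "CURRENT (this build's UI state):" },
      image(currentB64),
      { type: "text", text: "Compare the images and report each difference with a severity." },
    ],
  };
}

const message = buildRegressionMessage("AAAA", "BBBB");
```

Labelling each image immediately before it appears is more reliable than describing the order once at the top.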
Debug from Screenshots
When a user reports a bug with a screenshot, combine the visual with code context:
System: A user reported the following bug with the attached screenshot.
Bug report: "{user_description}"
Here is the relevant component code:
```tsx
{component_code}
```
Analyse the screenshot and code together:
1. What is the visible problem in the screenshot?
2. What part of the code is likely causing it?
3. Provide the fix
[Attach: bug-screenshot.png]
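Wiring this up means putting the bug report and component source into the text part and attaching the screenshot alongside it. A sketch with illustrative names, following the same content-part shape as above:

```typescript
// Sketch: pair the user's screenshot with the component source in one turn
// so the model can reason over both together.
function buildDebugMessage(userDescription: string, componentCode: string, screenshotB64: string) {
  return {
    role: "user",
    content: [
      {
        type: "text",
        text: `Bug report: "${userDescription}"\n\nRelevant component code:\n\n${componentCode}\n\nAnalyse the screenshot and code together and propose a fix.`,
      },
      {
        type: "image_url",
        image_url: { url: `data:image/png;base64,${screenshotB64}` },
      },
    ],
  };
}

const debugMsg = buildDebugMessage(
  "Button overlaps footer on mobile",
  "export const Btn = () => <button/>;",
  "CCCC",
);
```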
Multi-Image Prompting
Sending multiple images in a single prompt enables comparison, sequence analysis, and richer context:
- Before/After comparisons — Show two states and ask for a diff
- Multi-page documents — Process entire PDF-like documents page by page
- Design systems — Show multiple component examples to establish a pattern, then generate new components
- Sequential UI flows — Show a user journey across screens and identify UX issues
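For the multi-page document case, requests usually need to respect a per-request image cap, so long documents are processed in batches. A sketch; the limit of 20 is an assumption, so check your model's documented maximum:

```typescript
// Sketch: split a long document's page images into batches so each request
// stays under a per-request image limit (20 here is an assumed cap).
function batchPages<T>(pages: T[], maxPerRequest: number = 20): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < pages.length; i += maxPerRequest) {
    batches.push(pages.slice(i, i + maxPerRequest));
  }
  return batches;
}

const batches = batchPages(Array.from({ length: 45 }, (_, i) => `page-${i + 1}.png`));
```

For extraction tasks, each batch's results can then be merged; include page numbers in the prompt so the merged output stays ordered.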
Model Comparison for Vision Tasks
| Task | Best Model | Notes |
|---|---|---|
| Screenshot-to-code | Claude 3.5 Sonnet | Best at matching visual details; cleaner code output |
| Document extraction | GPT-4 Vision | Strong OCR; handles messy handwriting better |
| Diagram analysis | Gemini Pro Vision | Good spatial reasoning; handles complex diagrams well |
| Visual regression | Claude 3.5 Sonnet | Most reliable at spotting subtle pixel differences |
| Chart/graph reading | GPT-4 Vision | Best at extracting numerical data from charts |
Cost Optimisation for Vision
Vision tokens are expensive. Key optimisation strategies:
- Resize images — Most models scale images internally. Sending a 4K screenshot wastes tokens. Resize to 1024px or 768px width
- Crop to region of interest — If you only care about a button, don't send the full page
- Use low-detail mode — GPT-4V supports detail: "low" for tasks that don't need pixel-level precision
- Cache results — If the same image is analysed repeatedly, cache the extracted data
- Text over images when possible — If you can describe something in text, that's cheaper than an image
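The resize advice can be applied before encoding. The sketch below computes target dimensions while preserving aspect ratio; the actual resampling would be done by an image library (for example sharp) before base64-encoding, and the 1024px default matches the guidance above:

```typescript
// Sketch: compute downscaled dimensions before uploading, preserving the
// aspect ratio. Images already within the limit are left untouched.
function fitToWidth(width: number, height: number, maxWidth: number = 1024) {
  if (width <= maxWidth) return { width, height }; // already small enough
  const scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

const resized = fitToWidth(3840, 2160); // a 4K screenshot
```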
How AI Prompt Architect Helps
While AI Prompt Architect currently focuses on text-based prompt engineering, the prompt patterns generated by our Generate workflow are designed to be modality-aware. When you specify a vision-related task, the system scaffolds your prompt with appropriate image analysis instructions, output format specifications, and multi-modal best practices — saving you from the trial-and-error of crafting vision prompts from scratch.
