Multi-Modal Prompting: Working with Images, Code, and Text in GPT-4 and Claude
The Multi-Modal Shift
Modern LLMs understand more than text. GPT-4V, Claude 3.5, and Gemini Pro Vision can process images, diagrams, screenshots, and documents alongside text instructions. This fundamentally changes what prompts can do — you're no longer limited to describing things in words; you can show the model what you mean.
Multi-modal prompting isn't just "upload an image and ask a question." The real power comes from combining modalities strategically to solve problems that neither text nor vision alone can handle.
Vision + Text: Core Patterns
Pattern 1: Screenshot-to-Code
Convert UI screenshots or mockups directly into working code:
System: You are a senior frontend developer. Convert the provided UI screenshot
into clean, semantic HTML and CSS.
Rules:
1. Use modern CSS (flexbox/grid, custom properties, clamp())
2. Match colours, spacing, and typography as closely as possible
3. Make it responsive (mobile-first)
4. Use semantic HTML5 elements
5. Add appropriate ARIA attributes for accessibility
[Attach: screenshot.png]
Generate the complete HTML and CSS for this design.
This pattern works best when you provide specific constraints about the tech stack and coding style. Without constraints, models tend to produce generic Bootstrap-style code.
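In practice, screenshot-to-code means sending the image and system prompt together as structured content parts. The sketch below builds a request body in the OpenAI chat-completions vision shape; the model name, prompt wording, and function name are illustrative assumptions, not a fixed API contract.

```typescript
// Sketch: build a vision request body in the OpenAI chat-completions shape.
// Model name and prompt wording are assumptions; adapt to your stack.
const SYSTEM_PROMPT =
  "You are a senior frontend developer. Convert the provided UI screenshot " +
  "into clean, semantic, responsive HTML and CSS with ARIA attributes.";

function buildScreenshotToCodePrompt(pngBase64: string) {
  return {
    model: "gpt-4o", // any vision-capable model
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      {
        role: "user",
        // Vision requests use an array of content parts, not a plain string
        content: [
          { type: "text", text: "Generate the complete HTML and CSS for this design." },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${pngBase64}`, detail: "high" },
          },
        ],
      },
    ],
  };
}

const body = buildScreenshotToCodePrompt("iVBORw0KGgo=");
```

Note the data URL: most APIs accept either a base64-encoded image inline or a hosted image URL.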
Pattern 2: Diagram Analysis
Extract structured information from diagrams, flowcharts, and architecture drawings:
System: Analyse the attached architecture diagram and extract the following:
1. List all services/components shown
2. Map the data flow between components (source → destination)
3. Identify potential single points of failure
4. List all external dependencies (databases, APIs, third-party services)
5. Output the architecture as a Mermaid diagram in code
[Attach: architecture-diagram.png]
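Because step 5 asks for a Mermaid diagram inside a code fence, the response needs light post-processing before you can render or save it. A minimal sketch, assuming the model fenced its output as instructed:

```typescript
// Sketch: pull the Mermaid code block out of the model's reply.
// Returns null if the model ignored the fencing instruction.
function extractMermaid(response: string): string | null {
  const match = response.match(/```mermaid\s*([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

const reply = "Here is the architecture:\n```mermaid\ngraph LR\n  A[API] --> B[(DB)]\n```\nDone.";
const diagram = extractMermaid(reply);
```

Returning null instead of throwing lets you retry the prompt when the model skips the fence.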
Pattern 3: Document Processing
Extract data from forms, receipts, invoices, and scanned documents:
System: Extract all data from this invoice image into a structured JSON format.
Required fields:
- invoice_number, date, due_date
- vendor (name, address, tax_id)
- line_items (description, quantity, unit_price, total)
- subtotal, tax_amount, tax_rate, total
- payment_terms
If a field is not visible or unclear, set it to null.
[Attach: invoice.pdf/png]
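Downstream code should not trust the model to emit every field. One way to honour the "set it to null" rule is to normalise the parsed JSON against the required field list; a sketch, with the field list mirroring the prompt above:

```typescript
// Sketch: normalise the model's invoice JSON so downstream code can rely on
// every required field being present, with null for anything the model omitted.
const REQUIRED_FIELDS = [
  "invoice_number", "date", "due_date", "vendor", "line_items",
  "subtotal", "tax_amount", "tax_rate", "total", "payment_terms",
];

function normaliseInvoice(raw: Record<string, unknown>): Record<string, unknown> {
  const invoice: Record<string, unknown> = {};
  for (const field of REQUIRED_FIELDS) {
    invoice[field] = raw[field] ?? null; // missing or undefined -> null
  }
  return invoice;
}

const parsed = normaliseInvoice({ invoice_number: "INV-042", total: 129.5 });
```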
Image + Code: Advanced Patterns
Visual Regression Testing
Use vision models to compare UI states and identify visual regressions:
System: You are a QA engineer performing visual regression testing.
I'm providing two screenshots:
1. BASELINE: The expected/approved UI state
2. CURRENT: The current build's UI state
Compare these images and report:
1. Any visual differences (layout shifts, colour changes, missing elements, font changes)
2. Severity of each difference (BREAKING | MINOR | COSMETIC)
3. Likely CSS property that changed
4. Whether this looks intentional or accidental
[Attach: baseline.png, current.png]
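When sending two images, interleave text labels with the image parts so the model can tell which screenshot is which. A sketch of the user message, using the OpenAI-style content-part shape (function and label names are illustrative):

```typescript
// Sketch: interleave text labels with the two screenshots so the model knows
// which image is the baseline and which is the current build.
function buildRegressionMessage(baselineB64: string, currentB64: string) {
  const image = (b64: string) => ({
    type: "image_url",
    image_url: { url: `data:image/png;base64,${b64}` },
  });
  return {
    role: "user",
    content: [
      { type: "text", text: "BASELINE (expected/approved UI state):" },
      image(baselineB64),
      { type: "text", text: "CURRENT (this build's UI state):" },
      image(currentB64),
      { type: "text", text: "Compare the images and report each difference with a severity." },
    ],
  };
}

const message = buildRegressionMessage("AAAA", "BBBB");
```

Labelling each image immediately before it appears is more reliable than describing the order once at the top.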
Debug from Screenshots
When a user reports a bug with a screenshot, combine the visual with code context:
System: A user reported the following bug with the attached screenshot.
Bug report: "{user_description}"
Here is the relevant component code:
```tsx
{component_code}
```
Analyse the screenshot and code together:
1. What is the visible problem in the screenshot?
2. What part of the code is likely causing it?
3. Provide the fix
[Attach: bug-screenshot.png]
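Wiring this up means putting the bug report and component source into the text part and attaching the screenshot alongside it. A sketch with illustrative names, following the same content-part shape as above:

```typescript
// Sketch: pair the user's screenshot with the component source in one turn
// so the model can reason over both together.
function buildDebugMessage(userDescription: string, componentCode: string, screenshotB64: string) {
  return {
    role: "user",
    content: [
      {
        type: "text",
        text: `Bug report: "${userDescription}"\n\nRelevant component code:\n\n${componentCode}\n\nAnalyse the screenshot and code together and propose a fix.`,
      },
      {
        type: "image_url",
        image_url: { url: `data:image/png;base64,${screenshotB64}` },
      },
    ],
  };
}

const debugMsg = buildDebugMessage(
  "Button overlaps footer on mobile",
  "export const Btn = () => <button/>;",
  "CCCC",
);
```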
Multi-Image Prompting
Sending multiple images in a single prompt enables comparison, sequence analysis, and richer context:
- Before/After comparisons — Show two states and ask for a diff
- Multi-page documents — Process entire PDF-like documents page by page
- Design systems — Show multiple component examples to establish a pattern, then generate new components
- Sequential UI flows — Show a user journey across screens and identify UX issues
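For the multi-page document case, requests usually need to respect a per-request image cap, so long documents are processed in batches. A sketch; the limit of 20 is an assumption, so check your model's documented maximum:

```typescript
// Sketch: split a long document's page images into batches so each request
// stays under a per-request image limit (20 here is an assumed cap).
function batchPages<T>(pages: T[], maxPerRequest: number = 20): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < pages.length; i += maxPerRequest) {
    batches.push(pages.slice(i, i + maxPerRequest));
  }
  return batches;
}

const batches = batchPages(Array.from({ length: 45 }, (_, i) => `page-${i + 1}.png`));
```

For extraction tasks, each batch's results can then be merged; include page numbers in the prompt so the merged output stays ordered.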
Model Comparison for Vision Tasks
| Task | Best Model | Notes |
|---|---|---|
| Screenshot-to-code | Claude 3.5 Sonnet | Best at matching visual details; cleaner code output |
| Document extraction | GPT-4 Vision | Strong OCR; handles messy handwriting better |
| Diagram analysis | Gemini Pro Vision | Good spatial reasoning; handles complex diagrams well |
| Visual regression | Claude 3.5 Sonnet | Most reliable at spotting subtle pixel differences |
| Chart/graph reading | GPT-4 Vision | Best at extracting numerical data from charts |
Cost Optimisation for Vision
Vision tokens are expensive. Key optimisation strategies:
- Resize images — Most models scale images internally. Sending a 4K screenshot wastes tokens. Resize to 1024px or 768px width
- Crop to region of interest — If you only care about a button, don't send the full page
- Use low-detail mode — GPT-4V supports detail: "low" for tasks that don't need pixel-level precision
- Cache results — If the same image is analysed repeatedly, cache the extracted data
- Text over images when possible — If you can describe something in text, that's cheaper than an image
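The resize advice can be applied before encoding. The sketch below computes target dimensions while preserving aspect ratio; the actual resampling would be done by an image library (for example sharp) before base64-encoding, and the 1024px default matches the guidance above:

```typescript
// Sketch: compute downscaled dimensions before uploading, preserving the
// aspect ratio. Images already within the limit are left untouched.
function fitToWidth(width: number, height: number, maxWidth: number = 1024) {
  if (width <= maxWidth) return { width, height }; // already small enough
  const scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

const resized = fitToWidth(3840, 2160); // a 4K screenshot
```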
How AI Prompt Architect Helps
While AI Prompt Architect currently focuses on text-based prompt engineering, the prompt patterns generated by our Generate workflow are designed to be modality-aware. When you specify a vision-related task, the system scaffolds your prompt with appropriate image analysis instructions, output format specifications, and multi-modal best practices — saving you from the trial-and-error of crafting vision prompts from scratch.
