OpenAI Codex Prompting Guide: Master the Cloud-Native Coding Agent in 2026
Structured prompting strategies, STCO templates, and cross-agent benchmarks from 100,000+ prompts tested on AI Prompt Architect.
What Is OpenAI Codex and Why It Matters in 2026
From GitHub Copilot Autocomplete to Autonomous Cloud Agent
The evolution of AI-assisted coding has been remarkably swift. GitHub Copilot launched in 2021 as a sophisticated autocomplete engine. By 2023, ChatGPT's Code Interpreter proved that LLMs could execute and debug code in real time. Then 2024 brought Cursor and Aider — tools that embedded AI directly into the development workflow. Now, in 2025–2026, Claude Code, Codex CLI, and Devin represent a paradigm shift: AI coding agents that autonomously plan, execute, test, and commit changes across entire codebases.
“Our platform has tracked this evolution across 100,000+ prompts, and the data is unambiguous: structured agent prompts outperform autocomplete-era mega-prompts by 3.8x.” — ExO Intelligence Council, AI Prompt Architect
This shift from autocomplete to autonomy is what developers now call vibe coding — describing intent rather than dictating implementation. But the quality of that description matters enormously. Understanding prompt engineering fundamentals is the prerequisite; mastering agent-specific prompting is the next frontier.
Codex CLI vs ChatGPT Codex — Understanding the Two Interfaces
OpenAI offers Codex through two distinct interfaces, and confusing them is a common mistake. Codex CLI is the terminal-native, open-source agent installed via npm. It runs locally, accepts any API key (OpenAI, Anthropic, Google, or local models via Ollama), and gives you full control over execution. ChatGPT Codex is the integrated web experience within ChatGPT Plus and Pro subscriptions — a simpler, no-setup interface where you describe a task in natural language and Codex spins up a cloud sandbox to execute it.
Critically, Codex CLI is released under the Apache 2.0 licence, which makes it the only open-source agent among the four compared in this guide. Cursor, Claude Code, and Devin are all closed source, so none of them offers the same level of transparency and extensibility. For a broader view of available tools, see our best prompt engineering tools roundup.
The codex-1 Model — Architecture and Capabilities
Under the hood, Codex CLI is powered by the codex-1 model — a variant specifically trained for software engineering tasks, not a general-purpose LLM adapted for code. Key specifications include:
- 128k token context window — sufficient for most repositories, though smaller than Claude Code's 200k
- Cloud-sandboxed execution — Codex takes a snapshot of your repository, uploads it to an isolated environment, and processes tasks asynchronously
- Network-disabled sandbox in full-auto mode — prevents accidental data exfiltration, a critical security feature for enterprise teams
- Reasoning traces — every pull request includes a detailed log of the agent's decision-making process
Getting Started — Setup, Installation, and Configuration
Installing Codex CLI
Getting Codex CLI running takes under two minutes. You need Node.js 18+ and an API key from any supported provider:
```bash
npm install -g @openai/codex
export OPENAI_API_KEY=your-key-here
codex --version
```
For teams preferring Anthropic or Google models, set the corresponding environment variable (ANTHROPIC_API_KEY or GOOGLE_API_KEY) and specify the provider via CLI flags. Local models work through Ollama with zero cloud dependency.
AGENTS.md — Your Repository's Operating Manual for Codex
Every AI coding agent needs project-level configuration. For Codex CLI, this lives in AGENTS.md — a markdown file at your repository root that defines conventions, constraints, and hard rules. Codex reads this automatically before starting any task.
AGENTS.md is hierarchical: place it at the repo root for global rules, in subdirectories for module-specific overrides, or in ~/.codex/AGENTS.md for personal defaults that apply across all projects. For a detailed comparison of how this relates to CLAUDE.md and .cursor/rules, see our agent configuration comparison.
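To make the hierarchy concrete, here is a minimal Python sketch that lists which AGENTS.md files would apply to a given directory. The function name `agents_files_for` is hypothetical, and the broadest-first ordering (personal defaults, then repo root, then each subdirectory) is an assumption for illustration, not a statement of how Codex merges the files internally.

```python
from pathlib import Path


def agents_files_for(repo_root: Path, target_dir: Path, home: Path) -> list[Path]:
    """Return the AGENTS.md files that apply to target_dir, broadest first.

    Assumed order: personal defaults, repo root, then every directory on
    the way down to target_dir, so later entries can override earlier ones.
    """
    repo_root = repo_root.resolve()
    target_dir = target_dir.resolve()

    candidates = [home / ".codex" / "AGENTS.md", repo_root / "AGENTS.md"]

    # Walk from the repo root down to the target directory.
    current = repo_root
    for part in target_dir.relative_to(repo_root).parts:
        current = current / part
        candidates.append(current / "AGENTS.md")

    # Keep only the files that actually exist on disk.
    return [path for path in candidates if path.is_file()]
```

Calling `agents_files_for(Path("."), Path("packages/api"), Path.home())` from a repo root would list the files in the order a reader should expect overrides to stack, which is a quick way to debug a rule that seems to be ignored.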
Here is a production snippet from our own AGENTS.md at AI Prompt Architect:
```markdown
### ⛔ Hard Rules
- firebase deploy --only functions is BANNED.
  It deploys 900+ functions and exhausts the CPU quota.
- Use pnpm --filter functions build prior to any deployment.
- When Firebase CLI fails, use gcloud directly.
- Never add workspace:* to functions/package.json.
  Cloud Build uses npm and cannot resolve pnpm workspace protocols.
```
“We maintain AGENTS.md files across 14 internal repositories. Our data shows that agents operating with a well-structured AGENTS.md complete tasks 2.1x faster than those without project-level configuration.”
The Three Approval Modes: Suggest, Auto-Edit, and Full-Auto
Codex CLI offers three distinct approval modes, each balancing safety against productivity:
| Mode | Behaviour | Best For |
|---|---|---|
| Suggest | Read-only preview; asks before every file change and command | Learning Codex, sensitive repositories, first-time setup |
| Auto-edit | Modifies files autonomously; asks before terminal commands | Daily development on trusted repositories |
| Full-auto | Complete autonomy with network-disabled sandbox | Test generation, documentation, batch processing |
Our recommendation: start with suggest mode for the first week to understand how Codex reasons about your codebase. Graduate to auto-edit for trusted repositories where you have comprehensive test coverage. Reserve full-auto for sandboxed tasks like test generation and documentation — tasks where the network-disabled sandbox provides an additional layer of safety.
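The three modes reduce to a small decision table: which kinds of action pause for a human. The Python sketch below models only that table. The names `Mode`, `Action`, and `needs_approval` are hypothetical and are not part of the Codex CLI.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class Action(Enum):
    FILE_EDIT = "file_edit"
    SHELL_COMMAND = "shell_command"


# Which actions pause for human approval in each mode, per the table above.
NEEDS_APPROVAL = {
    Mode.SUGGEST: {Action.FILE_EDIT, Action.SHELL_COMMAND},
    Mode.AUTO_EDIT: {Action.SHELL_COMMAND},
    Mode.FULL_AUTO: set(),
}


def needs_approval(mode: Mode, action: Action) -> bool:
    """Return True if the given action would wait for the user in this mode."""
    return action in NEEDS_APPROVAL[mode]
```

Read this way, moving from suggest to auto-edit removes one approval gate (file edits), and moving to full-auto removes the last one, which is why the sandbox matters most in that mode.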
Multi-Provider Support — Using Codex with Claude, Gemini, and Local Models
Multi-provider flexibility sets Codex CLI apart. Cursor is multi-model but tied to the IDE, Claude Code runs Claude models only, and Devin uses a proprietary model, so none of them offers the same combination of sandboxed execution and open model choice. You can run Codex with Anthropic's Claude for superior reasoning, Google's Gemini for cost efficiency, or a locally hosted model via Ollama for complete data sovereignty.
STCO Framework for Codex Task Definitions
The STCO framework (Situation, Task, Context, Output) is the single most effective methodology for structuring prompts across all AI coding agents. When applied to Codex specifically, it addresses the unique challenges of asynchronous, cloud-sandboxed execution.
Why Generic Prompts Fail with Async Agents
Our internal benchmarks show STCO-structured prompts achieve an 87% average success rate across all AI coding agents, compared to 23% for unstructured prompts — a 3.8x improvement. This gap is even more pronounced with async agents like Codex, because the agent cannot ask clarifying questions mid-execution. A vague prompt sent to Codex results in 15 minutes of wasted sandbox compute and an incorrect deliverable. A structured STCO prompt produces a reviewable PR on the first attempt.
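A rough way to see what the success-rate gap costs is to estimate the sandbox time lost to failed runs before the first good PR. The sketch below takes the 15-minute figure above as the cost of one failed run and assumes, purely for illustration, that every attempt succeeds independently with the quoted probability. Real retries are rarely independent, so treat the output as an order-of-magnitude estimate.

```python
def expected_wasted_minutes(success_rate: float, minutes_per_failed_run: float = 15.0) -> float:
    """Expected minutes spent on failed runs before the first success.

    Under the independence assumption, the number of failures before the
    first success has mean (1 - p) / p.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_failures = (1.0 - success_rate) / success_rate
    return expected_failures * minutes_per_failed_run


print(round(expected_wasted_minutes(0.23), 1))  # unstructured prompts
print(round(expected_wasted_minutes(0.87), 1))  # STCO-structured prompts
```

Under these assumptions the unstructured case loses roughly 50 minutes per task to failed runs, against a little over 2 minutes for the structured case.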
The Four STCO Elements Applied to Codex
| STCO | Element | Codex Application |
|---|---|---|
| S | Situation | Cloud sandbox with repo snapshot, AGENTS.md defines testing conventions and coverage thresholds |
| T | Task | Generate comprehensive unit tests for all 34 exported functions in src/utils/ |
| C | Context | Full repo snapshot in sandbox, 128k token window, existing test patterns in __tests__/ |
| O | Output | Pull request with test files, coverage summary showing 95%+ line coverage, reasoning trace |
Complete STCO Prompt Template for Codex
This template has been refined across 3,400+ Codex-specific prompts on our platform. Copy it and swap the project-specific details (stack, paths, function count, coverage target) for your own:
```markdown
# SITUATION
Node.js 20 monorepo, pnpm workspace, TypeScript strict mode.
AGENTS.md defines: Jest for testing, Prettier for formatting.
Existing test patterns in __tests__/ directory.

# TASK
Generate comprehensive unit tests for all 34 exported functions
in src/utils/.

# CONTEXT
- Follow existing test conventions in __tests__/ (Jest + Testing Library)
- Achieve 95%+ line coverage for each utility file
- Include edge cases: null inputs, empty arrays, boundary values
- Do NOT modify any source files in src/utils/

# OUTPUT
- New test files in __tests__/utils/
- Run the full test suite before submitting
- Create a PR with a descriptive title and coverage summary
```
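If you generate prompts programmatically, the same four-section layout can be produced from structured data. The following is a minimal Python sketch. `StcoPrompt` is a hypothetical helper written for this guide, not an API of Codex or of any library, and its rule of rejecting empty sections is a design assumption.

```python
from dataclasses import dataclass


@dataclass
class StcoPrompt:
    situation: list[str]
    task: str
    context: list[str]
    output: list[str]

    def render(self) -> str:
        """Render the four STCO sections in order, rejecting empty ones."""
        if not (self.situation and self.task.strip() and self.context and self.output):
            raise ValueError("every STCO section needs at least one entry")
        lines = ["# SITUATION", *self.situation, "# TASK", self.task, "# CONTEXT"]
        lines += [f"- {item}" for item in self.context]
        lines.append("# OUTPUT")
        lines += [f"- {item}" for item in self.output]
        return "\n".join(lines)
```

Failing fast on an empty section is deliberate: an async agent cannot ask what the missing acceptance criteria were, so the gap is cheaper to catch before the prompt is sent.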
For a deeper exploration of STCO across all major agents, read our STCO framework masterclass. To understand how context engineering differs from prompt engineering at a fundamental level, see our context engineering guide.
Side-by-Side — Codex vs Cursor vs Claude Code vs Devin
We have published dedicated guides for each of these agents — Cursor, Claude Code, and Devin — giving us a unique cross-agent perspective. For the full deep-dive comparison with benchmarks, see our Cursor vs Claude Code vs Codex vs Devin comparison.
Architecture Paradigms Compared
Each tool represents a fundamentally different architecture for AI-assisted development:
- IDE-embedded (Cursor): AI lives inside your editor. Lowest friction, tightest feedback loop. Best for inline edits and rapid iteration.
- Terminal-native (Claude Code): AI operates from your terminal with direct filesystem access and a 200k token context window. Maximum power for complex multi-file refactoring.
- Cloud-first async (Codex CLI): AI works in sandboxed cloud environments, processing tasks asynchronously. Ideal for fire-and-forget batch operations.
- Fully autonomous (Devin): AI runs in a complete cloud VM with browser, terminal, and editor. Researches documentation, writes code, and submits PRs independently.
Feature Comparison Matrix
| Feature | Cursor | Claude Code | Codex CLI | Devin |
|---|---|---|---|---|
| Context Window | ~120k tokens | 200k tokens | 128k tokens | Unlimited (VM) |
| Multi-file Refactors | ⚠️ Agent mode | ✅ | ✅ | ✅ |
| Git Integration | ⚠️ Basic | ✅ Native | ✅ Auto PR | ✅ Full |
| CI/CD Integration | ❌ | ✅ Headless | ✅ ChatGPT | ✅ |
| Open Source | ❌ | ❌ | ✅ Apache 2.0 | ❌ |
| Test Execution | ⚠️ Via terminal | ✅ Direct | ✅ Sandboxed | ✅ Full VM |
| Config File | .cursor/rules | CLAUDE.md | AGENTS.md | Knowledge base |
| MCP Support | ✅ | ✅ | ⚠️ Limited | ❌ |
| Multi-model | ✅ | ❌ Claude only | ✅ | ❌ Proprietary |
Pricing Comparison
| Tool | Free Tier | Pro | Enterprise | Usage Model |
|---|---|---|---|---|
| Cursor | Hobby (limited) | $20/mo | $40/user/mo | Request-based |
| Claude Code | ❌ | $20/mo | $200/mo (Max 20x) | Subscription tiers |
| Codex CLI | ✅ (BYOK) | $20/mo (Plus) | $200/mo (Pro) | Token-based |
| Devin | ❌ | $500/mo | Custom | ACU-based |
Disclosure: We have no affiliate relationship with any of these tools. All data comes from internal benchmarks across 100,000+ prompts on AI Prompt Architect.
When to Choose Codex Over Alternatives
Codex excels when you need fire-and-forget execution. Unlike Claude Code (which requires terminal interaction) or Cursor (which requires IDE presence), Codex works while you sleep. Our benchmarks show:
- 91% success rate on async batch tasks (test generation, documentation, dependency updates)
- 67% success rate on real-time interactive tasks (where Claude Code's 94% and Cursor's 89% are superior)
Choose Codex for: test generation at scale, automated PR creation, CI/CD pipeline integration, and batch documentation. Choose Claude Code for: complex multi-file reasoning and real-time refactoring. Choose Cursor for: inline edits and rapid iteration. Choose Devin for: fully autonomous greenfield projects.
The Multi-Tool Stack Strategy
Our recommended stack combines three of these tools, each for a different purpose: Cursor as the scratchpad for quick inline edits, Claude Code as the architect for deep reasoning and multi-file refactors, and Codex CLI as the workhorse for async batch operations. This combination covers 98% of development tasks at a combined cost of approximately £110 per month.
Advanced Codex Workflows and Orchestration
Async Batch Processing — Test Generation at Scale
The most compelling use case for Codex is batch test generation. Using Codex in full-auto mode, we generated comprehensive test suites for 34 utility functions in 18 minutes. The same task took a human engineer approximately 6 hours — a 20x acceleration.
The key is structuring each batch prompt as an atomic, verifiable work order. Each prompt targets a single file or module, includes specific coverage thresholds, and requires the agent to run the test suite before submitting. This prevents the cascading failures that occur when agents attempt to test an entire codebase in a single prompt.
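The one-prompt-per-file idea can be sketched as a small generator. In the Python below, `build_work_orders` is a hypothetical helper, and the default test command `pnpm test` is an assumption about the project rather than anything Codex requires.

```python
def build_work_orders(
    source_files: list[str],
    coverage_target: int = 95,
    test_command: str = "pnpm test",  # assumed project script
) -> list[str]:
    """Return one atomic, verifiable prompt per source file."""
    orders = []
    for path in source_files:
        orders.append(
            "\n".join(
                [
                    "# TASK",
                    f"Generate unit tests for {path} only.",
                    "# CONTEXT",
                    f"- Achieve {coverage_target}%+ line coverage for {path}",
                    f"- Do NOT modify {path}",
                    "# OUTPUT",
                    f"- Run `{test_command}` before submitting",
                    "- One PR containing only the new test file",
                ]
            )
        )
    return orders
```

Because each order names exactly one file and one threshold, a failed run invalidates a single PR instead of the whole batch.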
CI/CD Integration — Codex in GitHub Actions
Codex CLI's headless mode enables direct pipeline integration. Practical use cases include:
- Automated PR review: Trigger Codex to analyse incoming PRs for security vulnerabilities and code quality
- Dependency update PRs: Schedule weekly Codex runs that update dependencies, run tests, and submit PRs automatically
- Test coverage enforcement: When coverage drops below a threshold, Codex generates the missing tests
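For the coverage-enforcement case, the pipeline first has to decide which files need tests. The sketch below assumes a report shaped like Istanbul's `json-summary` output, which Jest can emit: a `total` entry plus one entry per file, each with a `lines.pct` value. Check the shape of your own report before relying on it. `files_below_threshold` is a hypothetical name.

```python
import json


def files_below_threshold(summary_json: str, threshold: float = 95.0) -> list[str]:
    """List files whose line coverage percentage is below the threshold."""
    summary = json.loads(summary_json)
    flagged = []
    for path, metrics in summary.items():
        if path == "total":
            continue  # aggregate row, not a file
        pct = metrics["lines"]["pct"]
        # Skip non-numeric values, which some reports use for empty files.
        if isinstance(pct, (int, float)) and pct < threshold:
            flagged.append(path)
    return sorted(flagged)
```

The resulting list can feed one work order per file, so the agent is only asked to cover what is actually missing.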
For strategies on reducing API costs in pipeline integrations, see our prompt caching optimisation guide.
Plan Mode vs Agent Mode — Staging Your Execution
A critical workflow optimisation is separating planning from execution:
- Plan mode: The agent analyses the STCO prompt, reads the repository, and writes a detailed markdown proposal of the changes it intends to make. You review this architectural blueprint before any code is written.
- Agent mode: Once the plan is approved, a secondary execution phase implements the precise code changes, runs tests, and commits the diff.
We enforce plan-then-execute as a hard rule in our AGENTS.md for any refactoring task touching more than 5 files. This prevents the costly scenario where an agent writes 2,000 lines of code based on a misunderstood requirement.
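The five-file rule is easy to express as a gate in whatever wrapper launches the agent. This is a minimal Python sketch of that policy only. `execution_stage` and the returned stage names are hypothetical and do not correspond to Codex CLI commands.

```python
PLAN_REQUIRED_ABOVE = 5  # "more than 5 files", per the rule above


def execution_stage(files_touched: set[str], plan_approved: bool) -> str:
    """Return "plan" if a reviewed plan is still required, else "execute"."""
    if len(files_touched) > PLAN_REQUIRED_ABOVE and not plan_approved:
        return "plan"
    return "execute"
```

A refactor touching exactly five files proceeds straight to execution, and a sixth file sends it back for a plan until someone approves one.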
Common Pitfalls and Anti-Patterns
The Sandbox Gotcha — Local State vs Cloud Snapshot
Codex operates on a snapshot of your repository, not the live filesystem. Any uncommitted changes, unstaged files, or local environment variables are invisible to the agent. This caught our team early: a developer made local changes, prompted Codex, and received a PR that conflicted with uncommitted work.
Rule: Always commit or stash your local changes before prompting Codex. The agent sees only what Git sees.
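This rule can be automated as a pre-flight check. The Python sketch below shells out to `git status --porcelain`, a standard Git command whose version 1 output puts a two-character status and a space before each path. The helper names are hypothetical.

```python
import subprocess


def parse_dirty_paths(porcelain_output: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = []
    for line in porcelain_output.splitlines():
        if line.strip():
            paths.append(line[3:])  # skip two status characters and a space
    return paths


def uncommitted_paths(repo_dir: str = ".") -> list[str]:
    """Return paths with uncommitted or untracked changes in repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_dirty_paths(result.stdout)
```

If `uncommitted_paths()` returns anything, commit or stash those files before sending the prompt, since the agent will not see them.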
Over-Constraining the Agent
A frequent anti-pattern is dictating exact implementation steps. When you tell Codex “open file X, go to line 45, write this exact code,” you stifle its ability to dynamically problem-solve. If a dependency version has changed or an API has been deprecated, an over-constrained prompt prevents the agent from finding a workaround.
The fix: define the desired outcome and hard constraints, but let the agent choose the implementation path. Declare that a function must return cached data within 50ms — do not explain how to write the Redis SET/GET logic.
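One way to state an outcome without prescribing the implementation is to hand the agent a check it must pass. The Python sketch below measures the 50ms budget from the example above. `meets_latency_budget` is a hypothetical helper, and the choice of 20 runs is arbitrary.

```python
import time


def meets_latency_budget(fn, budget_ms: float = 50.0, runs: int = 20) -> bool:
    """Return True if every call to fn() finishes within budget_ms."""
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms > budget_ms:
            return False
    return True


# Usage: any cache implementation passes as long as reads stay in budget.
cache = {"user:42": {"name": "Ada"}}
print(meets_latency_budget(lambda: cache.get("user:42")))
```

The check says nothing about Redis, key formats, or eviction, so the agent stays free to pick whichever approach satisfies it.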
Context Window Management in 128k Tokens
Codex's 128k token context window is generous but smaller than Claude Code's 200k, so be selective with context injection. Rather than dumping your entire repository into the prompt, reference only the files and interfaces relevant to the task. On a related tuning note, our platform data shows that setting temperature to 0.2 for code generation tasks increases formatting compliance by 40%.
For deeper strategies on maximising context efficiency, see our context engineering guide.
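Being selective is easier with a rough budget in hand. The Python sketch below estimates token counts with a four-characters-per-token heuristic, which is an assumption and not the real tokenizer. It then keeps files in priority order until the window is full. The 28,000-token reserve for the prompt and the model's output is also an illustrative assumption.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not the model's real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def select_within_budget(
    files: dict[str, str],
    budget_tokens: int = 128_000,
    reserve_tokens: int = 28_000,
) -> list[str]:
    """Greedily keep files, in the given priority order, that fit the budget."""
    remaining = budget_tokens - reserve_tokens
    chosen = []
    for path, content in files.items():
        cost = estimate_tokens(content)
        if cost <= remaining:
            chosen.append(path)
            remaining -= cost
    return chosen
```

Pass the files most relevant to the task first, because the greedy loop skips anything that no longer fits.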
Security Considerations for Cloud-Sandboxed Execution
Codex's full-auto mode runs with network access disabled. This is a deliberate security decision: the agent cannot make outbound HTTP requests, which prevents accidental data exfiltration and supply chain attacks. However, this also means the agent cannot fetch external documentation or install packages from registries during execution.
Never include production API keys, database credentials, or secrets in prompts. Use mock values for testing, and reference environment variable names rather than actual values in your AGENTS.md.
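A lightweight scan before sending a prompt catches the most obvious leaks. The patterns in this Python sketch are illustrative assumptions, not a complete secret detector, and `find_secrets` is a hypothetical helper. A dedicated scanning tool is a better fit for real pipelines.

```python
import re

# Illustrative patterns only; real secret formats vary widely.
SECRET_PATTERNS = {
    "openai-style key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "aws access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # An assignment such as password=..., unless the value looks like a
    # reference ($VAR) or a placeholder (<value>).
    "inline credential": re.compile(
        r"(?i)(password|secret|token|api[_-]?key)\s*[:=]\s*(?![$<])\S+"
    ),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of the patterns that match somewhere in text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

Naming an environment variable in prose passes the scan, while pasting its value does not, which matches the guidance above. Expect false positives on placeholder assignments such as `OPENAI_API_KEY=your-key-here`.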
Frequently Asked Questions
What is OpenAI Codex CLI and how does it differ from ChatGPT?
Codex CLI is OpenAI's open-source (Apache 2.0) terminal agent that executes coding tasks in cloud sandboxes. It's installed via npm and accepts API keys from multiple providers. ChatGPT Codex is the integrated web experience within ChatGPT Plus ($20/mo) and Pro ($200/mo) subscriptions — a simpler, no-setup interface. CLI offers BYOK (bring your own key) and multi-provider support; ChatGPT offers a more accessible interface for occasional use.
Is Codex CLI free to use?
The CLI itself is free and open source under the Apache 2.0 licence. You need an API key to power it — OpenAI charges approximately $2/$8 per million tokens (input/output). Alternatively, bring your own key from Anthropic, Google, or run local models via Ollama at zero cloud cost. ChatGPT Plus ($20/mo) and Pro ($200/mo) include the web-based Codex integration.
How does Codex compare to Claude Code for coding tasks?
Claude Code achieves a 94% success rate on multi-file refactoring with its 200k token context. Codex CLI hits 91% on async batch tasks with 128k tokens. Choose Claude Code for interactive, complex reasoning; choose Codex for fire-and-forget batch operations like test generation. For detailed benchmarks, see our full comparison.
What is AGENTS.md and how do I set it up for Codex?
AGENTS.md is a markdown file at your repository root that defines project conventions, constraints, and rules for AI agents. Codex reads it automatically before starting any task. It's hierarchical — place files at the repo root for global rules, in subdirectories for module-specific overrides, or at ~/.codex/AGENTS.md for personal defaults. See our configuration comparison for setup patterns.
Can Codex CLI work offline or with local models?
Codex CLI requires internet connectivity for its cloud sandbox execution environment. However, it supports local language models via Ollama, meaning the AI reasoning can happen entirely on your machine. The trade-off is losing the sandboxed execution environment — local execution carries the same risks as running any agent on your filesystem.
What are the three approval modes in Codex CLI?
Suggest mode is read-only and asks before every action (safest). Auto-edit mode modifies files autonomously but asks before terminal commands (balanced). Full-auto mode provides complete autonomy with a network-disabled sandbox (most productive). We recommend starting with suggest, graduating to auto-edit for trusted repositories, and reserving full-auto for sandboxed batch tasks.
How do I structure prompts for Codex using the STCO framework?
Use the four-element structure: Situation (cloud sandbox context and AGENTS.md conventions), Task (atomic, verifiable objective), Context (specific file references and constraints), Output (acceptance criteria and verification commands). Our benchmarks show this structure achieves 87% success versus 23% for unstructured prompts. See our STCO framework guide for ready-to-use templates.
OpenAI Codex Prompting Guide: Master the Cloud-Native Coding Agent in 2026
Structured prompting strategies, STCO templates, and cross-agent benchmarks from 100,000+ prompts tested on AI Prompt Architect.
What Is OpenAI Codex and Why It Matters in 2026
From GitHub Copilot Autocomplete to Autonomous Cloud Agent
The evolution of AI-assisted coding has been remarkably swift. GitHub Copilot launched in 2021 as a sophisticated autocomplete engine. By 2023, ChatGPT's Code Interpreter proved that LLMs could execute and debug code in real time. Then 2024 brought Cursor and Aider — tools that embedded AI directly into the development workflow. Now, in 2025–2026, Claude Code, Codex CLI, and Devin represent a paradigm shift: AI coding agents that autonomously plan, execute, test, and commit changes across entire codebases.
This shift from autocomplete to autonomy is what developers now call vibe coding — describing intent rather than dictating implementation. But the quality of that description matters enormously. Understanding prompt engineering fundamentals is the prerequisite; mastering agent-specific prompting is the next frontier.
Codex CLI vs ChatGPT Codex — Understanding the Two Interfaces
OpenAI offers Codex through two distinct interfaces, and confusing them is a common mistake. Codex CLI is the terminal-native, open-source agent installed via npm. It runs locally, accepts any API key (OpenAI, Anthropic, Google, or local models via Ollama), and gives you full control over execution. ChatGPT Codex is the integrated web experience within ChatGPT Plus and Pro subscriptions — a simpler, no-setup interface where you describe a task in natural language and Codex spins up a cloud sandbox to execute it.
Critically, Codex CLI is released under the Apache 2.0 licence — making it the only open-source cloud-native coding agent available today. Neither Cursor, Claude Code, nor Devin offers this level of transparency and extensibility. For a broader view of available tools, see our best prompt engineering tools roundup.
The codex-1 Model — Architecture and Capabilities
Under the hood, Codex CLI is powered by the codex-1 model — a variant specifically trained for software engineering tasks, not a general-purpose LLM adapted for code. Key specifications include:
- 128k token context window — sufficient for most repositories, though smaller than Claude Code's 200k
- Cloud-sandboxed execution — Codex takes a snapshot of your repository, uploads it to an isolated environment, and processes tasks asynchronously
- Network-disabled sandbox in full-auto mode — prevents accidental data exfiltration, a critical security feature for enterprise teams
- Reasoning traces — every pull request includes a detailed log of the agent's decision-making process
Getting Started — Setup, Installation, and Configuration
Installing Codex CLI
Getting Codex CLI running takes under two minutes. You need Node.js 18+ and an API key from any supported provider:
For teams preferring Anthropic or Google models, set the corresponding environment variable (ANTHROPIC_API_KEY or GOOGLE_API_KEY) and specify the provider via CLI flags. Local models work through Ollama with zero cloud dependency.
AGENTS.md — Your Repository's Operating Manual for Codex
Every AI coding agent needs project-level configuration. For Codex CLI, this lives in AGENTS.md — a markdown file at your repository root that defines conventions, constraints, and hard rules. Codex reads this automatically before starting any task.
AGENTS.md is hierarchical: place it at the repo root for global rules, in subdirectories for module-specific overrides, or in ~/.codex/AGENTS.md for personal defaults that apply across all projects. For a detailed comparison of how this relates to CLAUDE.md and .cursor/rules, see our agent configuration comparison.
Here is a production snippet from our own AGENTS.md at AI Prompt Architect:
The Three Approval Modes: Suggest, Auto-Edit, and Full-Auto
Codex CLI offers three distinct approval modes, each balancing safety against productivity:
| Mode | Behaviour | Best For |
|---|---|---|
| Suggest | Read-only preview; asks before every file change and command | Learning Codex, sensitive repositories, first-time setup |
| Auto-edit | Modifies files autonomously; asks before terminal commands | Daily development on trusted repositories |
| Full-auto | Complete autonomy with network-disabled sandbox | Test generation, documentation, batch processing |
Our recommendation: start with suggest mode for the first week to understand how Codex reasons about your codebase. Graduate to auto-edit for trusted repositories where you have comprehensive test coverage. Reserve full-auto for sandboxed tasks like test generation and documentation — tasks where the network-disabled sandbox provides an additional layer of safety.
Multi-Provider Support — Using Codex with Claude, Gemini, and Local Models
This multi-provider flexibility is unique to Codex CLI. Neither Cursor (multi-model but IDE-locked), Claude Code (Claude-only), nor Devin (proprietary model) offers the same combination of cloud-sandboxed execution and open model choice. You can run Codex with Anthropic's Claude for superior reasoning, Google's Gemini for cost efficiency, or a locally hosted model via Ollama for complete data sovereignty.
STCO Framework for Codex Task Definitions
The STCO framework (Situation, Task, Context, Output) is the single most effective methodology for structuring prompts across all AI coding agents. When applied to Codex specifically, it addresses the unique challenges of asynchronous, cloud-sandboxed execution.
Why Generic Prompts Fail with Async Agents
Our internal benchmarks show STCO-structured prompts achieve an 87% average success rate across all AI coding agents, compared to 23% for unstructured prompts — a 3.8x improvement. This gap is even more pronounced with async agents like Codex, because the agent cannot ask clarifying questions mid-execution. A vague prompt sent to Codex results in 15 minutes of wasted sandbox compute and an incorrect deliverable. A structured STCO prompt produces a reviewable PR on the first attempt.
The Four STCO Elements Applied to Codex
| STCO | Element | Codex Application |
|---|---|---|
| S | Situation | Cloud sandbox with repo snapshot, AGENTS.md defines testing conventions and coverage thresholds |
| T | Task | Generate comprehensive unit tests for all 34 exported functions in src/utils/ |
| C | Context | Full repo snapshot in sandbox, 128k token window, existing test patterns in __tests__/ |
| O | Output | Pull request with test files, coverage summary showing 95%+ line coverage, reasoning trace |
Complete STCO Prompt Template for Codex
This template has been refined across 3,400+ Codex-specific prompts on our platform. Copy it directly and adapt the placeholders to your project:
For a deeper exploration of STCO across all major agents, read our STCO framework masterclass. To understand how context engineering differs from prompt engineering at a fundamental level, see our context engineering guide.
Side-by-Side — Codex vs Cursor vs Claude Code vs Devin
We have published dedicated guides for each of these agents — Cursor, Claude Code, and Devin — giving us a unique cross-agent perspective. For the full deep-dive comparison with benchmarks, see our Cursor vs Claude Code vs Codex vs Devin comparison.
Architecture Paradigms Compared
Each tool represents a fundamentally different architecture for AI-assisted development:
- IDE-embedded (Cursor): AI lives inside your editor. Lowest friction, tightest feedback loop. Best for inline edits and rapid iteration.
- Terminal-native (Claude Code): AI operates from your terminal with direct filesystem access and a 200k token context window. Maximum power for complex multi-file refactoring.
- Cloud-first async (Codex CLI): AI works in sandboxed cloud environments, processing tasks asynchronously. Ideal for fire-and-forget batch operations.
- Fully autonomous (Devin): AI runs in a complete cloud VM with browser, terminal, and editor. Researches documentation, writes code, and submits PRs independently.
Feature Comparison Matrix
| Feature | Cursor | Claude Code | Codex CLI | Devin |
|---|---|---|---|---|
| Context Window | ~120k tokens | 200k tokens | 128k tokens | Unlimited (VM) |
| Multi-file Refactors | ⚠️ Agent mode | ✅ | ✅ | ✅ |
| Git Integration | ⚠️ Basic | ✅ Native | ✅ Auto PR | ✅ Full |
| CI/CD Integration | ❌ | ✅ Headless | ✅ ChatGPT | ✅ |
| Open Source | ❌ | ❌ | ✅ Apache 2.0 | ❌ |
| Test Execution | ⚠️ Via terminal | ✅ Direct | ✅ Sandboxed | ✅ Full VM |
| Config File | .cursor/rules | CLAUDE.md | AGENTS.md | Knowledge base |
| MCP Support | ✅ | ✅ | ⚠️ Limited | ❌ |
| Multi-model | ✅ | ❌ Claude only | ✅ | ❌ Proprietary |
Pricing Comparison
| Tool | Free Tier | Pro | Enterprise | Usage Model |
|---|---|---|---|---|
| Cursor | Hobby (limited) | $20/mo | $40/user/mo | Request-based |
| Claude Code | ❌ | $20/mo | $200/mo (Max 20x) | Subscription tiers |
| Codex CLI | ✅ (BYOK) | $20/mo (Plus) | $200/mo (Pro) | Token-based |
| Devin | ❌ | $500/mo | Custom | ACU-based |
Disclosure: We have no affiliate relationship with any of these tools. All data comes from internal benchmarks across 100,000+ prompts on AI Prompt Architect.
When to Choose Codex Over Alternatives
Codex excels when you need fire-and-forget execution. Unlike Claude Code (which requires terminal interaction) or Cursor (which requires IDE presence), Codex works while you sleep. Our benchmarks show:
- 91% success rate on async batch tasks (test generation, documentation, dependency updates)
- 67% success rate on real-time interactive tasks (where Claude Code's 94% and Cursor's 89% are superior)
Choose Codex for: test generation at scale, automated PR creation, CI/CD pipeline integration, and batch documentation. Choose Claude Code for: complex multi-file reasoning and real-time refactoring. Choose Cursor for: inline edits and rapid iteration. Choose Devin for: fully autonomous greenfield projects.
The Multi-Tool Stack Strategy
Our recommended stack uses all three tools for different purposes: Cursor as the scratchpad for quick inline edits, Claude Code as the architect for deep reasoning and multi-file refactors, and Codex CLI as the workhorse for async batch operations. This combination covers 98% of development tasks at a combined cost of approximately £110 per month.
Advanced Codex Workflows and Orchestration
Async Batch Processing — Test Generation at Scale
The most compelling use case for Codex is batch test generation. Using Codex in full-auto mode, we generated comprehensive test suites for 34 utility functions in 18 minutes. The same task took a human engineer approximately 6 hours — a 20x acceleration.
The key is structuring each batch prompt as an atomic, verifiable work order. Each prompt targets a single file or module, includes specific coverage thresholds, and requires the agent to run the test suite before submitting. This prevents the cascading failures that occur when agents attempt to test an entire codebase in a single prompt.
CI/CD Integration — Codex in GitHub Actions
Codex CLI's headless mode enables direct pipeline integration. Practical use cases include:
- Automated PR review: Trigger Codex to analyse incoming PRs for security vulnerabilities and code quality
- Dependency update PRs: Schedule weekly Codex runs that update dependencies, run tests, and submit PRs automatically
- Test coverage enforcement: When coverage drops below a threshold, Codex generates the missing tests
For strategies on reducing API costs in pipeline integrations, see our prompt caching optimisation guide.
Plan Mode vs Agent Mode — Staging Your Execution
A critical workflow optimisation is separating planning from execution:
- Plan mode: The agent analyses the STCO prompt, reads the repository, and writes a detailed markdown proposal of the changes it intends to make. You review this architectural blueprint before any code is written.
- Agent mode: Once the plan is approved, a secondary execution phase implements the precise code changes, runs tests, and commits the diff.
We enforce plan-then-execute as a hard rule in our AGENTS.md for any refactoring task touching more than 5 files. This prevents the costly scenario where an agent writes 2,000 lines of code based on a misunderstood requirement.
Common Pitfalls and Anti-Patterns
The Sandbox Gotcha — Local State vs Cloud Snapshot
Codex operates on a snapshot of your repository, not the live filesystem. Any uncommitted changes, unstaged files, or local environment variables are invisible to the agent. This caught our team early: a developer made local changes, prompted Codex, and received a PR that conflicted with uncommitted work.
Over-Constraining the Agent
A frequent anti-pattern is dictating exact implementation steps. When you tell Codex “open file X, go to line 45, write this exact code,” you stifle its ability to dynamically problem-solve. If a dependency version has changed or an API has been deprecated, an over-constrained prompt prevents the agent from finding a workaround.
The fix: define the desired outcome and hard constraints, but let the agent choose the implementation path. Declare that a function must return cached data within 50ms — do not explain how to write the Redis SET/GET logic.
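An outcome such as "within 50ms" works best when it is stated as a check the agent can run. A minimal sketch of such an acceptance check follows; `runs_within_budget` is a hypothetical helper, and measuring the slowest of several calls is one possible reading of the requirement, not the only one.

```python
import time


def runs_within_budget(fn, budget_ms: float, runs: int = 20) -> bool:
    """Return True if the slowest of `runs` calls to fn() finishes within budget_ms."""
    slowest_ms = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        slowest_ms = max(slowest_ms, elapsed_ms)
    return slowest_ms <= budget_ms
```

The prompt can then name the check as the acceptance criterion (for example, "`runs_within_budget(load_profile, 50)` must return True", where `load_profile` stands in for your own function) and leave the caching strategy to the agent.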
Context Window Management in 128k Tokens
Codex's 128k token context window is generous but smaller than Claude Code's 200k, so be selective with context injection. Rather than dumping your entire repository into the prompt, reference only the files and interfaces relevant to the task. Separately, our platform data shows that setting temperature to 0.2 for code generation tasks increases formatting compliance by 40%.
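Selective context injection can be approximated with a simple token budget. The sketch below assumes roughly four characters per token, which is only a rough heuristic (real tokenizers vary with language and content), and `select_context` is a hypothetical helper rather than anything Codex provides.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of text using ceiling division."""
    return -(-len(text) // CHARS_PER_TOKEN)


def select_context(
    files: dict[str, str], priority: list[str], budget_tokens: int
) -> list[str]:
    """Pick files in priority order until the token budget is used up."""
    chosen: list[str] = []
    used = 0
    for path in priority:
        if path not in files:
            continue
        cost = estimate_tokens(files[path])
        if used + cost > budget_tokens:
            continue  # too large for what is left; a smaller file may still fit
        chosen.append(path)
        used += cost
    return chosen
```

Ordering `priority` by relevance to the task means the interfaces the agent most needs are included first, and bulky low-value files are the ones dropped.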
For deeper strategies on maximising context efficiency, see our context engineering guide.
Security Considerations for Cloud-Sandboxed Execution
Codex's full-auto mode runs with network access disabled. This is a deliberate security decision: the agent cannot make outbound HTTP requests, which prevents accidental data exfiltration and supply chain attacks. However, this also means the agent cannot fetch external documentation or install packages from registries during execution.
Never include production API keys, database credentials, or secrets in prompts. Use mock values for testing, and reference environment variable names rather than actual values in your AGENTS.md.
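A lightweight scan before sending a prompt can catch the most obvious leaks. The patterns below are illustrative assumptions and far from exhaustive: a key-like string starting with `sk-`, the AWS access key ID shape, and a URL with an embedded password. Treat this as a safety net, not a substitute for a dedicated secret scanner.

```python
import re

# Illustrative patterns only; extend for the credentials your team actually uses.
SECRET_PATTERNS = {
    "openai_style_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "url_with_password": re.compile(r"[a-z][a-z0-9+.-]*://[^/\s:@]+:[^/\s@]+@"),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of the secret patterns that match text, sorted."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)
    )
```

A prompt that says "read the connection string from DATABASE_URL" passes cleanly, while one that pastes the connection string itself is flagged.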
Frequently Asked Questions
What is OpenAI Codex CLI and how does it differ from ChatGPT?
Codex CLI is OpenAI's open-source (Apache 2.0) terminal agent. It is installed via npm, runs locally, and accepts API keys from multiple providers. ChatGPT Codex is the integrated web experience within ChatGPT Plus ($20/mo) and Pro ($200/mo) subscriptions: a simpler, no-setup interface that executes tasks in a cloud sandbox. The CLI offers BYOK (bring your own key) and multi-provider support; ChatGPT offers a more accessible interface for occasional use.
Is Codex CLI free to use?
The CLI itself is free and open source under the Apache 2.0 licence, but you need an API key to power it. OpenAI charges approximately $2 per million input tokens and $8 per million output tokens. Alternatively, bring your own key from Anthropic or Google, or run local models via Ollama at zero cloud cost. ChatGPT Plus ($20/mo) and Pro ($200/mo) include the web-based Codex integration.
How does Codex compare to Claude Code for coding tasks?
Claude Code achieves a 94% success rate on multi-file refactoring with its 200k token context. Codex CLI hits 91% on async batch tasks with 128k tokens. Choose Claude Code for interactive, complex reasoning; choose Codex for fire-and-forget batch operations like test generation. For detailed benchmarks, see our full comparison.
What is AGENTS.md and how do I set it up for Codex?
AGENTS.md is a markdown file at your repository root that defines project conventions, constraints, and rules for AI agents. Codex reads it automatically before starting any task. It's hierarchical — place files at the repo root for global rules, in subdirectories for module-specific overrides, or at ~/.codex/AGENTS.md for personal defaults. See our configuration comparison for setup patterns.
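To see which files would apply to a given directory, the hierarchy can be walked from most general to most specific. This is a sketch of the layout described above, not Codex's actual loader: `applicable_agents_files` is a hypothetical helper, and treating the personal file as the least specific layer is an assumption.

```python
from pathlib import Path


def applicable_agents_files(repo_root: Path, target_dir: Path, home: Path) -> list[Path]:
    """List existing AGENTS.md files, most general first, most specific last."""
    repo_root = repo_root.resolve()
    target_dir = target_dir.resolve()

    # Assumption: personal defaults are the least specific layer.
    candidates = [home / ".codex" / "AGENTS.md"]

    # Raises ValueError if target_dir is not inside repo_root.
    relative = target_dir.relative_to(repo_root)

    current = repo_root
    candidates.append(current / "AGENTS.md")
    for part in relative.parts:
        current = current / part
        candidates.append(current / "AGENTS.md")

    return [path for path in candidates if path.is_file()]
```

Calling it with your repository root, the module directory you are working in, and `Path.home()` returns the files in the order their rules would layer, with the last entry being the most specific override.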
Can Codex CLI work offline or with local models?
Codex CLI requires internet connectivity for its cloud sandbox execution environment. However, it supports local language models via Ollama, meaning the AI reasoning can happen entirely on your machine. The trade-off is losing the sandboxed execution environment — local execution carries the same risks as running any agent on your filesystem.
What are the three approval modes in Codex CLI?
Suggest mode is read-only and asks before every action (safest). Auto-edit mode modifies files autonomously but asks before terminal commands (balanced). Full-auto mode provides complete autonomy with a network-disabled sandbox (most productive). We recommend starting with suggest, graduating to auto-edit for trusted repositories, and reserving full-auto for sandboxed batch tasks.
How do I structure prompts for Codex using the STCO framework?
Use the four-element structure: Situation (cloud sandbox context and AGENTS.md conventions), Task (atomic, verifiable objective), Context (specific file references and constraints), Output (acceptance criteria and verification commands). Our benchmarks show this structure achieves 87% success versus 23% for unstructured prompts. See our STCO framework guide for ready-to-use templates.
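The four elements map naturally onto a small data structure that refuses to render an incomplete prompt. The sketch below is one possible encoding; the `StcoPrompt` class and the plain "Heading:" layout are assumptions made for illustration, not the official STCO template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StcoPrompt:
    situation: str  # cloud sandbox context and AGENTS.md conventions
    task: str       # atomic, verifiable objective
    context: str    # specific file references and constraints
    output: str     # acceptance criteria and verification commands

    def render(self) -> str:
        """Render the four sections in STCO order, rejecting empty ones."""
        sections = [
            ("Situation", self.situation),
            ("Task", self.task),
            ("Context", self.context),
            ("Output", self.output),
        ]
        missing = [name for name, body in sections if not body.strip()]
        if missing:
            raise ValueError(f"Missing STCO sections: {', '.join(missing)}")
        return "\n\n".join(f"{name}:\n{body.strip()}" for name, body in sections)
```

Failing fast on an empty section keeps half-specified prompts from reaching the agent.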
ExO Intelligence Council
Author: Expert in prompt architecture and large language model optimization.
