If you are building with large language models, the hardest decision is often not which model to pick but how much control to give it. Should the model return strict JSON, request a tool call, or participate in a broader tool-use loop where it reasons, selects actions, and continues? This guide compares function calling, tool use, and JSON mode as practical LLM control patterns. The goal is not to crown a universal winner, but to help you choose the right pattern for reliability, latency, implementation effort, and long-term maintainability.
Overview
Developers usually meet these patterns when plain text stops being enough. A chatbot needs to create tickets, a support assistant must fetch account data, or an internal copilot has to produce structured output for downstream systems. At that point, better prompt wording will not close the gap. You need a control pattern that constrains how the model speaks and how your application responds.
At a high level, the three patterns solve different problems:
- JSON mode is best when you need the model to produce structured data and nothing else. Think extraction, classification, routing, or templated object generation.
- Function calling is best when the model needs to choose from predefined actions with known schemas. Think API execution, database lookups, or application workflows with clear boundaries.
- Tool use is best when the model needs a more agent-like loop: inspect context, choose a tool, observe a result, then decide what to do next. Think research agents, multi-step assistants, and orchestration-heavy flows.
These categories overlap, and different model providers package them differently. Some call everything tool use. Some expose function calling as a specific form of structured tool invocation. Some offer structured output features that behave like stricter JSON mode. For stack selection, the naming matters less than the control surface you actually get.
A useful way to think about the choice is this:
- If you only need shape control, start with JSON mode.
- If you need safe action selection, start with function calling.
- If you need iterative reasoning with external systems, start with tool use.
That framing keeps teams from overbuilding. Many applications described as “agents” are really structured output pipelines. Many “tool use” needs are satisfied by one function call and a deterministic application layer. The simplest control pattern that reliably solves the problem is usually the best one.
How to compare options
The best comparison is not feature marketing. It is operational fit. Before choosing a pattern, define the job your model must perform and the failure modes your system cannot tolerate.
Use these six criteria.
1. Output reliability
Ask how strictly you need the response format to hold. If a malformed field breaks your parser or your queue consumer, reliability matters more than conversational flexibility. JSON mode usually wins for narrow structured output tasks because it reduces formatting drift. Function calling can be similarly reliable when the schema is clear and the action space is small. Tool use tends to be less predictable because the model is participating in a larger loop, often with more chances to wander.
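As a minimal sketch of what "holding the format" means in application code, the check below parses a model reply and rejects anything that drifts from an expected shape. The ticket-triage fields (`category`, `priority`, `summary`) are assumptions made up for illustration, not part of any provider's API.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXPECTED_FIELDS = {"category": str, "priority": int, "summary": str}


def parse_model_output(raw: str) -> dict:
    """Parse raw model text and verify it matches the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = EXPECTED_FIELDS.keys() - data.keys()
    extra = data.keys() - EXPECTED_FIELDS.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    return data
```

A reply that wraps the JSON in conversational text fails at the first step, which is exactly the formatting drift a queue consumer cannot tolerate.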
2. Action safety
Ask whether the model is allowed to trigger real-world side effects. Sending emails, creating incidents, updating records, and calling third-party APIs all raise the bar. Function calling is often the cleanest choice here because you can allowlist available functions, validate arguments, and require human approval before execution. Tool use can still be safe, but only if your orchestration layer is disciplined about permissions, retries, and stop conditions.
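The three safeguards named above (an allowlist, argument validation, and human approval) can be sketched as one gate that runs before anything executes. The function names and the `side_effect` flag are hypothetical; a real system would likely validate argument types as well as presence.

```python
from dataclasses import dataclass

# Hypothetical allowlist: which functions exist, which have side effects,
# and which argument names each one requires.
ALLOWED_FUNCTIONS = {
    "lookup_order": {"side_effect": False, "required": {"order_id"}},
    "send_email": {"side_effect": True, "required": {"to", "body"}},
}


@dataclass
class Decision:
    allowed: bool
    reason: str


def authorize(name: str, arguments: dict, approved_by_human: bool = False) -> Decision:
    """Decide whether a model-proposed call may run. Nothing executes here."""
    spec = ALLOWED_FUNCTIONS.get(name)
    if spec is None:
        return Decision(False, f"function {name!r} is not on the allowlist")
    missing = spec["required"] - arguments.keys()
    if missing:
        return Decision(False, f"missing arguments: {sorted(missing)}")
    if spec["side_effect"] and not approved_by_human:
        return Decision(False, "side effect requires human approval")
    return Decision(True, "ok")
```

Keeping the decision separate from execution makes it easy to log every refusal with its reason.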
3. Latency
Ask how many round trips the interaction can tolerate. JSON mode is usually the fastest because it often completes in a single generation. Function calling may add one extra step: the model selects a function, your app executes it, and then the model may summarize or finalize. Tool use can be the slowest because it often involves several turns, multiple tools, and more tokens spent on reasoning and state.
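A back-of-the-envelope way to compare round trips, assuming calls run strictly in sequence. The per-call timings and the call counts for each pattern are illustrative assumptions, not measurements.

```python
def estimate_latency_ms(model_calls: int, tool_calls: int,
                        model_ms: float, tool_ms: float) -> float:
    """Rough serial latency: each model or tool call waits on the previous one."""
    return model_calls * model_ms + tool_calls * tool_ms


# Assumed, not measured: 800 ms per model call, 200 ms per tool call.
json_mode = estimate_latency_ms(1, 0, 800, 200)         # 800
function_calling = estimate_latency_ms(2, 1, 800, 200)  # 1800
tool_loop = estimate_latency_ms(5, 4, 800, 200)         # 4800
```

Replacing the assumed numbers with your own traced timings turns this into a quick budget check per pattern.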
4. Engineering complexity
Ask how much infrastructure your team wants to own. JSON mode is often straightforward: define a schema, prompt for it, validate, and recover on failure. Function calling adds schema definitions, execution handlers, validation layers, and audit logic. Tool use adds orchestration, loop management, memory handling, error recovery, tracing, and usually stronger evaluation requirements. Complexity is not bad if the task needs it. It is expensive if it does not.
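The JSON mode steps above (prompt, validate, recover on failure) might look like the following sketch. `call_model` is a stand-in for whatever client your stack uses; it is an assumption here, modeled as any function that takes a prompt string and returns text.

```python
import json
from typing import Callable


def generate_json(call_model: Callable[[str], str], prompt: str,
                  required_keys: set, max_attempts: int = 3) -> dict:
    """Call the model, validate the JSON, and re-prompt with the error on failure."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            # Feed the rejection reason back so the next attempt can correct it.
            current_prompt = (f"{prompt}\n\nPrevious reply was rejected: "
                              f"{last_error}. Return only JSON.")
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

The attempt cap matters as much as the retry: an unbounded recovery loop quietly turns a cheap pattern into an expensive one.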
5. Observability and testing
Ask how easy it will be to debug failures in production. JSON mode failures are often visible as parse errors, missing fields, or schema mismatches. Function calling failures are usually more inspectable because you can log selected functions and arguments. Tool use is the hardest to test because you need to evaluate both the quality of decisions and the correctness of tool sequences. If your team does not yet have a strong prompt testing framework, start with simpler patterns and expand later. For broader reliability practices, see Testing Playbooks for Conversational Personas: Unit, Integration, and Red-Teaming Approaches.
6. Portability across vendors
Ask whether you may switch model providers later. JSON-shaped prompting is conceptually portable, even if structured output features differ. Function calling and tool use APIs vary more. Schemas, tool message formats, and control semantics may change across vendors. If portability matters, keep your application layer abstract: define internal tool contracts and adapt provider-specific APIs behind them.
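One way to keep the application layer abstract is to define tools once in an internal contract and translate at the edge. In this sketch both wire formats (`vendor_a`, `vendor_b`) are invented for illustration and do not correspond to any real provider's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolContract:
    """Provider-neutral description of a tool, owned by the application."""
    name: str
    description: str
    parameters: dict = field(default_factory=dict)  # JSON-Schema-style dict


def to_vendor_a(tool: ToolContract) -> dict:
    # Hypothetical wire format, invented for this example.
    return {"name": tool.name, "doc": tool.description, "schema": tool.parameters}


def to_vendor_b(tool: ToolContract) -> dict:
    # A second hypothetical wire format with different key names.
    return {"tool_name": tool.name, "summary": tool.description,
            "input_schema": tool.parameters}


ADAPTERS = {"vendor_a": to_vendor_a, "vendor_b": to_vendor_b}


def export_tools(tools: list, provider: str) -> list:
    """Translate internal contracts into one provider's format."""
    return [ADAPTERS[provider](t) for t in tools]
```

Switching providers then means writing one new adapter rather than touching every tool definition.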
A simple rule helps here: compare what your app guarantees, not what the model advertises. The pattern that minimizes hidden ambiguity will usually be easier to maintain.
Feature-by-feature breakdown
This section gives a practical comparison of how each pattern behaves in real systems.
JSON mode
What it is: The model is instructed or constrained to return valid JSON that matches a target shape.
Best for: Extraction, classification, tagging, routing, metadata generation, content normalization, and any workflow where the output feeds another deterministic system.
Strengths:
- Simple mental model.
- Low latency, often a single generation.
- Works well with validators and typed application code.
- Good fit for prompt optimization because success criteria are measurable.
Weaknesses:
- It does not execute actions by itself.
- Complex nested schemas can still produce invalid or incomplete output.
- Strict formatting does not guarantee factual correctness.
- Recovery logic is still needed for invalid or incomplete responses.
Use it when: You want a structured output LLM pattern without giving the model action authority. For many internal tools, this is the safest default.
Common mistake: Teams treat JSON mode as a truth guarantee. It is only a formatting and structure control. You still need grounding, validation, and business rules. If your output depends on retrieved knowledge, combine this with retrieval patterns rather than hoping structure alone will reduce hallucinations. Related reading: RAG Architecture Guide: Choosing Chunking, Retrieval, and Re-Ranking Strategies.
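To illustrate the gap between structure and truth, here is a sketch of business-rule checks applied to a payload that has already passed shape validation. The refund scenario, field names, and order IDs are hypothetical.

```python
def check_refund(payload: dict, order_total: float, known_order_ids: set) -> list:
    """Return business-rule violations for an already well-formed payload."""
    problems = []
    if payload["order_id"] not in known_order_ids:
        problems.append("order_id does not exist in our records")
    if payload["amount"] <= 0:
        problems.append("amount must be positive")
    elif payload["amount"] > order_total:
        problems.append("amount exceeds the order total")
    return problems


# Well-formed JSON, wrong facts: the model invented an order and an amount.
payload = {"order_id": "ORD-999", "amount": 250.0}
print(check_refund(payload, order_total=100.0, known_order_ids={"ORD-001"}))
# ['order_id does not exist in our records', 'amount exceeds the order total']
```

Both violations come from data the model could not know without grounding, which is why validation has to consult your own records.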
Function calling
What it is: The model chooses from a set of predefined functions and produces arguments that your application can validate and execute.
Best for: Transactional workflows, API integrations, database operations, search requests, workflow automation, and assistants that need to act but within narrow lanes.
Strengths:
- Clear action boundary between model and application.
- Stronger safety posture than free-form action generation.
- Good developer ergonomics when tools have explicit schemas.
- Easier to audit which action was selected and with which arguments.
Weaknesses:
- Schema design matters more than many teams expect.
- Poorly scoped functions can confuse the model.
- Too many similar tools can lower selection quality.
- Cross-provider behavior may vary.
Use it when: You need the model to decide whether to act and which action to take, but your system still owns execution, permissions, and final authority.
Common mistake: Defining functions that are too broad. A single “do_everything” tool defeats the point of the pattern. Narrow, semantically distinct tools are easier for the model to select and easier for your team to test.
Function calling also pairs well with strong system prompts. If tool selection quality is poor, the issue is often not the API feature but unclear role definition, missing decision rules, or fuzzy escalation instructions. See System Prompt Best Practices: A Living Guide for Reliable AI Assistants.
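A minimal sketch of narrow, semantically distinct tools behind a single dispatcher. The tool names, the stubbed return values, and the `{"name": ..., "arguments": {...}}` call shape are assumptions for illustration; providers format tool calls differently.

```python
from typing import Callable

TOOLS: dict = {}


def tool(name: str) -> Callable:
    """Register a handler under an explicit, narrow name."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = fn
        return fn
    return decorator


@tool("get_order_status")
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stubbed backend call


@tool("create_ticket")
def create_ticket(subject: str, priority: str = "normal") -> dict:
    return {"ticket": subject, "priority": priority}  # stubbed backend call


def dispatch(call: dict) -> dict:
    """Execute a model-proposed call of the form {"name": ..., "arguments": {...}}."""
    handler = TOOLS.get(call.get("name"))
    if handler is None:
        return {"error": f"unknown tool: {call.get('name')}"}
    try:
        return {"result": handler(**call.get("arguments", {}))}
    except TypeError as exc:  # wrong or missing argument names
        return {"error": f"bad arguments: {exc}"}
```

Because each handler has its own signature, a malformed call surfaces as a specific error you can log and test, instead of disappearing inside a catch-all tool. Note that this sketch also reports a `TypeError` raised inside a handler as bad arguments, which a production version would want to distinguish.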
Tool use
What it is: A broader orchestration pattern in which the model can iteratively call tools, inspect outputs, and continue toward a goal.
Best for: Agent workflows, open-ended research, multi-step troubleshooting, dynamic planning, and environments where the next step depends on prior tool results.
Strengths:
- Flexible for messy, real-world tasks.
- Can reduce application-side branching for exploratory workflows.
- Useful when a fixed one-shot prompt is not enough.
- Supports richer AI workflow automation patterns.
Weaknesses:
- Higher latency and token usage.
- More failure points across loops and tools.
- Harder to evaluate deterministically.
- Greater need for stop rules, timeouts, and cost controls.
Use it when: The job truly requires iterative interaction with external systems and intermediate reasoning. If the path can be made deterministic, do that instead.
Common mistake: Using tool use as a substitute for workflow design. Many teams reach for agent loops before they define a good sequence. In production, a prompt chain plus a narrow function-calling layer is often easier to reason about than a fully open tool-use agent. For concrete sequencing patterns, see Prompt Chaining Patterns That Actually Work in Production.
A practical comparison table in words
If you need the shortest possible summary:
- Highest format control: JSON mode
- Best balance of actionability and safety: Function calling
- Most flexible for agent workflows: Tool use
- Lowest implementation overhead: JSON mode
- Best default for production side effects: Function calling with validation
- Highest need for observability and guardrails: Tool use
That makes the core tradeoff clear: as flexibility rises, control and predictability become harder to maintain.
Best fit by scenario
Most teams do not need abstract definitions. They need to map patterns to real use cases. Here is a practical scenario-based guide.
Scenario 1: Build an intake classifier or router
If the model only needs to read text and return labels, categories, confidence bands, or structured fields, choose JSON mode. It is usually enough for ticket triage, lead qualification, moderation queues, and internal content tagging. Add schema validation and fallback rules in your app.
Scenario 2: Build an AI chatbot that can take safe actions
If the assistant needs to open tickets, look up orders, search documentation, or trigger approved automations, choose function calling. This gives you a cleaner contract between the model and your backend. It is often the right middle ground for teams moving from prompt engineering to real LLM app development.
Scenario 3: Build a research or operations copilot
If the assistant needs to search, inspect multiple sources, compare findings, then choose follow-up tools based on what it sees, choose tool use. But put hard limits around retries, tool count, and elapsed time. Without these controls, flexibility turns into cost and inconsistency.
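The hard limits described above can be sketched as a loop that stops on a step cap or a deadline, whichever trips first. `decide` stands in for a model call and is an assumption here, as is the action shape it returns; retry caps would follow the same idea but are omitted for brevity.

```python
import time
from typing import Callable


def run_tool_loop(decide: Callable[[list], dict], tools: dict,
                  max_steps: int = 5, max_seconds: float = 10.0) -> dict:
    """Run decide -> tool -> observe until the model finishes or a limit trips.

    `decide` receives the history and returns either {"final": ...}
    or {"tool": name, "args": {...}} (a hypothetical shape).
    """
    history: list = []
    deadline = time.monotonic() + max_seconds
    for step in range(max_steps):
        if time.monotonic() > deadline:
            return {"status": "timeout", "steps": step, "history": history}
        action = decide(history)
        if "final" in action:
            return {"status": "done", "steps": step, "answer": action["final"]}
        handler = tools.get(action["tool"])
        if handler is None:
            # Report the bad choice back to the model instead of crashing.
            observation = {"error": f"unknown tool {action['tool']}"}
        else:
            observation = handler(**action.get("args", {}))
        history.append({"action": action, "observation": observation})
    return {"status": "step_limit", "steps": max_steps, "history": history}
```

Returning a status rather than raising lets the caller decide whether a tripped limit means a fallback answer, a human handoff, or a retry with a narrower goal.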
Scenario 4: Generate business-ready payloads for another system
If your application sends the model output into a workflow engine, data pipeline, or form renderer, choose JSON mode first. This is especially true when downstream systems require exact fields. In many cases, you can keep the model focused on producing structured intent, then let deterministic code decide what to do next.
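A small sketch of that split: the model emits structured intent, and a routing table in code decides what happens next. The intent names, queue names, and the 0.8 confidence threshold are all hypothetical.

```python
import json

# Hypothetical intents; the routing table lives in code, not in the prompt.
ROUTES = {
    "cancel_order": "orders-queue",
    "update_address": "accounts-queue",
}
MIN_CONFIDENCE = 0.8  # assumed threshold for illustration


def route_intent(raw_model_output: str) -> str:
    """Map a structured intent to a queue; anything unexpected goes to a human."""
    try:
        intent = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return "human-review"
    if not isinstance(intent, dict):
        return "human-review"
    name = intent.get("intent")
    confidence = intent.get("confidence")
    if not isinstance(name, str) or isinstance(confidence, bool):
        return "human-review"
    if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
        return "human-review"
    return ROUTES.get(name, "human-review")
```

The model never names a queue directly, so adding or renaming a route is a code change with a normal review, not a prompt change.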
Scenario 5: Orchestrate a multi-step support workflow
If the task includes retrieval, policy checks, action recommendations, and selective automation, the answer may be hybrid. Use retrieval and prompt chaining for information gathering, function calling for approved actions, and structured output for final state reporting. The best stack is often a combination rather than a single pattern. For support-heavy design considerations, see Empathetic Automation: Designing AI Systems That Reduce Friction for Support Teams.
Scenario 6: Build for regulated or sensitive domains
When auditability and reversibility matter, start with JSON mode or function calling, not open-ended tool use. Keep side effects deterministic, require validation, and separate recommendation from execution whenever possible.
A simple selection heuristic
If you want one decision tree, use this:
- Does the model need to take actions? If no, use JSON mode.
- If yes, are the actions predefined and limited? If yes, use function calling.
- If the action path depends on iterative discovery across multiple tools, consider tool use.
- If you are unsure, start one level simpler than your instincts suggest.
That last step matters. Simpler patterns are easier to evaluate, cheaper to operate, and easier to hand off across teams.
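The same decision tree, written as a function so it can sit in a design checklist or a test. The parameter names and return labels are illustrative; the final fallback encodes the "one level simpler" rule for the unsure case.

```python
def choose_pattern(takes_actions: bool,
                   actions_predefined_and_limited: bool,
                   needs_iterative_discovery: bool) -> str:
    """Apply the selection heuristic in the order the questions are asked."""
    if not takes_actions:
        return "json_mode"
    if actions_predefined_and_limited:
        return "function_calling"
    if needs_iterative_discovery:
        return "tool_use"
    # Unsure: fall back to the simpler action-capable pattern.
    return "function_calling"
```

For example, a support bot that only opens tickets from a fixed set of actions resolves to `function_calling` even if the team initially framed it as an agent.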
When to revisit
This comparison should be revisited whenever your application requirements change, not just when model vendors ship new features. The right pattern at the prototype stage may be the wrong one at scale.
Review your choice when any of these happen:
- Your latency budget tightens. A tool-use loop that felt acceptable in testing may be too slow for a production chatbot.
- Your side effects become riskier. Moving from internal summaries to customer-visible actions often means shifting from open tool use toward stricter function calling and approval layers.
- Your schemas become more complex. If JSON reliability starts dropping, simplify the output shape or split one generation into a prompt chain.
- Your tool count grows. As the number of tools increases, function selection quality can degrade. You may need better tool descriptions, routing logic, or domain-specific tool groups.
- Your model provider changes features or policies. Structured output support, tool semantics, and API ergonomics evolve. Re-test your assumptions before expanding usage.
- Your evaluation discipline improves. Once you have tracing, replay, and benchmark prompts, you may be able to safely adopt more advanced patterns.
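For the growing-tool-count trigger in the list above, one option is to route to a domain first and expose only that domain's tools to the model. The group and tool names below are hypothetical.

```python
# Hypothetical domain-specific tool groups.
TOOL_GROUPS = {
    "billing": ["get_invoice", "issue_refund"],
    "shipping": ["track_package", "update_address"],
}
# Tools offered regardless of domain, so there is always a safe exit.
ALWAYS_AVAILABLE = ["escalate_to_human"]


def tools_for(domain: str) -> list:
    """Return the small tool set to expose for one routed domain."""
    return TOOL_GROUPS.get(domain, []) + ALWAYS_AVAILABLE
```

The domain itself can come from a cheap classification step, which keeps each selection problem small even as the total catalog grows.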
A practical review routine looks like this:
- Pick three to five high-value user journeys.
- Measure success by format validity, correct action selection, task completion, and latency.
- Log every schema failure, unnecessary tool call, and ambiguous handoff.
- Ask whether the current pattern is too weak or simply poorly implemented.
- Only upgrade to a more flexible pattern if the simpler one cannot meet the requirement.
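The four measurements in the routine above can be computed from logged runs with a few lines. The log record fields (`valid_format`, `correct_action`, `completed`, `latency_ms`) are an assumed shape, not a standard.

```python
from statistics import median


def summarize_runs(runs: list) -> dict:
    """Aggregate logged runs into the four review metrics."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")

    def rate(key: str) -> float:
        # Share of runs where the given boolean field is true.
        return sum(1 for r in runs if r[key]) / total

    return {
        "format_validity": rate("valid_format"),
        "correct_action": rate("correct_action"),
        "task_completion": rate("completed"),
        "median_latency_ms": median(r["latency_ms"] for r in runs),
    }
```

Running this per user journey, before and after a change, gives a concrete basis for deciding whether the pattern is too weak or just poorly implemented.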
In many teams, reliability problems blamed on “the model” are really design problems: unclear system prompts, overloaded tools, weak retrieval, or no evaluation loop. Before changing control patterns, tighten those fundamentals. Helpful companion guides include Prompt Patterns to Defeat AI Sycophancy: Engineering Balanced, Critical Responses and Rethinking Unlimited Plans: Engineering Fair Usage and Cost Controls for AI SaaS.
The most durable strategy is to treat JSON mode, function calling, and tool use as layers in a stack, not competing ideologies. Start with the minimum control surface your application needs. Add flexibility only where measurable value justifies the added cost, latency, and complexity. That approach will age better than chasing whichever control pattern happens to be fashionable this quarter.