- Build durable AI agents that recover automatically from crashes and API failures
- Integrate Restate with the OpenAI Agent SDK for Python
- Observe and debug agent executions with detailed traces
- Implement resilient human-in-the-loop workflows with approvals and timeouts
- Manage conversation history and state across multi-turn interactions
- Orchestrate multiple agents working together on complex tasks
Getting Started
A Restate AI application has two main components:- Restate Server: The core engine that takes care of the orchestration and resiliency of your agents
- Agent Services: Your agent or AI workflow logic using the Restate SDK for durability

Run the agent
Install Restate and launch it:http://localhost:9070
) or CLI:
run
handler of the WeatherAgent
in the overview:

curl
:
Durable Execution
AI agents make multiple LLM calls and tool executions that can fail due to rate limits, network issues, or service outages. Restate uses Durable Execution to make your agents withstand failures without losing progress. The Restate SDK records the steps the agent executes in a log and replays them if the process crashes or is restarted:
Creating a Durable Agent
To implement a durable agent, you can use the Restate SDK in combination with the OpenAI Agent SDK. Here’s the implementation of the durable weather agent you just invoked:durable_agent.py
run
handler.
The endpoint that serves the agents of this tour over HTTP is defined in __main__.py
.
The agent can now be called at http://localhost:8080/WeatherAgent/run
.
The main difference compared to a standard OpenAI agent is the use of the Restate Context at key points throughout the agent logic.
Any action with the Context is automatically recorded by the Restate Server and survives failures.
We use this for:
- Persisting LLM responses: We use the
DurableModelCalls(restate_context)
model provider inRunner.run
, so that every LLM response is saved in Restate Server and can be replayed during recovery. - Resilient tool execution: Tools can make steps durable by using Context actions. Their outcome will then be persisted for recovery and retried until they succeed.
restate_context.run_typed
runs an action durably, retrying it until it succeeds and persisting the result in Restate (e.g. database interaction, API calls, non-deterministic actions).
Propagating Restate Context to tools
Propagating Restate Context to tools
The Restate Context gets supplied to the
run
handler by the Restate SDK when the handler is invoked.
This context is then propagated to tools via the agent context.
The Restate Context can also be contained in an object together with additional context.
Learn more from the OpenAI docs.Try out Durable Execution
Try out Durable Execution
Ask for the weather in Denver:On the invocation page in the UI, click on the invocation ID of the failing invocation.
You can see that your request is retrying because the weather API is down:
To fix the problem, remove the line Once you restart the service, the workflow finishes successfully.

fail_on_denver
from the fetch_weather
function in the app/utils/utils.py
file:utils/utils.py
Observing your Agent
As you saw in the previous section, the Restate UI comes in handy when monitoring and debugging your agents. The Invocations tab shows all agent executions with detailed traces of every LLM call, tool execution, and state change:
OpenTelemetry Integration
OpenTelemetry Integration
Restate supports OpenTelemetry for exporting traces to external systems like Langfuse, DataDog, or Jaeger:Have a look at the tracing docs to set this up.
Human-in-the-Loop Agent
Many AI agents need human oversight for high-risk decisions or gathering additional input. Restate makes it easy to pause agent execution and wait for human input. Benefits with Restate:- If the agent crashes while waiting for human input, Restate continues waiting and recovers the promise on another process.
- If the agent runs on function-as-a-service platforms, the Restate SDK lets the function suspend while its waiting. Once the approval comes in, the Restate Server invokes the function again and lets it resume where it left off. This way, you don’t pay for idle waiting time (Learn more).
human_approval_agent.py
You can also use awakeables outside of tools, for example, to implement human approval steps in between agent iterations.
Try out human approval
Try out human approval
Start a request for a high-value claim that needs human approval.
Use the playground or You can restart the service to see how Restate continues waiting for the approval.If you wait for more than a minute, the invocation will get suspended.Simulate approving the claim by executing the curl request that was printed in the service logs, similar to:See in the UI how the workflow resumes and finishes after the approval.
curl
with /send
to start the claim asynchronously, without waiting for the result.
Timeouts and Escalation
Timeouts and Escalation
Add timeouts to human approval steps to prevent workflows from hanging indefinitely.Restate persists the timer and the approval promise, so if the service crashes or is restarted, it will continue waiting with the correct remaining time:Try it out by sending a request to the service:You restart the service and check in the UI how the process will block for the remaining time without starting over.You can also lower the timeout to a few seconds to see how the timeout path is taken.
human_approval_agent_with_timeout.py
Resilient workflows as tools
You can pull out complex parts of your tool logic into separate workflows. This lets you break down complex agents into smaller, reusable components that can be developed, deployed, and scaled independently. The Restate SDK gives you clients to call other Restate services durably from your agent logic. All calls are proxied via Restate. Restate persists the call and takes care of retries and recovery. For example, let’s implement the human approval tool as a separate service:sub_workflow_agent.py
sub_workflow_agent.py
Try out sub-workflows
Try out sub-workflows
Start a request for a high-value claim that needs human approval.
Use In the UI, you can see that the agent called the workflow service and is waiting for the response.
You can see the trace of the sub-workflow in the timeline.Once you approve the claim, the workflow returns, and the agent continues.
/send
to start the claim asynchronously, without waiting for the result.
Follow the Tour of Workflows to learn more about implementing resilient workflows with Restate.
Durable Sessions
The next ingredient we need to build AI agents is the ability to maintain context and memory across multiple interactions. The OpenAI SDK allows plugging in custom session providers to manage conversation history. This integrates very well with Restate’s stateful entities, called Virtual Objects. Virtual Objects are Restate’s way of implementing stateful services with durable state management and built-in concurrency control. To implement stateful entities like chat sessions, or stateful agents, Restate provides Virtual Objects. Each Virtual Object instance maintains isolated state and is identified by a unique key.Virtual Objects as OpenAI Session Providers
The Restate OpenAI middleware includes a SessionProvider that automatically persists the agent’s conversation history in the Virtual Object state. Here is an example of a stateful, durable agent represented as a Virtual Object:
chat.py
- Long-lived state: K/V state is stored permanently. It has no automatic expiry. Clear it via
ctx.clear()
. - Durable state changes: State changes are logged with Durable Execution, so they survive failures and are consistent with code execution
- State is queryable via the state tab in the UI.

- Built-in concurrency control: Restate’s Virtual Objects have built-in queuing and consistency guarantees per object key. Handlers either have read-write access (
ObjectContext
) or read-only access (shared object context).- Only one handler with write access can run at a time per object key to prevent concurrent/lost writes or race conditions (for example
message()
). - Handlers with read-only access can run concurrently to the write-access handlers (for example
get_history()
).
- Only one handler with write access can run at a time per object key to prevent concurrent/lost writes or race conditions (for example

Try out Virtual Objects
Try out Virtual Objects
Stateful Chat Agent:Ask the agent to do some task:Continue the conversation - the agent remembers previous context:Get conversation history or view it in the UI:Seeing concurrency control in action:In the chat service, the The UI shows how Restate queues the requests per session to ensure consistency:
message
handler is an exclusive handler, while the getHistory
handler is a shared handler.Let’s send some messages to a chat session:
Stateful Serverless Agents
Stateful Serverless Agents
You can run Virtual Objects on serverless platforms like Modal, Render, or AWS Lambda.
When the request comes in, Restate attaches the correct state to the request, so your handler can access it locally.This way, you can implement stateful, serverless agents without managing any external state store and without worrying about concurrency issues.
Virtual Objects for storing context
You can store any context information in Virtual Objects, for example, user preferences or the last agent they interacted with. Usectx.set
and ctx.get
in your handler to store and retrieve state.
We will show an example of this in the next section when we orchestrate multiple agents.
Resilient multi-agent coordination
As your agents grow more complex, you may want to break them down into smaller, specialized agents that can delegate tasks to each other. Similar to sub-workflows, you can break down complex agents into multiple specialized agents. All agents can run in the same process or be deployed independently.Agents as tools/handoffs
If you want to share context between agents, run the agents in the same process and use handoffs or tools. You don’t need to do anything special to make this work with Restate. Use Virtual Object state to maintain context between runs. For example, store the last agent that was called in the object state, so the user can connect back seamlessly on the next interaction:multi_agent.py
Try out multi-agent systems
Try out multi-agent systems
Start a request for a claim that needs to be analyzed by multiple agents.In the UI, you can see that the agent called the sub-agents and is waiting for their responses.
You can see the trace of the sub-agents in the timeline.Once all sub-agents return, the main agent continues and makes a decision.
The state now contains the last agent that was called, so you can continue the conversation directly with the same agent:


Remote agents as tools
If you want to run agents independently, for example, to scale them separately, run them on different platforms, or let them get developed by different teams, then you can call them as tools via service calls. Restate will proxy all calls, persist them, and will guarantee that they complete successfully. Your main agent can suspend and save resources while waiting for the remote agent to finish. Restate invokes your main agent again once the remote agent returns.multi_agent.py
You cannot put both agents within the same Virtual Object, because this leads to deadlocks.
The main agent would block on the call to the sub-agent, preventing the sub-agent from executing, cause only one handler can run at a time per object key.
Try out multi-agent systems
Try out multi-agent systems
Start a request for a claim that needs to be analyzed by multiple agents.In the UI, you can see that the agent called the sub-agents and is waiting for their responses.
You can see the trace of the sub-agents in the timeline.Once all sub-agents return, the main agent continues and makes a decision.

Parallel Work
Now that our agents are broken down into smaller parts, let’s have a look at how to run different parts of our agent logic in parallel to speed up execution.When using the OpenAI Agent SDK with Restate, tool calls are executed sequentially by default to ensure deterministic execution during replays.
When multiple tools execute in parallel and use the Restate Context, the order of operations might differ between the original execution and the replay, leading to inconsistencies.
restate.gather
to gather their results or restate.select
to wait for the first one to complete.
Parallel Tool Steps
To parallelize tool steps, implement an orchestrator tool that uses durable execution to run multiple steps in parallel. Here is an insurance claim agent tool that runs multiple analyses in parallel:parallel_tools_agent.py
If you want to allow the LLM to call multiple tools in parallel, then you need to manually implement the agent tool execution loop using
restate.select
and durable promises.Try out parallel tool steps
Try out parallel tool steps
Start a request for a claim that needs to be analyzed by multiple tools in parallel.In the UI, you can see that the agent ran the tool steps in parallel.
Their traces all start at the same time.Once all tools return, the agent continues and makes a decision.

Parallel Agents
You can use the same durable execution primitives to run multiple agents in parallel. For example, to race agents against each other and use the first result that returns, while cancelling the others. Or to let a main orchestrator agent combine the results of multiple specialized agents in parallel:parallel_agents.py
Try out parallel agents
Try out parallel agents
Start a request for a claim that needs to be analyzed by multiple agents in parallel.In the UI, you can see that the handler called the sub-agents in parallel.
Once all sub-agents return, the main agent makes a decision.

Error Handling
LLM calls are costly, so you can configure retry behavior in both Restate and your AI SDK to avoid infinite loops and high costs. Restate distinguishes between two types of errors:- Transient errors: Temporary issues like network failures or rate limits. Restate automatically retries these until they succeed or the retry policy is exhausted.
- Terminal errors: Permanent failures like invalid input or business rule violations. Restate does not retry these. The invocation fails permanently. You can catch these errors and handle them gracefully.
Retries of LLM calls
Restate’sDurableModelCalls
provider lets you specify the maximum number of retries for LLM calls.
TerminalError
and won’t be retried further.
Tool execution errors
By default, the OpenAI Agent SDK will convert any error in tool execution into a message to the LLM, and the LLM will decide how to proceed. This is often desirable, as the LLM can decide to retry the tool call, use a different tool, or provide a fallback answer.Surfacing suspensions and terminal errors
There are some Restate errors that should not be handled by the LLM, for example, if a tool execution is suspended waiting for human input. Restate lets tool executions suspend if they need to wait for a long time. This suspension is started by raising an artificial error. These errors should be re-raised by the agent, instead of ingested. Therefore, you should always set your tool’sfailure_error_function
to raise Restate errors like suspensions.
error_handling.py
error_handling.py
The OpenAI Agent SDK also allows setting
failure_error_function
to None
, which will rethrow any error in the agent execution as-is.
Also for example invalid LLM responses (e.g. tool call with invalid arguments or to a tool that doesn’t exist).
The error will then lead to Restate retries. Restate will recover the invocation by replaying the journal entries.
This can lead to infinite retries if the error is not transient.
Therefore, be careful when using this option and handle errors appropriately in your agent logic.
You also might want to set a retry policy at the service or handler level to avoid infinite retries.Retry-ing transient errors
If you use Restate Context actions likectx.run
in your tool execution, Restate will retry any transient errors in these actions until they succeed.
So for all operations that might suffer from transient errors (like network calls, database interactions, etc.), you should use Context actions to make them resilient.
Here is a small practical example:
You can set custom retry policies for
ctx.run
steps in your tool executions.Advanced patterns
Manual Agent Loop
Manual Agent Loop
If you need more control over the agent loop, you can implement it manually using Restate’s durable primitives.This allows you to:This can be extended to include any custom control flow you need: persistent state, parallel tool calls, custom stopping conditions, or custom error handling.Try it out by sending a request to the service:In the UI, you can see how the agent runs multiple iterations and calls tools.
- Parallelize tool calls with
restate.select
andrestate.gather
- Implement custom stopping conditions
- Implement custom logic between steps (e.g. human approval)
- Interact with external systems between steps
- Handle errors in a custom way
advanced/manual_loop_agent.py

Rolling back tool executions on failure
Rolling back tool executions on failure
Sometimes you need to undo previous agent actions when a later step fails. Restate makes it easy to implement compensation patterns (Sagas) for AI agents.Just track the rollback actions as you go, let the agent rethrow terminal tool errors, and execute the rollback actions in reverse order.Here is an example of a travel booking agent that first reserves a hotel, flight and car, and then either confirms them or rolls back if any step fails with a terminal error (e.g. car type not available).We let tools add rollback actions to the agent context for each booking step the do.
The Try it out by sending the following request:Have a look at the UI to see how the flight booking fails, and the bookings are rolled back.Check out the sagas guide for more details.
run
handler catches any terminal errors and runs all the rollback actions.advanced/rollback_agent.py

Long-running background agents
Long-running background agents
Restate supports implementing scheduling and timer logic in your agents.
This allows you to build agents that run periodically, wait for specific times, or implement complex scheduling logic.
Agents can either be long-running or reschedule themselves for later execution.Have a look at the scheduling docs to learn more.
Streaming back intermediate results
Streaming back intermediate results
Have a look at the pub-sub example.
Interrupting agents
Interrupting agents
Have a look at the interruptible coding agent.
Summary
Durable Execution, paired with your existing SDKs, gives your agents a powerful upgrade:- Durable Execution: Automatic recovery from failures without losing progress
- Persistent memory and context: Persistent conversation history and context
- Observability by default across your agents and workflows
- Human-in-the-Loop: Seamless approval workflows with timeouts
- Multi-Agent Coordination: Reliable orchestration of specialized agents
- Suspensions to save costs on function-as-a-service platforms when agents need to wait
- Advanced Patterns: Real-time progress updates, interruptions, and long-running workflows
Next Steps
- Learn more about how to implement resilient tools with Restate in the Tour of Workflows
- Check out the other Restate AI examples on GitHub
- Sign up for Restate Cloud and start building agents without managing infrastructure