Restate has a flexible programming model that allows it to be integrated with many LLM or Agent SDKs to add fault tolerance to your AI applications. While each SDK has its own API, the integration approach follows similar principles. This guide covers the key patterns and requirements for integrating Restate with different types of SDKs.
If you are interested in the integration of a specific LLM / Agent SDK, please reach out to us via Discord or Slack.

Agent SDKs vs. LLM SDKs

For simplicity, let’s make a distinction between the following two types of SDKs:
[Diagram: LLM SDKs vs. Agent SDKs architecture]
1. LLM SDKs - These provide simplified interfaces for making LLM calls:
  • Abstract away HTTP request/response handling and parsing
  • Offer unified APIs across multiple model providers
  • Support tool definitions, but either don't perform automatic function calling or allow it to be disabled
  • Leave execution control in your hands
  • Best fit for Restate integration
  • Examples: LiteLLM (Python), Vercel AI SDK (TypeScript), LangChain4j (Java), Spring AI (Java)
2. Agent SDKs - These manage both LLM calls and the agent execution loop:
  • Provide high-level DSLs for defining agent behavior
  • Handle the complete agent loop automatically
  • Use configuration-driven approaches
  • Manage tool execution and agent-to-agent communication
  • Require deeper integration with Restate
  • Examples: OpenAI Agents SDK, Vercel AI SDK (TypeScript), CrewAI, AutoGen

Integrating LLM SDKs

With LLM SDKs, you maintain full control over when and how LLM calls are made. The integration is straightforward: Wrap LLM calls in ctx.run() blocks to make them durable and fault-tolerant:
// Instead of direct LLM calls
const result = await openai.chat.completions.create({...});

// Wrap them in ctx.run()
const result = await ctx.run("analyze-text", async () => {
  return openai.chat.completions.create({...});
});
Benefits:
  • Automatic retries on failures (API timeouts, rate limits, network issues)
  • Progress preservation - completed steps aren’t re-executed
  • Works with any LLM SDK without modification
  • You own the control flow and can use Restate to make it resilient
  • Mix LLM-based steps and traditional workflows with a single resiliency strategy
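To illustrate the pattern, the sketch below mixes an LLM step and a traditional workflow step under one durability mechanism. `DurableContext` is a minimal stand-in for Restate's Context (the real one comes from `@restatedev/restate-sdk`), and `callModel` and `sendEmail` are hypothetical helpers:

```typescript
// Minimal stand-in for Restate's durable context, for illustration only.
type RunOptions = { maxRetryAttempts?: number };

class DurableContext {
  // Journal of completed steps; on replay, completed results are returned
  // from here instead of re-executing the action.
  journal = new Map<string, unknown>();

  // `opts` is ignored by this stub; the real SDK honors maxRetryAttempts.
  async run<T>(name: string, action: () => Promise<T>, opts?: RunOptions): Promise<T> {
    if (this.journal.has(name)) return this.journal.get(name) as T; // replay: skip
    const result = await action();
    this.journal.set(name, result); // persist the step result
    return result;
  }
}

// Hypothetical helpers standing in for an LLM SDK call and a side effect.
const callModel = async (prompt: string) => `summary of: ${prompt}`;
const sendEmail = async (body: string) => ({ delivered: true, body });

async function summarizeAndNotify(ctx: DurableContext, text: string) {
  // LLM step: bound retries so a failing model call can't retry forever.
  const summary = await ctx.run("summarize", () => callModel(text), {
    maxRetryAttempts: 3,
  });
  // Traditional step: same durability mechanism, no special casing.
  return ctx.run("notify", () => sendEmail(summary));
}
```

After a crash and retry, the `summarize` result would be served from the journal and only the remaining steps would execute.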
Examples: Find complete integration examples here: TypeScript / Python

Integrating Agent SDKs

Agent SDKs require more careful integration since they control the execution loop. You have two approaches:

Approach 1: Wrap the Entire Agent Execution

The simplest approach is to wrap the complete agent execution in a single ctx.run() block:
const result = await ctx.run("agent-execution", async () => {
  return await agent.execute(input);
});
This provides coarse-grained fault tolerance but treats the entire agent as a single recoverable step.

Approach 2: Fine-Grained Integration

For better resilience, integrate Restate more deeply within the agent execution. This requires a deep understanding of the Agent SDK you are using, as well as careful design and testing to avoid non-determinism issues or infinite retries in production. Key requirements:
  • Wrap model calls: Ensure all LLM requests are wrapped in ctx.run(). We recommend setting the maximum number of retry attempts to prevent infinite retries and high LLM costs. You can do this via the ctx.run options (TS/Python).
  • Pass the Restate Context to tools: Make the Restate Context available in your tool implementations so they can execute durable steps. The recommended way to do this depends on the Agent SDK; have a look at our existing integrations for inspiration. Alternatively, you can integrate more deeply with the Agent SDK and let it wrap each tool execution in ctx.run, treating it as a single recoverable step.
  • Handle terminal errors: Some Agent SDKs catch all errors that happen during tool execution and use them as input to the next iteration of the agent. Restate uses the concept of Terminal Errors for errors which should not be retried. You might want to properly propagate these errors to make sure the agent doesn’t retry them.
  • Propagate suspensions: Restate suspends your agent execution when it is idly waiting. It does this by throwing a suspension error. As with terminal errors, make sure these suspension errors are not ingested by the LLM but are re-raised.
  • Disable streaming LLM responses: Restate’s ctx.run() blocks do not support streaming responses. Therefore, turn off streaming for model responses, or wait for the full response to arrive before returning from the block.
  • Disable parallelism: Configure the agent to execute tools sequentially. Agent SDKs use native async primitives to parallelize tool calls (asyncio.gather in Python, Promise.all in TypeScript). This does not work with Restate’s replay mechanism, because the order of completion may differ on retries, leading to a non-deterministic journal. Therefore, disable this form of parallelism and implement parallelism yourself with Restate’s concurrent task primitives, as shown in this guide.
  • Ensure determinism: If the Agent SDK generates IDs, timestamps or other non-deterministic values, then these need to be persisted in Restate to make replay deterministic.
  • Persist built-in tool results: Many agent SDKs have built-in tools such as web search or database queries. Make sure that these tool executions are wrapped in ctx.run to make them deterministic on replay.
  • State management: Optionally, use Restate for agent session state. Some SDKs let you implement a session store interface to plug in your preferred session state store. For this, you can use Restate’s K/V store in combination with Virtual Objects keyed by session ID.
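The error-propagation requirements above can be sketched as a wrapper around each tool call. `TerminalError` stands in for the class exported by `@restatedev/restate-sdk`; `SuspensionSignal` is a hypothetical name for the internal suspension error, which varies by SDK:

```typescript
// Stand-in error classes for illustration; the real TerminalError comes
// from @restatedev/restate-sdk, and the suspension error is SDK-internal.
class TerminalError extends Error {}
class SuspensionSignal extends Error {}

// A wrapper an integration might place around each tool call. Ordinary
// errors become tool output for the next agent iteration; Restate
// control-flow errors are re-raised so Restate can handle them.
async function runTool(tool: () => Promise<string>): Promise<string> {
  try {
    return await tool();
  } catch (err) {
    if (err instanceof TerminalError || err instanceof SuspensionSignal) {
      throw err; // let Restate see it: don't retry / do suspend
    }
    return `tool failed: ${(err as Error).message}`; // feed back to the LLM
  }
}
```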
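For the determinism requirement, the idea is to generate non-deterministic values once and persist them, so replays see the same value. Restate's TypeScript SDK exposes deterministic helpers such as ctx.rand.uuidv4() for this; `journaledUuid` below is a hypothetical stand-in showing the mechanics:

```typescript
import { randomUUID } from "node:crypto";

// Simplified journal standing in for Restate's persisted execution log.
const journal = new Map<string, string>();

function journaledUuid(stepName: string): string {
  const existing = journal.get(stepName);
  if (existing !== undefined) return existing; // replay: same ID as before
  const id = randomUUID(); // first execution: generate and persist
  journal.set(stepName, id);
  return id;
}
```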
These are the requirements we bumped into when integrating Restate with libraries like Vercel AI SDK and OpenAI Agents SDK. New requirements might need to be added for other SDKs. If you are considering an integration, please reach out to us via Discord or Slack.
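As a sketch of the session-state idea, the handler below keeps chat history in a Virtual Object's K/V state, keyed by session ID. `ObjectContext` here is a minimal stand-in for Restate's object context (the real `ctx.get`/`ctx.set` come from `@restatedev/restate-sdk`):

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Minimal stand-in for Restate's ObjectContext K/V API, for illustration.
class ObjectContext {
  private state = new Map<string, unknown>();
  async get<T>(key: string): Promise<T | null> {
    return (this.state.get(key) as T) ?? null;
  }
  set(key: string, value: unknown): void {
    this.state.set(key, value);
  }
}

// Handlers of a Virtual Object run serialized per key (here: per session
// ID), so concurrent reads and writes of the history can't interleave.
async function appendMessage(ctx: ObjectContext, msg: Message): Promise<Message[]> {
  const history = (await ctx.get<Message[]>("history")) ?? [];
  history.push(msg);
  ctx.set("history", history);
  return history;
}
```

An SDK's session-store interface would then be implemented by calling such handlers for load and save.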
Benefits of fine-grained integration:
  • Individual steps within tool executions are fault-tolerant
  • Partial progress is preserved during failures
  • Better observability into agent execution steps
  • No re-execution of expensive or slow operations