Agent SDKs vs. LLM SDKs
For simplicity, let’s distinguish between the following two types of SDKs:
LLM SDKs:
- Abstract away HTTP request/response handling and parsing
- Offer unified APIs across multiple model providers
- Support tool definitions, but either do not call functions automatically or let you disable automatic calling
- Leave execution control in your hands
- Best fit for Restate integration
- Examples: LiteLLM (Python), Vercel AI SDK (TypeScript), LangChain4j (Java), Spring AI (Java)
Agent SDKs:
- Provide high-level DSLs for defining agent behavior
- Handle the complete agent loop automatically
- Use configuration-driven approaches
- Manage tool execution and agent-to-agent communication
- Require deeper integration with Restate
- Examples: OpenAI Agents SDK, Vercel AI SDK (TypeScript), CrewAI, AutoGen
Integrating LLM SDKs
With LLM SDKs, you maintain full control over when and how LLM calls are made. The integration is straightforward: wrap LLM calls in ctx.run() blocks to make them durable and fault-tolerant.
Benefits:
- Automatic retries on failures (API timeouts, rate limits, network issues)
- Progress preservation - completed steps aren’t re-executed
- Works with any LLM SDK without modification
- You own the control flow and can use Restate to make it resilient
- Mix LLM-based steps and traditional workflows with a single resiliency strategy
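The pattern described in this section can be sketched as follows. To keep the example runnable without a Restate server, it uses a small in-memory stand-in, JournalContext, whose run(name, action) method only mimics the journaling behavior of ctx.run(); it is not the Restate API. The call_llm function is likewise a hypothetical placeholder for a request made through your LLM SDK, and the stand-in keys steps by name, which is a simplification.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple


class JournalContext:
    """In-memory stand-in for a Restate context (illustration only).

    It records the result of each named step so that a re-executed handler
    reuses the recorded result instead of running the step again.
    """

    def __init__(self, journal: Optional[Dict[str, Any]] = None) -> None:
        self.journal: Dict[str, Any] = {} if journal is None else journal

    async def run(self, name: str, action: Callable[[], Awaitable[Any]]) -> Any:
        if name in self.journal:      # step already completed: replay it
            return self.journal[name]
        result = await action()       # first execution: run and record
        self.journal[name] = result
        return result


llm_calls = 0  # counts how often the placeholder model is actually invoked


async def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a request made through an LLM SDK."""
    global llm_calls
    llm_calls += 1
    return f"summary of: {prompt}"


async def summarize(ctx: JournalContext, text: str, fail_after_llm: bool) -> str:
    # The LLM call is a durable step: its result is journaled once.
    summary = await ctx.run("llm-summarize", lambda: call_llm(text))
    if fail_after_llm:
        raise ConnectionError("simulated crash after the LLM call")
    # Ordinary workflow logic can follow in the same handler.
    return summary.upper()


async def demo() -> Tuple[str, int]:
    journal: Dict[str, Any] = {}
    try:
        await summarize(JournalContext(journal), "release notes", fail_after_llm=True)
    except ConnectionError:
        pass  # a durable runtime would retry the handler for you
    # Retry with the same journal: the LLM step is replayed, not re-executed.
    result = await summarize(JournalContext(journal), "release notes", fail_after_llm=False)
    return result, llm_calls
```

Calling asyncio.run(demo()) runs the handler twice against the same journal. The placeholder model is invoked only on the first attempt, which is the "completed steps aren’t re-executed" property from the list above.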
Integrating Agent SDKs
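Agent SDKs require more careful integration since they control the execution loop. You have two approaches:
Approach 1: Wrap the Entire Agent Execution
The simplest approach is to wrap the complete agent execution in a single ctx.run() block. Restate then treats the whole agent run as one step: its final result is persisted once the run completes, but a failure partway through restarts the run from the beginning.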
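A minimal sketch of this coarse-grained approach, under stated assumptions: JournalContext is an in-memory stand-in that only mimics the journaling behavior of ctx.run() (it is not the Restate API), and ToyAgent is a hypothetical placeholder for an Agent SDK that owns its loop.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple


class JournalContext:
    """In-memory stand-in for a Restate context (illustration only)."""

    def __init__(self, journal: Dict[str, Any]) -> None:
        self.journal = journal

    async def run(self, name: str, action: Callable[[], Awaitable[Any]]) -> Any:
        if name in self.journal:      # completed step: return recorded result
            return self.journal[name]
        result = await action()       # nothing is recorded if this raises
        self.journal[name] = result
        return result


class ToyAgent:
    """Hypothetical placeholder for an Agent SDK that runs its own loop."""

    def __init__(self, crash_once_at_round: Optional[int] = None) -> None:
        self.model_calls = 0
        self._crash_at = crash_once_at_round

    async def run(self, task: str) -> str:
        for round_no in range(1, 4):  # three model rounds per run
            self.model_calls += 1
            if self._crash_at == round_no:
                self._crash_at = None  # crash only on the first attempt
                raise ConnectionError("simulated failure inside the agent loop")
        return f"done: {task}"


async def handle(ctx: JournalContext, agent: ToyAgent, task: str) -> str:
    # One durable step around the whole agent loop.
    return await ctx.run("agent-run", lambda: agent.run(task))


async def demo() -> Tuple[str, int, int]:
    journal: Dict[str, Any] = {}
    agent = ToyAgent(crash_once_at_round=2)
    try:
        await handle(JournalContext(journal), agent, "book a flight")
    except ConnectionError:
        pass  # two model calls were made and are lost
    # Retry: the loop starts over and makes three more model calls.
    result = await handle(JournalContext(journal), agent, "book a flight")
    calls_after_retry = agent.model_calls
    # Replay after success: the recorded result is returned, no model calls.
    await handle(JournalContext(journal), agent, "book a flight")
    return result, calls_after_retry, agent.model_calls
```

In this sketch the crashed attempt wastes two model calls because nothing inside the agent loop is journaled, while a replay after success costs none. That trade-off is what the fine-grained approach aims to improve.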
Approach 2: Fine-Grained Integration
For better resilience, integrate Restate more deeply within the agent execution. This requires a thorough understanding of the Agent SDK you are using, plus careful design and testing to avoid non-determinism issues or infinite retries in production.
Key Requirements:
- Wrap model calls: Ensure all LLM requests are wrapped in ctx.run(). We recommend setting a maximum number of retry attempts to prevent infinite retries and high LLM costs. You can do this via the ctx.run options (TS/Python).
- Pass Restate Context to tools: Make the Restate Context available in your tool implementations so that they can execute durable steps. The recommended way to do this depends on the Agent SDK. Have a look at our existing integrations for inspiration. Alternatively, you can integrate more deeply with the Agent SDK and let it wrap each tool execution in ctx.run, treating it as a single recoverable step.
- Handle terminal errors: Some Agent SDKs catch all errors raised during tool execution and feed them into the next iteration of the agent. Restate uses the concept of Terminal Errors for errors that should not be retried. Propagate these errors out of the agent loop so that the agent doesn’t retry them.
- Propagate suspensions: Restate suspends your agent execution while the agent is idly waiting. It does this by throwing a suspension error. As with terminal errors, make sure that these suspension errors are re-raised rather than passed to the LLM.
- Disable streaming LLM responses: Restate’s ctx.run() blocks do not support streaming responses. Therefore, turn off streaming for model responses or wait for the full response to arrive.
- Disable parallelism: Configure the agent to execute tools sequentially. Agent SDKs typically use the native async execution libraries to parallelize tool calls (asyncio.gather for Python, and Promise.all for TypeScript). This does not work with Restate’s replay mechanism because the order of completion may differ on retries, leading to a non-deterministic journal. Therefore, disable this form of parallelism and implement parallelism yourself with Restate’s concurrent task primitives, as shown in this guide.
- Ensure determinism: If the Agent SDK generates IDs, timestamps, or other non-deterministic values, persist them in Restate to make replay deterministic.
- Persist built-in tool results: Many Agent SDKs have built-in tools such as web search or database queries. Make sure that these tool executions are wrapped in ctx.run to make them deterministic on replay.
- State management: Optionally, use Restate for agent session state. Some SDKs let you implement a session store interface to plug in your preferred session state store. For this, you can use Restate’s K/V store in combination with Virtual Objects keyed by session ID.
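The two error-handling requirements (terminal errors and suspensions) can be sketched as a tool wrapper. This is an illustration under assumptions, not any Agent SDK's actual hook: TerminalError and SuspendedError are local stand-ins for the error types of your Restate SDK, run_tool_for_agent is a hypothetical name for the place where an agent loop executes a tool, and the three sample tools are invented for the demo.

```python
import asyncio
from typing import Awaitable, Callable


class TerminalError(Exception):
    """Stand-in for Restate's terminal error type (must not be retried)."""


class SuspendedError(Exception):
    """Stand-in for the error the runtime raises to suspend an invocation."""


# Errors that must escape the agent loop instead of being shown to the model.
PROPAGATE = (TerminalError, SuspendedError)


async def run_tool_for_agent(tool: Callable[[], Awaitable[str]]) -> str:
    """Execute a tool the way an agent loop might, returning text for the LLM."""
    try:
        return await tool()
    except PROPAGATE:
        raise  # re-raise so the durable runtime sees these errors
    except Exception as err:
        # Ordinary tool failures become feedback for the next model turn.
        return f"Tool failed: {err}"


async def lookup_city() -> str:
    raise ValueError("city not found")          # recoverable: model can adapt


async def charge_card() -> str:
    raise TerminalError("payment declined")     # must stop the agent


async def await_approval() -> str:
    raise SuspendedError("waiting for a human") # must reach the runtime


async def demo() -> str:
    feedback = await run_tool_for_agent(lookup_city)
    for tool in (charge_card, await_approval):
        try:
            await run_tool_for_agent(tool)
        except PROPAGATE as err:
            print(f"propagated {type(err).__name__}: {err}")
    return feedback
```

The order of the except clauses matters: the specific clause for PROPAGATE has to come before the generic except Exception, otherwise the generic handler would turn terminal and suspension errors into model input.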
Benefits of fine-grained integration:
- Individual steps within tool executions are fault-tolerant
- Partial progress is preserved during failures
- Better observability into agent execution steps
- No re-execution of expensive or slow operations