Why Anthropic 529 Errors Skip Your Fallback Models (And How to Fix It)
When Anthropic's API returns a 529 "overloaded" error, you'd expect OpenClaw's model fallback system to kick in and try your secondary models. Instead, users are seeing errors bubble up directly — even when they have perfectly good fallbacks configured.
A recent GitHub issue (#28502) uncovered exactly why this happens, and the root cause reveals something important about how OpenClaw's retry architecture actually works.
The Two-Loop Architecture
OpenClaw has two distinct retry mechanisms working together:
- Inner fallback loop (runWithModelFallback): cycles through your configured models (primary → secondary → tertiary) when one fails
- Outer retry loop (agent-runner-execution): catches transient errors and retries the entire fallback chain after a delay
The problem? HTTP 529 is recognized by the outer loop but not by the inner loop.
What Actually Happens
When Anthropic returns 529:
- The inner fallback loop sees an error it doesn't recognize as "fallback-worthy"
- It throws the error up to the outer loop instead of trying your secondary models
- The outer loop catches it (529 is in TRANSIENT_HTTP_ERROR_CODES), waits, then retries
- The retry starts the whole primary→fallback chain again
- If Anthropic is still overloaded, you've burned two attempts on Claude without ever trying your fallback
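The sequence above can be reproduced with a small simulation. This is a minimal sketch, not OpenClaw's actual code: only the names runWithModelFallback and TRANSIENT_HTTP_ERROR_CODES come from the issue, while HttpError, getStatus, FALLBACK_WORTHY_STATUSES, runWithOuterRetry, and the contents of both status sets are assumptions made for illustration.

```typescript
// Hypothetical model of the two loops. Everything except the names
// runWithModelFallback and TRANSIENT_HTTP_ERROR_CODES is invented here.
type ModelCall = (model: string) => string; // returns text or throws

class HttpError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

// Read a numeric `status` property off an unknown thrown value.
function getStatus(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Assumed contents: the article only establishes that 529 is in the
// outer set and is not treated as fallback-worthy by the inner loop.
const TRANSIENT_HTTP_ERROR_CODES = new Set<number>([529]);
const FALLBACK_WORTHY_STATUSES = new Set<number>([429]); // hypothetical

// Inner loop: move to the next model only for fallback-worthy errors.
function runWithModelFallback(models: string[], call: ModelCall): string {
  let lastError: unknown = new Error("no models configured");
  for (const model of models) {
    try {
      return call(model);
    } catch (err) {
      const status = getStatus(err);
      const fallbackWorthy =
        status !== undefined && FALLBACK_WORTHY_STATUSES.has(status);
      if (!fallbackWorthy) throw err; // a 529 leaves the chain here
      lastError = err;
    }
  }
  throw lastError;
}

// Outer loop: retry the whole chain for transient errors.
function runWithOuterRetry(
  models: string[],
  call: ModelCall,
  maxRetries = 1,
): string {
  for (let attempt = 0; ; attempt++) {
    try {
      return runWithModelFallback(models, call);
    } catch (err) {
      const status = getStatus(err);
      const transient =
        status !== undefined && TRANSIENT_HTTP_ERROR_CODES.has(status);
      if (!transient || attempt >= maxRetries) throw err;
      // A real implementation would wait here before retrying.
    }
  }
}

// Record which models are called while the primary keeps returning 529.
const attempted: string[] = [];
const overloadedPrimary: ModelCall = (model) => {
  attempted.push(model);
  if (model === "primary") throw new HttpError(529);
  return `response from ${model}`;
};

let finalError: unknown;
try {
  runWithOuterRetry(["primary", "secondary"], overloadedPrimary);
} catch (err) {
  finalError = err;
}
```

In this model, attempted ends up containing "primary" twice and never "secondary", which matches the behavior reported in the issue.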
The outer loop's comment even explains the reasoning: "transient errors typically affect the whole provider, so falling back to an alternate model first would not help."
But this assumption breaks down when:
- Your fallback is a different provider (OpenAI, local Ollama)
- You're using proxy providers where 529 might be endpoint-specific
- You'd rather get any response than wait and retry the same failing model
The Fix
The proposed solution is simple: add a 529 check to resolveFailoverReasonFromError in failover-error.ts:
if (status === 529) {
  return "timeout";
}
This lets the inner fallback loop try your secondary models before the outer loop ever kicks in. Best of both worlds: fallbacks get attempted first, and if all models return 529, the outer retry loop still provides a second chance after a delay.
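To see the effect of that one branch, here is a hedged sketch of a fallback chain driven by a classifier. resolveFailoverReason is a simplified stand-in for resolveFailoverReasonFromError (the real function takes an error, not a bare status), and the 429 branch, the FailoverReason type, and runChain are assumptions added only so the example runs on its own.

```typescript
// Hypothetical subset of failover reasons; only "timeout" for 529
// comes from the proposed fix.
type FailoverReason = "timeout" | "rate_limit";

// Simplified stand-in for resolveFailoverReasonFromError.
function resolveFailoverReason(status: number): FailoverReason | null {
  if (status === 429) return "rate_limit"; // assumed existing branch
  if (status === 529) return "timeout"; // the proposed addition
  return null; // not fallback-worthy
}

type Attempt = { model: string; status: number };

// Walk the chain, stopping on success or on an error that the
// classifier does not consider fallback-worthy.
function runChain(
  models: string[],
  statusFor: (model: string) => number,
): Attempt[] {
  const attempts: Attempt[] = [];
  for (const model of models) {
    const status = statusFor(model);
    attempts.push({ model, status });
    if (status === 200) break; // success
    if (resolveFailoverReason(status) === null) break; // surface the error
  }
  return attempts;
}

// Primary is overloaded, secondary is healthy.
const attempts = runChain(["primary", "secondary"], (model) =>
  model === "primary" ? 529 : 200,
);

// Every model is overloaded: the chain is exhausted, and the outer
// retry loop would take over from here.
const allOverloaded = runChain(["primary", "secondary"], () => 529);
```

With the 529 branch in place, the first run reaches "secondary" and stops on its 200. Removing that branch makes the chain stop after "primary", which is the current behavior.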
What You Can Do Now
Until this is merged:
- Order your fallbacks strategically — put different providers first in your fallback chain
- Monitor your fallback usage — if you're seeing 529 errors with untouched fallbacks, this is why
- Consider provider diversity — mixing Claude, GPT, and local models gives you resilience against single-provider outages
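The ordering advice can be expressed as a small helper. This is only a sketch: it assumes model ids are written as "provider/model", the ids below are made up, and providerOf and preferOtherProviders are hypothetical names rather than part of OpenClaw. Note that ordering only shapes what is tried once a fallback is triggered; it does not by itself make a 529 on the primary model trigger one.

```typescript
// Assumes ids of the form "provider/model"; ids without a slash are
// treated as their own provider.
function providerOf(modelId: string): string {
  const slash = modelId.indexOf("/");
  return slash === -1 ? modelId : modelId.slice(0, slash);
}

// Put fallbacks from other providers ahead of same-provider fallbacks,
// keeping the relative order within each group.
function preferOtherProviders(primary: string, fallbacks: string[]): string[] {
  const primaryProvider = providerOf(primary);
  const other = fallbacks.filter((m) => providerOf(m) !== primaryProvider);
  const same = fallbacks.filter((m) => providerOf(m) === primaryProvider);
  return other.concat(same);
}

// Illustrative ids only.
const ordered = preferOtherProviders("anthropic/claude-primary", [
  "anthropic/claude-secondary",
  "openai/gpt-fallback",
  "ollama/local-model",
]);
```

Here ordered lists the OpenAI and Ollama entries first and the second Anthropic model last.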
This is a great example of how understanding OpenClaw's internals helps you build more robust agent configurations. The retry architecture is sophisticated, but knowing where the gaps are lets you work around them.
Track the fix: GitHub #28502