Context Management & Reliability
Roughly 15% of the exam.
Prompt caching
One invariant explains everything: caching is a prefix match. The cache keys on the exact rendered bytes up to each breakpoint; a single changed byte at position N invalidates every breakpoint at or after N.
-
Render order is
tools → system → messages— a breakpoint on the last system block caches tools and system together. - Maximum 4
cache_controlbreakpoints per request. - Economics (5-minute default TTL): writes ≈ 1.25× base input price (2× for the 1-hour TTL), reads ≈ 0.1×. Break-even after about two uses.
-
Verify with
usage.cache_read_input_tokens. Zero across repeated identical requests means a silent invalidator: a timestamp or UUID interpolated into the system prompt, non-deterministic JSON key order, or a tool set that varies per request. - Tool definition changes and model switches invalidate everything (they render first / caches are model-scoped). Keep the system prompt frozen; inject dynamic context later in the messages.
Tokens & limits
-
Count tokens with the
count_tokensendpoint using the same model you will run inference on. Never tiktoken — it's another vendor's tokenizer and undercounts for Claude. -
stop_reason: "max_tokens"= your per-response output cap hit (raise it or stream).model_context_window_exceeded= the whole conversation no longer fits (compact or trim). Different problems, different fixes. -
Large outputs require streaming in practice — buffered requests at high
max_tokenshit HTTP timeouts.
Batch processing
The Message Batches API processes requests asynchronously at 50% of standard token prices — the default answer for any high-volume, latency-tolerant workload (nightly classification, bulk extraction). Most batches finish within an hour (24-hour max), all Messages API features work inside them, and results stay retrievable for 29 days.
Errors & retries
- Retryable: 429 (rate limit — honor
retry-after), 500 (internal), 529 (overloaded). Exponential backoff; official SDKs retry these automatically. - Not retryable as-is: 400 (malformed request), 401 (auth), 403 (permission), 404 (bad model ID or endpoint), 413 (too large). Fix the request instead.
- Use the SDK's typed exception classes rather than string-matching error messages.
Compaction
Server-side compaction summarizes earlier history when a long conversation approaches the
context limit. The trap the exam checks: you must append the full response.content to history every turn — compaction blocks are state the API
needs back; extracting only the text silently loses them.
Independent community study resource — not affiliated with or endorsed by Anthropic. Claude is a trademark of Anthropic, PBC. All study material and practice questions here are original, written from Anthropic's public documentation. Everything runs in your browser; nothing you answer is stored or transmitted.