AWS for AI/Agent Developers — Day 3: LLM Caching with ElastiCache + Bedrock

LLM calls are the most expensive and slowest part of any agent system. A single generation can cost $0.01-0.10 and take 2-10 seconds. In production, those seconds and cents multiply fast.

Two caching strategies cut that dramatically:

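Bedrock Prompt Caching: server-side cache of prompt prefixes inside Bedrock. You mark a cache point in the request; repeated prefixes are then billed and processed at a reduced rate. The model still generates a fresh response.
Semantic Cache on Redis: application-side cache. Embed each prompt and return a stored response when a semantically similar prompt has already been answered. Works for any model, any provider.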

Combined, the two can reduce LLM costs by 40-70% and latency by 60-90%, because a semantic-cache hit returns in milliseconds instead of seconds.

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Agent      │────▶│  Cache       │────▶│  LLM         │
│   Runtime    │     │  Layer       │     │  (Bedrock)   │
│              │     │              │     │              │
│  Prompt      │     │  ┌────────┐  │     │  ┌────────┐  │
│  Generation  │     │  │ Redis  │  │     │  │ Claude │  │
│              │     │  │ Cache  │  │     │  │ Sonnet │  │
│              │     │  └───┬────┘  │     │  └────────┘  │
│              │     │      │       │     │              │
│              │     │  ┌───┴────┐  │     │  Bedrock     │
│              │     │  │Semantic│  │     │  Prompt      │
│              │     │  │Matcher │  │     │  Cache       │
│              │     │  └────────┘  │     │  (Built-in)  │
└──────────────┘     └──────────────┘     └──────────────┘

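Strategy 1: Bedrock Prompt Caching (One Request Field)

Bedrock supports prompt caching for Anthropic Claude models. You mark the end of a reusable prompt prefix (system prompt, tool definitions, long shared context) with a cachePoint block. When a later request within the cache window starts with the same prefix, Bedrock reads it from cache instead of reprocessing it. Cached input tokens are billed at roughly a 90% discount and time-to-first-token drops. The model still generates a fresh response, so output tokens are billed as usual.

Cache hit: ~$0.0003 vs fresh ~$0.003 per 1K cached input tokens (Claude Sonnet). Prefixes below the model's minimum checkpoint size (about 1,024 tokens for Sonnet) are not cached, so a ~500-token prompt gains nothing.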
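As a rough illustration of the opt-in, the sketch below builds a Converse-style request with a cachePoint after the shared system prompt, then computes what share of the input tokens came from cache. The helper names (buildCachedRequest, cacheReadShare) are hypothetical, the types are trimmed-down stand-ins for the SDK's request and usage shapes, and the share calculation assumes inputTokens does not include cached tokens.

```typescript
// Trimmed-down stand-ins for the Converse request/usage shapes (assumption:
// only the fields needed for this illustration are modelled).
type Block = { text: string } | { cachePoint: { type: "default" } };

interface ConverseInput {
  modelId: string;
  system: Block[];
  messages: { role: "user" | "assistant"; content: Block[] }[];
}

interface Usage {
  inputTokens?: number;
  cacheReadInputTokens?: number;
  cacheWriteInputTokens?: number;
}

// Hypothetical helper: everything before the cachePoint is the cacheable prefix.
function buildCachedRequest(modelId: string, sharedPrefix: string, question: string): ConverseInput {
  return {
    modelId,
    system: [{ text: sharedPrefix }, { cachePoint: { type: "default" } }],
    messages: [{ role: "user", content: [{ text: question }] }],
  };
}

// Hypothetical helper: fraction of input tokens that was read from cache.
// Assumes inputTokens excludes the cached read/write token counts.
function cacheReadShare(usage: Usage): number {
  const read = usage.cacheReadInputTokens ?? 0;
  const write = usage.cacheWriteInputTokens ?? 0;
  const fresh = usage.inputTokens ?? 0;
  const total = read + write + fresh;
  return total === 0 ? 0 : read / total;
}
```

The object returned by buildCachedRequest has the shape that ConverseCommand takes as input. A share that stays at zero across repeated calls usually means the prefix changed between requests or is below the minimum checkpoint size.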

Enable via AWS CLI:

# Bedrock prompt caching is model-specific and region-specific.
# It is opt-in per request: end the reusable prefix with a cachePoint block.
# Cache window: ~5 minutes, reset on every cache hit.

# Verify caching by sending the same long prefix twice.
# system-with-cachepoint.json is a placeholder file name. It holds the system
# blocks: [{"text": "<long shared prompt>"}, {"cachePoint": {"type": "default"}}]
aws bedrock-runtime converse \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --system file://system-with-cachepoint.json \
  --messages '[{"role":"user","content":[{"text":"What is MCP in 10 words?"}]}]'

# Send again within 5 minutes. The first response reports cacheWriteInputTokens
# in its usage block; the second should report cacheReadInputTokens > 0.

Limitations:

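Cache window is ~5 minutes (sliding)
Exact prefix match only, no semantic similarity
Caches input processing, not responses, and only above a minimum prefix size
Model-specific (e.g. Claude 3.5 Sonnet v2, Claude 3.5 Haiku; confirm current support in the Bedrock docs)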
Region-specific

Monitor cache hits in CloudWatch:

# Confirm the exact metric name first:
#   aws cloudwatch list-metrics --namespace AWS/Bedrock
aws cloudwatch get-metric-statistics \
  --namespace AWS/Bedrock \
  --metric-name CacheReadInputTokens \
  --dimensions Name=ModelId,Value=anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --start-time "2026-06-28T00:00:00Z" --end-time "2026-06-28T23:00:00Z" \
  --period 300 --statistics Sum

Strategy 2: Semantic Cache on Redis (Full Control)

This is where the real savings live. Instead of exact matching, we:

Generate an embedding for each prompt
Store in Redis with embedding vector + response + metadata
On new prompt: check if a semantically similar prompt was cached
If similarity > threshold, return cached response
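The four steps above can be sketched without Redis at all. The following in-memory version is only an illustration of the matching logic: the Entry type and the lookup function are hypothetical names, and the embeddings are assumed to come from whatever embedding model you use.

```typescript
// One cached prompt: its embedding and the response that was generated for it.
type Entry = { embedding: number[]; response: string };

// Cosine similarity of two equal-length vectors (1 = same direction, 0 = orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the most similar cached entry, or null when nothing clears the threshold.
function lookup(entries: Entry[], query: number[], threshold: number): Entry | null {
  let best: Entry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== null && bestScore >= threshold ? best : null;
}
```

A linear scan like this is fine for a few hundred entries; the Redis vector index does the same comparison at scale.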

Setup ElastiCache Redis

# Create a serverless Redis cache. Confirm that the engine and version you
# choose support the FT.* vector search commands used later in this guide.
aws elasticache create-serverless-cache \
  --serverless-cache-name llm-semantic-cache \
  --engine redis \
  --major-engine-version 7 \
  --description "LLM prompt semantic cache"

# Note endpoint from output
aws elasticache describe-serverless-caches \
  --serverless-cache-name llm-semantic-cache \
  --query 'ServerlessCaches[0].Endpoint'

`src/cache/semantic-cache.ts`

// src/cache/semantic-cache.ts: Redis-backed semantic cache for LLM prompts

import { createHash } from "node:crypto";
import { createClient, RedisClientType } from "redis";
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

// Shape of one cache entry as stored in a Redis hash
interface CacheEntry {
  prompt: string;
  response: string;
  embedding: number[];           // stored as raw FLOAT32 bytes in Redis
  modelId: string;
  tokensIn: number;
  tokensOut: number;
  timestamp: number;
  hitCount: number;
}

export interface CacheConfig {
  modelId: string;
  similarityThreshold: number;   // 0.0 - 1.0, default ~0.92
  maxCacheAgeMs: number;         // TTL per entry
  maxEntries: number;            // Evict when exceeded
  enabled: boolean;              // Quick toggle
}

export const DEFAULT_CONFIGS: Record<string, CacheConfig> = {
  "claude-sonnet": {
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    similarityThreshold: 0.92,
    maxCacheAgeMs: 300_000,       // 5 minutes (matches Bedrock cache window)
    maxEntries: 10_000,
    enabled: true,
  },
  "claude-haiku": {
    modelId: "anthropic.claude-3-5-haiku-20241022-v1:0",
    similarityThreshold: 0.90,
    maxCacheAgeMs: 60_000,        // 1 minute (cheap model, short cache)
    maxEntries: 5_000,
    enabled: true,
  },
  // You can add custom configs per agent use-case
  "code-review": {
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    similarityThreshold: 0.85,    // Looser for code review patterns
    maxCacheAgeMs: 600_000,       // 10 minutes
    maxEntries: 20_000,
    enabled: true,
  },
};

export class SemanticCache {
  private redis: RedisClientType;
  private bedrock: BedrockRuntimeClient;
  private configs: Record<string, CacheConfig>;
  private embeddingModelId: string;

  constructor(options: {
    redisUrl?: string;
    region?: string;
    configs?: Record<string, CacheConfig>;
    embeddingModelId?: string;
  } = {}) {
    this.redis = createClient({ url: options.redisUrl || process.env.REDIS_URL });
    this.bedrock = new BedrockRuntimeClient({ region: options.region || "us-east-1" });
    this.configs = options.configs || DEFAULT_CONFIGS;
    this.embeddingModelId = options.embeddingModelId || "amazon.titan-embed-text-v2:0";
  }

  async connect(): Promise<void> {
    await this.redis.connect();
  }

  /**
   * Get cached response for a prompt, or null if no match.
   */
  async get(
    prompt: string,
    configKey: string = "claude-sonnet"
  ): Promise<{
    response: string;
    hitCount: number;
    similarity: number;
  } | null> {
    const config = this.configs[configKey];
    if (!config || !config.enabled) return null;

    // Generate embedding for this prompt
    const embedding = await this.embedPrompt(prompt);

    // Range query on the vector index. Cosine distance = 1 - similarity,
    // so the search radius is 1 - threshold.
    const results = await this.redis.ft.search(
      "idx:llm-cache",
      `@modelId:{${this.escapeTag(config.modelId)}} @embedding:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: distance}`, {
        PARAMS: { vec: this.vectorToBuffer(embedding), radius: 1 - config.similarityThreshold },
        SORTBY: "distance",
        RETURN: ["response", "hitCount", "timestamp", "distance"],
        LIMIT: { from: 0, size: 1 },
        DIALECT: 2,
      }
    );

    if (!results.total) return null;

    const doc = results.documents[0];
    const similarity = 1 - parseFloat(doc.value.distance as string);

    // Check threshold
    if (similarity < config.similarityThreshold) return null;

    // Check TTL (hash fields come back as strings)
    const age = Date.now() - Number(doc.value.timestamp);
    if (age > config.maxCacheAgeMs) {
      // Stale entry: delete and treat as miss. doc.id is the full key,
      // including the "cache:" prefix.
      await this.redis.del(doc.id);
      return null;
    }

    // Increment hit count
    const hitCount = await this.redis.hIncrBy(doc.id, "hitCount", 1);

    return {
      response: doc.value.response as string,
      hitCount,
      similarity,
    };
  }

  /**
   * Store a prompt-response pair in the cache.
   */
  async set(
    prompt: string,
    response: string,
    tokensIn: number,
    tokensOut: number,
    modelId: string,
    configKey?: string
  ): Promise<void> {
    // Include the model in the key so the same prompt can be cached per model
    const key = `cache:${this.hash(`${modelId}:${prompt}`)}`;
    const embedding = await this.embedPrompt(prompt);

    // Check if we need to evict (assumes this Redis database holds only cache keys)
    const currentCount = await this.redis.dbSize();
    const config = configKey ? this.configs[configKey] : null;
    if (config && currentCount >= config.maxEntries) {
      await this.evictOldest();
    }

    // Hashes under the "cache:" prefix are picked up by idx:llm-cache
    // automatically; no separate indexing call is needed.
    await this.redis.hSet(key, {
      prompt,
      response,
      embedding: this.vectorToBuffer(embedding),
      modelId,
      tokensIn,
      tokensOut,
      timestamp: Date.now(),
      hitCount: 1,
    });
  }

  /**
   * Invalidate cache entries matching a pattern.
   * Note: KEYS is O(N) and blocks Redis; prefer SCAN on large caches.
   */
  async invalidate(pattern: string): Promise<number> {
    const keys = await this.redis.keys(`cache:${pattern}*`);
    if (keys.length === 0) return 0;
    const deleted = await this.redis.del(keys);
    return deleted;
  }

  /**
   * Clear all cache entries.
   */
  async clear(): Promise<void> {
    const keys = await this.redis.keys("cache:*");
    if (keys.length > 0) await this.redis.del(keys);
  }

  /**
   * Get cache stats.
   */
  async getStats(): Promise<{
    totalEntries: number;
    totalHits: number;
    oldestEntry: number;
    newestEntry: number;
  }> {
    const keys = await this.redis.keys("cache:*");
    let totalHits = 0;
    let oldestEntry = Date.now();
    let newestEntry = 0;

    for (const key of keys) {
      // Fetch only the two small fields, not the binary embedding
      const [hitCount, timestamp] = await this.redis.hmGet(key, ["hitCount", "timestamp"]);
      if (hitCount) totalHits += parseInt(hitCount, 10);
      if (timestamp) {
        const ts = parseInt(timestamp, 10);
        if (ts < oldestEntry) oldestEntry = ts;
        if (ts > newestEntry) newestEntry = ts;
      }
    }

    return {
      totalEntries: keys.length,
      totalHits,
      oldestEntry,
      newestEntry,
    };
  }

  // ──── Private ────

  /**
   * Generate embedding for a prompt using Amazon Titan Embeddings.
   */
  private async embedPrompt(prompt: string): Promise<number[]> {
    const command = new InvokeModelCommand({
      modelId: this.embeddingModelId,
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        inputText: prompt,
        dimensions: 1024,
        normalize: true,
      }),
    });

    const result = await this.bedrock.send(command);
    const body = JSON.parse(new TextDecoder().decode(result.body));

    return body.embedding;
  }

  // Vector fields expect raw FLOAT32 bytes, not a JSON-style string
  private vectorToBuffer(vector: number[]): Buffer {
    return Buffer.from(new Float32Array(vector).buffer);
  }

  // TAG queries treat punctuation as syntax, so escape it in model IDs
  private escapeTag(value: string): string {
    return value.replace(/[^a-zA-Z0-9_]/g, "\\$&");
  }

  private async evictOldest(): Promise<void> {
    // Find and remove the oldest 10% of entries
    const keys = await this.redis.keys("cache:*");
    const entries: { key: string; timestamp: number }[] = [];

    for (const key of keys) {
      const timestamp = await this.redis.hGet(key, "timestamp");
      entries.push({ key, timestamp: parseInt(timestamp || "0", 10) });
    }

    entries.sort((a, b) => a.timestamp - b.timestamp);
    const toEvict = entries.slice(0, Math.ceil(entries.length * 0.1));

    if (toEvict.length > 0) {
      await this.redis.del(toEvict.map(e => e.key));
    }
  }

  // Short, non-cryptographic use of MD5: it only names the Redis key
  private hash(input: string): string {
    return createHash("md5").update(input).digest("hex").slice(0, 12);
  }
}

Create the Redis search index:

# Run once after Redis cluster is ready: index setup via FT.CREATE
# (--tls is needed because ElastiCache Serverless encrypts traffic in transit)
redis-cli -h <cache-endpoint> --tls FT.CREATE idx:llm-cache ON HASH PREFIX 1 cache: \
  SCHEMA prompt TEXT SORTABLE \
  response TEXT SORTABLE \
  embedding VECTOR FLAT 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE \
  modelId TAG SORTABLE \
  timestamp NUMERIC SORTABLE \
  hitCount NUMERIC SORTABLE

This creates a vector search index on the embedding field using cosine distance; hashes written under the cache: prefix are indexed automatically. The FT.* commands come from the Redis search module, which a stock Redis OSS 7 engine does not include, so confirm that your ElastiCache engine and version support vector search before relying on this setup.
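A vector field declared as FLOAT32 stores raw bytes, and a COSINE index reports distance rather than similarity, so two small conversions sit between the application and the index. The sketch below shows both under those assumptions; encodeVector, decodeVector, and distanceToSimilarity are hypothetical helper names.

```typescript
// Pack a number[] into the raw little-endian FLOAT32 bytes a vector field stores.
function encodeVector(vector: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(vector).buffer);
}

// Unpack raw FLOAT32 bytes back into numbers (values are rounded to float32 precision).
function decodeVector(bytes: Uint8Array): number[] {
  if (bytes.byteLength % 4 !== 0) throw new Error("length must be a multiple of 4");
  // slice() copies into a fresh, correctly aligned buffer
  return Array.from(new Float32Array(bytes.slice().buffer));
}

// With DISTANCE_METRIC COSINE the index returns distance = 1 - cosine similarity.
function distanceToSimilarity(distance: number): number {
  return 1 - distance;
}
```

A 1024-dimension embedding therefore occupies 4,096 bytes per entry. In Node.js, Buffer.from(encodeVector(v)) produces the Buffer form that Redis clients typically accept for binary fields.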

Step 3: Integration with Agent Runtime

// src/agent-with-cache.ts: LLM proxy with cache-first strategy

// DEFAULT_CONFIGS must be exported from semantic-cache.ts for this import to work
import { SemanticCache, DEFAULT_CONFIGS } from "./cache/semantic-cache.js";
import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

export class CachedAgent {
  private bedrock: BedrockRuntimeClient;
  private cache: SemanticCache;

  constructor() {
    this.bedrock = new BedrockRuntimeClient({ region: "us-east-1" });
    this.cache = new SemanticCache({ redisUrl: process.env.REDIS_URL });
  }

  /** Connect to Redis once, before the first generate() call. */
  async init(): Promise<void> {
    await this.cache.connect();
  }

  async generate(prompt: string, configKey: string = "claude-sonnet"): Promise<{
    response: string;
    cached: boolean;
    latency: number;
    cost?: number;
  }> {
    const start = Date.now();

    // 1. Try semantic cache
    const cached = await this.cache.get(prompt, configKey);
    if (cached) {
      return {
        response: cached.response,
        cached: true,
        latency: Date.now() - start,
        cost: 0,
      };
    }

    // 2. Cache miss: call Bedrock
    const modelId = DEFAULT_CONFIGS[configKey]?.modelId || configKey;
    const command = new ConverseCommand({
      modelId,
      messages: [{ role: "user", content: [{ text: prompt }] }],
      inferenceConfig: { maxTokens: 4096 },
    });

    try {
      const result = await this.bedrock.send(command);
      const response = result.output?.message?.content?.[0]?.text || "";
      const tokensIn = result.usage?.inputTokens || 0;
      const tokensOut = result.usage?.outputTokens || 0;

      // 3. Store non-empty responses in cache (fire and forget)
      if (response) {
        this.cache.set(prompt, response, tokensIn, tokensOut, modelId, configKey).catch(() => {});
      }

      return {
        response,
        cached: false,
        latency: Date.now() - start,
        cost: this.estimateCost(tokensIn, tokensOut, configKey),
      };
    } catch (error) {
      // Errors are returned to the caller and never written to the cache
      return {
        response: `Error: ${error}`,
        cached: false,
        latency: Date.now() - start,
      };
    }
  }

  private estimateCost(tokensIn: number, tokensOut: number, configKey: string): number {
    // Approximate pricing per 1K tokens
    const rates: Record<string, { in: number; out: number }> = {
      "claude-sonnet": { in: 0.003, out: 0.015 },
      "claude-haiku": { in: 0.0008, out: 0.004 },
    };
    const rate = rates[configKey] || rates["claude-sonnet"];
    return (tokensIn / 1000) * rate.in + (tokensOut / 1000) * rate.out;
  }
}

Step 4: Config-Driven Cache Policies

Instead of hardcoding cache behavior, use a config file:

{
  "cache": {
    "claude-sonnet": {
      "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
      "similarityThreshold": 0.92,
      "maxCacheAgeMs": 300000,
      "maxEntries": 10000,
      "enabled": true
    },
    "claude-haiku": {
      "modelId": "anthropic.claude-3-5-haiku-20241022-v1:0",
      "similarityThreshold": 0.90,
      "maxCacheAgeMs": 60000,
      "maxEntries": 5000,
      "enabled": true
    },
    "code-review": {
      "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
      "similarityThreshold": 0.85,
      "maxCacheAgeMs": 600000,
      "maxEntries": 20000,
      "enabled": true
    }
  },
  "embedding": {
    "modelId": "amazon.titan-embed-text-v2:0",
    "dimensions": 1024
  }
}
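One way to consume such a file is to parse and validate it at startup, so that a typo in a threshold fails fast instead of silently disabling the cache. The sketch below is a minimal example under that assumption; parseCachePolicies is a hypothetical function, and the fallback values simply mirror the claude-sonnet entry above.

```typescript
interface CacheConfig {
  modelId: string;
  similarityThreshold: number;
  maxCacheAgeMs: number;
  maxEntries: number;
  enabled: boolean;
}

// Hypothetical loader: turns the JSON text of the config file into validated policies.
function parseCachePolicies(json: string): Record<string, CacheConfig> {
  const raw = JSON.parse(json) as { cache?: Record<string, Partial<CacheConfig>> };
  const policies: Record<string, CacheConfig> = {};

  for (const [name, entry] of Object.entries(raw.cache ?? {})) {
    if (typeof entry.modelId !== "string" || entry.modelId === "") {
      throw new Error(`${name}: modelId is required`);
    }
    const threshold = entry.similarityThreshold ?? 0.92;
    if (typeof threshold !== "number" || threshold < 0 || threshold > 1) {
      throw new Error(`${name}: similarityThreshold must be between 0 and 1`);
    }
    policies[name] = {
      modelId: entry.modelId,
      similarityThreshold: threshold,
      maxCacheAgeMs: entry.maxCacheAgeMs ?? 300_000,
      maxEntries: entry.maxEntries ?? 10_000,
      enabled: entry.enabled ?? true,
    };
  }
  return policies;
}
```

The resulting object has the same shape as the configs option accepted by the SemanticCache constructor.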

Step 5: Monitoring

CloudWatch Dashboard

aws cloudwatch put-dashboard --dashboard-name LLM-Cache --dashboard-body '{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ElastiCache", "CacheHits", {"stat": "Sum"}],
          ["AWS/ElastiCache", "CacheMisses", {"stat": "Sum"}]
        ],
        "region": "us-east-1",
        "period": 300,
        "title": "Redis Cache Hit/Miss"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/Bedrock", "CacheReadInputTokens", {"stat": "Sum"}],
          ["AWS/Bedrock", "Invocations", {"stat": "Sum"}]
        ],
        "region": "us-east-1",
        "period": 300,
        "title": "Bedrock Cache Metrics"
      }
    }
  ]
}'

Application-level metrics via CloudWatch:

// Emit custom metrics
import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "us-east-1" });

async function emitCacheMetric(
  hit: boolean,
  latency: number,
  configKey: string,
  similarity?: number
): Promise<void> {
  await cw.send(new PutMetricDataCommand({
    Namespace: "Agent/LLMCache",
    MetricData: [
      {
        MetricName: hit ? "CacheHits" : "CacheMisses",
        Value: 1,
        Unit: "Count",
        Dimensions: [{ Name: "ModelConfig", Value: configKey }],
      },
      {
        MetricName: "Latency",
        Value: latency,
        Unit: "Milliseconds",
        Dimensions: [{ Name: "CacheStatus", Value: hit ? "hit" : "miss" }],
      },
      ...(similarity !== undefined
        ? [{
            MetricName: "SimilarityScore",
            Value: similarity,
            Unit: "None",
            Dimensions: [{ Name: "ModelConfig", Value: configKey }],
          }]
        : []),
    ],
  }));
}

Step 6: Cache Invalidation Strategies

Not all prompts should be cached. Some need fresh responses every time:

When to skip cache	Example	Strategy
Time-sensitive	“What’s the current price of Bitcoin?”	Check if prompt contains time-sensitive keywords
User-specific	“What’s in my inbox?”	Include userId in cache key → user-specific caches
Write operations	“Send an email to John”	Bypass cache entirely for mutation intents
Context-dependent	“What did I just learn?”	Depends on full conversation context, not just latest prompt

function shouldBypassCache(prompt: string): boolean {
  // A leading imperative verb suggests a mutation intent. The word boundary
  // keeps words such as "sender" or "updated" from matching.
  const writePattern = /^\s*(send|create|delete|update|write|schedule)\b/i;
  const timeSensitivePatterns = [
    /\b(current|today|now|latest|live)\b/i,
    /\bprice\b/i, /\brate\b/i, /\bstock\b/i,
    /\bweather\b/i,
  ];

  const isWrite = writePattern.test(prompt);
  const isTimeSensitive = timeSensitivePatterns.some(p => p.test(prompt));

  return isWrite || isTimeSensitive;
}
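For the user-specific row in the table, a key prefix alone does not isolate users, because the semantic lookup goes through the vector index rather than through the key. One option is to store the user on each entry and filter on it at query time. The sketch below builds such a filter; it assumes a userId TAG field has been added to the index schema (it is not part of the FT.CREATE command shown earlier), and escapeTag and buildScopeFilter are hypothetical names.

```typescript
// TAG queries treat punctuation as syntax, so escape anything that is not
// a letter, digit, or underscore.
function escapeTag(value: string): string {
  return value.replace(/[^a-zA-Z0-9_]/g, "\\$&");
}

// Build the filter part of a search query. Without a userId the lookup is
// shared across users; with one, only that user's entries can match.
// Assumption: the index schema declares both modelId and userId as TAG fields.
function buildScopeFilter(modelId: string, userId?: string): string {
  const parts = [`@modelId:{${escapeTag(modelId)}}`];
  if (userId !== undefined && userId !== "") {
    parts.push(`@userId:{${escapeTag(userId)}}`);
  }
  return parts.join(" ");
}
```

The returned string would take the place of the modelId clause in front of the vector clause of the search query. Entries written for a user need the same userId value stored in their hash.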

Step 7: Cost Analysis

Without caching:

500K LLM calls/month × Claude Sonnet
Avg 500 input tokens + 500 output tokens
Cost: 500K × ($0.0015 + $0.0075) = $4,500/mo
Avg latency: 3-5 seconds

With semantic cache (40-70% hit rate):

Metric	No cache	With cache	Saved
Calls to LLM	500K	150-300K	40-70%
Monthly cost	$4,500	$1,350-2,700	$1,800-3,150
Latency	3-5s	20-100ms on a hit, 3-5s on a miss	95%+ on hits
Embedding cost	$0	~$5-10	n/a
ElastiCache	$0	~$30	n/a
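The arithmetic behind the table can be written as a small function, which makes it easy to plug in your own traffic and hit rate. This is only a sketch: monthlyCost and its parameter names are hypothetical, and the fixed cost of $40 used in the usage note is an example figure built from the embedding and ElastiCache rows above.

```typescript
interface CostInputs {
  callsPerMonth: number;
  inputTokens: number;       // average per call
  outputTokens: number;      // average per call
  inputRatePer1K: number;    // USD per 1K input tokens
  outputRatePer1K: number;   // USD per 1K output tokens
  hitRate: number;           // 0..1, share of calls answered from the semantic cache
  fixedMonthlyCost: number;  // USD for embeddings plus cache infrastructure
}

// Only cache misses reach the LLM; hits cost nothing beyond the fixed overhead.
function monthlyCost(p: CostInputs): number {
  if (p.hitRate < 0 || p.hitRate > 1) throw new Error("hitRate must be between 0 and 1");
  const perCall =
    (p.inputTokens / 1000) * p.inputRatePer1K +
    (p.outputTokens / 1000) * p.outputRatePer1K;
  const llmCalls = p.callsPerMonth * (1 - p.hitRate);
  return llmCalls * perCall + p.fixedMonthlyCost;
}
```

With the figures above (500K calls, 500 input and 500 output tokens, $0.003 and $0.015 per 1K), a hit rate of 0 and no fixed cost gives about $4,500, while a 40% hit rate with $40 of fixed cost gives about $2,740.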

Payback: ~2 weeks for the cache infrastructure.

Summary

Layer	Technology	Cache scope	Latency	Cost reduction
Bedrock Prompt Cache	Built-in	5-min exact prefix match	Lower time-to-first-token	~90% on cached input tokens
Redis Semantic Cache	ElastiCache	Configurable semantic	~20-100ms hit	40-70%
Agent-level TTL	Custom	Per-config	N/A	Avoids stale

Checklist:

Bedrock prompt caching enabled (cachePoint added to requests; verify cacheReadInputTokens in the response usage)
ElastiCache Redis cluster created (Serverless or node-based)
Redis vector search index created (FT.CREATE with VECTOR schema)
Embedding generation using Titan Embeddings configured
Semantic cache service integrated with agent runtime
Config-driven cache policies (per model, per use-case)
Cache invalidation strategies (time-sensitive, writes, context-dependent)
Monitoring: CloudWatch dashboard + custom metrics
Cost tracking: before/after comparison

Day	Topic
1	Deploy MCP Server on ECS Fargate ✅
2	Agent State with DynamoDB Global Tables ✅
3	LLM Caching with ElastiCache + Bedrock ✅
4	Serverless Agent with Lambda + Bedrock
5	Multi-Region Agent Routing with Route53
6	CI/CD for AI Agents with CodePipeline

Series: AWS for AI/Agent Developers. Day 3: LLM caching with Bedrock Prompt Caching (server-side, one request field) and Redis semantic caching (ElastiCache, config-driven, full control). Full TypeScript code, CloudWatch monitoring, cost analysis.