AWS for AI/Agent Developers, Day 3: LLM Caching with ElastiCache + Bedrock

LLM calls are the most expensive and slowest part of an agent system. A single generation costs $0.01-0.10 and takes 2-10 seconds. In production, every millisecond and every cent multiplies quickly.

Two caching strategies:

Bedrock Prompt Caching: server-side caching with very little code. Bedrock keeps an exact-match prompt prefix for ~5 minutes, so repeated input tokens are cheaper and faster to process; the model still generates a fresh response.
Semantic Cache on Redis: embed the prompt, store the vector, and look up earlier prompts by semantic similarity. Works with any model and any provider.

Combined, the two can cut cost by 40-70% and latency by 60-90%.

Agent ─▶ Semantic Cache ─▶ Redis ─▶ [miss] ─▶ Bedrock ─▶ LLM
         (ElastiCache)        │                    │
                              │                    │
                        similarity > 0.92     Prompt Cache
                        → return cached       (built-in, 5min)
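The flow in the diagram can be sketched as a small wrapper: check the semantic cache first, call the model only on a miss, then store the result. This is a minimal sketch, not the series' actual implementation; `CacheLookup`, `Generate`, `cachedCompletion`, and `MapCache` are hypothetical names, and the in-memory stand-in only does exact matching so the flow can be tried without Redis.

```typescript
// Hypothetical interfaces: any semantic cache and any LLM client can be plugged in.
interface CacheLookup {
  get(prompt: string): Promise<string | null>;
  set(prompt: string, response: string): Promise<void>;
}

type Generate = (prompt: string) => Promise<string>;

// Check the cache first; call the model only on a miss, then store the result.
async function cachedCompletion(
  prompt: string,
  cache: CacheLookup,
  generate: Generate,
): Promise<{ response: string; cacheHit: boolean }> {
  const cached = await cache.get(prompt);
  if (cached !== null) return { response: cached, cacheHit: true };

  const response = await generate(prompt); // e.g. a Bedrock Converse call
  await cache.set(prompt, response);
  return { response, cacheHit: false };
}

// In-memory stand-in (exact match only) for trying the flow without Redis.
class MapCache implements CacheLookup {
  private store = new Map<string, string>();
  async get(prompt: string): Promise<string | null> {
    return this.store.get(prompt) ?? null;
  }
  async set(prompt: string, response: string): Promise<void> {
    this.store.set(prompt, response);
  }
}
```

Keeping the cache behind an interface means the Redis-backed semantic cache can replace `MapCache` without touching the calling code.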

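Strategy 1: Bedrock Prompt Caching

Built into Bedrock, with no infrastructure to run. On Claude models you opt in by placing a cache checkpoint (cachePoint) in the request. The cached prefix lives for about 5 minutes, and prefixes shorter than the model's minimum token count are not cached.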

# Send the same prompt twice within ~5 minutes.
# A real cache hit also needs a cachePoint block and a prefix above the model's minimum length.
aws bedrock-runtime converse \
  --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \
  --messages '[{"role":"user","content":[{"text":"What is MCP?"}]}]'

Limitations: exact match only, model-specific, and region-specific.
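To make the prefix idea concrete, here is a hedged sketch of a Converse request that marks the system prompt as a cacheable prefix. It assumes the Converse API's `cachePoint` content block; the types are a simplified hand-written subset rather than the SDK's own, and `buildCachedConverseInput` is a hypothetical helper. Whether a given model supports prompt caching should be checked against the Bedrock documentation.

```typescript
// Simplified, hand-written subset of the Converse request shape (not the SDK types).
type ContentBlock = { text: string } | { cachePoint: { type: "default" } };

interface ConverseInput {
  modelId: string;
  system: ContentBlock[];
  messages: { role: "user" | "assistant"; content: ContentBlock[] }[];
}

// Hypothetical helper: the long, stable system prompt goes before the checkpoint,
// the short, changing user text goes after it.
function buildCachedConverseInput(
  modelId: string,
  systemPrompt: string,
  userText: string,
): ConverseInput {
  return {
    modelId,
    system: [
      { text: systemPrompt },
      { cachePoint: { type: "default" } }, // everything above this block is the cached prefix
    ],
    messages: [{ role: "user", content: [{ text: userText }] }],
  };
}
```

The resulting object is what would be passed to the SDK's Converse command. On a hit, the usage section of the response should report cache-read input tokens instead of regular input tokens.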

Strategy 2: Semantic Cache on Redis

Set up ElastiCache

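# Check that the engine and version you pick support the vector search commands (FT.CREATE / FT.SEARCH) used below.
aws elasticache create-serverless-cache \
  --serverless-cache-name llm-semantic-cache \
  --engine redis --major-engine-version 7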

Core code

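import { createClient } from "redis";
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

interface CacheConfig { similarityThreshold: number; maxCacheAgeMs: number; enabled: boolean }

export class SemanticCache {
  constructor(
    private redis: ReturnType<typeof createClient>,
    private bedrock: BedrockRuntimeClient,
    private configs: Record<string, CacheConfig>,
  ) {}

  // Look up a cache hit
  async get(prompt: string, configKey = "claude-sonnet") {
    const config = this.configs[configKey];
    if (!config?.enabled) return null;
    const embedding = await this.embedPrompt(prompt);

    // COSINE distance = 1 - similarity, so the threshold sets the search radius.
    // Assumes the modelId field stores the config key; "-" must be escaped in TAG queries.
    const tag = configKey.replace(/[^\w]/g, "\\$&");
    const results = await this.redis.ft.search("idx:llm-cache",
      `@modelId:{${tag}} @embedding:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: distance}`,
      {
        PARAMS: { radius: 1 - config.similarityThreshold, vec: Buffer.from(new Float32Array(embedding).buffer) },
        SORTBY: "distance", LIMIT: { from: 0, size: 1 },
        RETURN: ["response", "timestamp", "hitCount", "distance"], DIALECT: 2,
      }
    );

    if (!results.total) return null;

    // Check similarity threshold + TTL
    const doc = results.documents[0].value;
    const similarity = 1 - parseFloat(String(doc.distance));
    const age = Date.now() - Number(doc.timestamp);
    if (similarity < config.similarityThreshold || age > config.maxCacheAgeMs) return null;

    return { response: String(doc.response), hitCount: Number(doc.hitCount), similarity };
  }

  // Embed the prompt with Amazon Titan Embeddings
  private async embedPrompt(prompt: string): Promise<number[]> {
    const result = await this.bedrock.send(new InvokeModelCommand({
      modelId: "amazon.titan-embed-text-v2:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({ inputText: prompt, dimensions: 1024, normalize: true }),
    }));
    return JSON.parse(new TextDecoder().decode(result.body)).embedding;
  }
}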
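The lookup above converts a vector distance into a similarity with `1 - distance`. The sketch below spells out that relationship for the COSINE metric and shows why a 0.92 similarity threshold corresponds to a search radius of 0.08. `cosineSimilarity` and `thresholdToRadius` are illustrative helper names, not part of the original class.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// With the COSINE metric, distance = 1 - similarity,
// so a similarity threshold maps directly to a range-query radius.
function thresholdToRadius(similarityThreshold: number): number {
  return 1 - similarityThreshold;
}
```

Because Titan is called with `normalize: true`, the embeddings are unit-length and the cosine similarity reduces to a plain dot product.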

Redis index (run once)

redis-cli -h <endpoint> FT.CREATE idx:llm-cache ON HASH PREFIX 1 cache: \
  SCHEMA prompt TEXT response TEXT \
  embedding VECTOR FLAT 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE \
  modelId TAG timestamp NUMERIC hitCount NUMERIC
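The index only picks up hashes whose key starts with `cache:` and whose `embedding` field holds raw FLOAT32 bytes. The sketch below shows one way to build such an entry; it assumes little-endian encoding (the common case on x86 and ARM hosts), and `vectorToBytes` and `buildCacheEntry` are hypothetical helpers, not code from the series.

```typescript
// Encode an embedding as raw little-endian FLOAT32 bytes (4 bytes per dimension).
function vectorToBytes(embedding: number[]): Uint8Array {
  const bytes = new Uint8Array(embedding.length * 4);
  const view = new DataView(bytes.buffer);
  embedding.forEach((x, i) => view.setFloat32(i * 4, x, true)); // true = little-endian
  return bytes;
}

// Fields for one cache entry, matching the index schema above.
// The caller supplies the id (for example a hash of the prompt) and the current time.
function buildCacheEntry(
  id: string,
  modelId: string,
  prompt: string,
  response: string,
  embedding: number[],
  nowMs: number,
) {
  return {
    key: `cache:${id}`, // must match the PREFIX in FT.CREATE
    fields: {
      prompt,
      response,
      embedding: vectorToBytes(embedding),
      modelId,
      timestamp: String(nowMs),
      hitCount: "0",
    },
  };
}
```

With a Node Redis client, the entry would typically be written with a single hash-set call, wrapping the byte array in a `Buffer` first. Setting an expiry on the key is one way to let Redis evict stale entries on its own.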

Config-Driven Cache Policies

{
  "claude-sonnet": { "similarityThreshold": 0.92, "maxCacheAgeMs": 300000, "enabled": true },
  "claude-haiku":  { "similarityThreshold": 0.90, "maxCacheAgeMs": 60000,  "enabled": true },
  "code-review":   { "similarityThreshold": 0.85, "maxCacheAgeMs": 600000, "enabled": true }
}
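A policy file like this is easiest to trust when it is validated once at startup. The following is a minimal sketch under the assumption that the JSON above is loaded as a string; `parseCacheConfigs`, `resolveConfig`, and the `DISABLED` fallback are hypothetical names, and treating an unknown key as "caching off" is one possible design choice, not necessarily the author's.

```typescript
interface CacheConfig {
  similarityThreshold: number; // cosine similarity in [0, 1]
  maxCacheAgeMs: number;
  enabled: boolean;
}

// Hypothetical fallback for unknown config keys: caching disabled.
const DISABLED: CacheConfig = { similarityThreshold: 1, maxCacheAgeMs: 0, enabled: false };

// Parse and validate the policy JSON; throws on out-of-range values.
function parseCacheConfigs(json: string): Record<string, CacheConfig> {
  const raw = JSON.parse(json) as Record<string, Partial<CacheConfig>>;
  const out: Record<string, CacheConfig> = {};
  for (const key of Object.keys(raw)) {
    const c = raw[key];
    const t = c.similarityThreshold;
    const age = c.maxCacheAgeMs;
    if (typeof t !== "number" || t < 0 || t > 1) {
      throw new Error(`${key}: similarityThreshold must be in [0, 1]`);
    }
    if (typeof age !== "number" || age < 0) {
      throw new Error(`${key}: maxCacheAgeMs must be >= 0`);
    }
    out[key] = { similarityThreshold: t, maxCacheAgeMs: age, enabled: c.enabled !== false };
  }
  return out;
}

// Look up a policy by key, falling back to "caching disabled".
function resolveConfig(configs: Record<string, CacheConfig>, key: string): CacheConfig {
  return configs[key] ?? DISABLED;
}
```

Failing closed on unknown keys means a typo in a config key results in extra LLM calls rather than in stale or mismatched answers.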

When to skip the cache?

Case	Example	Strategy
Time-sensitive	"What is the Bitcoin price right now?"	Check keywords: current, now, latest
User-specific	"What's in my inbox?"	Include userId in the cache key
Write operations	"Send an email to John"	Bypass the cache entirely
Context-dependent	"What did I just learn?"	Skip: depends on conversation context

function shouldBypass(prompt: string): boolean {
  const writeKw = ["send","create","delete","write","schedule"];
  const timeSensitive = [/\b(current|today|now|latest)\b/i, /\bprice\b/i];
  // Normalize case and leading whitespace so "Send ..." matches as well as "send ..."
  const normalized = prompt.trim().toLowerCase();
  return writeKw.some(k => normalized.startsWith(k)) || timeSensitive.some(p => p.test(prompt));
}
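For the user-specific row of the table, the cache has to be partitioned so one user's answer is never served to another. A minimal sketch, assuming the caller already knows whether a prompt is user-specific; `RequestContext` and `cacheScope` are hypothetical names, and the scope string would be stored in (and filtered on) a TAG field such as `modelId`.

```typescript
// Hypothetical request context; field names are illustrative.
interface RequestContext {
  userId?: string;
  userSpecific: boolean; // true when the answer depends on who is asking
}

// Build the scope string used to partition the cache.
// Returns null when the prompt is user-specific but no userId is known,
// in which case the caller should skip the cache.
function cacheScope(modelKey: string, ctx: RequestContext): string | null {
  if (!ctx.userSpecific) return `${modelKey}:shared`;
  if (!ctx.userId) return null;
  return `${modelKey}:user:${ctx.userId}`;
}
```

Returning `null` instead of falling back to the shared scope keeps the failure mode safe: a missing user ID costs an extra LLM call, not a privacy leak.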

Cost Analysis

	No cache	With cache	Savings
500K calls/month	$4,500	$1,350-2,700	$1,800-3,150
Avg latency	3-5s	20-100ms (on a cache hit)	95%+
Infra cost	$0	~$40 (ElastiCache + embedding)	-

Payback in about 2 weeks.
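The savings range follows from the cache hit rate. As a rough sketch, assuming a cache hit costs nothing beyond the fixed infra bill (an assumption; embedding calls on every lookup are folded into that ~$40), a 40% hit rate on a $4,500 baseline nets about $1,760 per month and a 70% hit rate about $3,110. The function names are illustrative.

```typescript
// Monthly cost with a cache: only misses reach the model, plus fixed infra.
function monthlyCostWithCache(baselineCost: number, hitRate: number, infraCost: number): number {
  return baselineCost * (1 - hitRate) + infraCost;
}

// Net monthly savings after paying for the cache infrastructure.
function netSavings(baselineCost: number, hitRate: number, infraCost: number): number {
  return baselineCost - monthlyCostWithCache(baselineCost, hitRate, infraCost);
}
```

Plugging in measured hit rates from production traffic, rather than the 40-70% estimate, gives a more reliable before/after comparison.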

Checklist

Bedrock prompt caching verified (CloudWatch)
ElastiCache Redis cluster created
Redis vector search index created
Embedding with Titan Embeddings
Semantic cache integration
Config-driven policies
Cache invalidation strategies
CloudWatch metrics + custom metrics
Cost tracking before/after

Day	Topic
1	Deploy an MCP Server to ECS Fargate ✅
2	Agent State with DynamoDB Global Tables ✅
3	LLM Caching with ElastiCache + Bedrock ✅
4	Serverless Agent with Lambda + Bedrock
5	Multi-Region Agent Routing with Route53
6	CI/CD for AI Agents with CodePipeline

Series: AWS for AI/Agent Developers. Day 3: LLM caching with Bedrock Prompt Caching + Redis semantic cache. Config-driven, full control.