AI Agents trong Production — Day 4: A/B Testing Prompts & Configs

Ship code change không test là điên. Ship prompt change không test cũng vậy.

Một từ trong system prompt có thể biến agent từ “helpful” thành “hallucinating”. Nâng cấp LLM model (GPT-4o → GPT-4.1) thay đổi tool-calling accuracy 5-15%. Không có A/B testing là mò mẫm.

Bài này xây nền tảng experimentation cho AI agents:

1
Prompt Store → Experiment Manager → Gradual Rollout → Evaluation Pipeline

Vì sao phải A/B test prompts?#

Có thể sai ở đâu:

Đổi one-shot example trong prompt → phá edge cases âm thầm
System instruction mới tốt average nhưng tệ trên inputs cụ thể
Nâng cấp GPT-4o lên GPT-4.1 đổi format response tinh vi
Chỉnh tone nhẹ → agent mất độ tin cậy

Metric nào theo dõi:

Tool call accuracy (Day 1)
Latency (Day 1)
Token consumption (Day 1)
Error rate (Day 3)
Cache hit rate (Day 2)

Step 1: Prompt Store — Quản lý config versioned#

Thay vì hardcode prompts, lưu versioned configs.

`src/experiments/prompt-store.ts`#

1
// PromptVersion: id, name, content, tags, metadata (author, createdAt, parentId, model)
2
// PromptConfig: name, activeVersion, rolloutPercent
3

4
export class PromptStore {
5
  private versions = new Map<string, PromptVersion>();
6
  private configs = new Map<string, PromptConfig>();
7

8
  async save(version: Omit<PromptVersion, "id">): Promise<PromptVersion> {
9
    const id = crypto.createHash("sha256").update(version.content + Date.now()).digest("hex").slice(0, 16);
10
    const full: PromptVersion = { ...version, id };
11
    this.versions.set(id, full);
12
    await fs.writeFile(`./prompts/${full.name}-${id}.json`, JSON.stringify(full, null, 2));
13
    return full;
14
  }
15

16
  async activate(name: string, versionId: string, rolloutPercent = 100): Promise<void> {
17
    this.configs.set(name, { name, activeVersion: versionId, rolloutPercent });
18
    const v = this.versions.get(versionId)!;
19
    if (!v.tags.includes("active")) v.tags.push("active");
20
  }
21

22
  get(name: string, seed?: string): { version: PromptVersion; isTestVariant: boolean } | null {
23
    const config = this.configs.get(name);
24
    if (!config) return null;
25
    const version = this.versions.get(config.activeVersion);
26
    if (!version) return null;
27
    const isInRollout = seed ? this.isInPercentage(config.rolloutPercent, seed) : true;
28
    return { version, isTestVariant: !isInRollout };
29
  }
30

31
  async rollback(name: string): Promise<PromptVersion | null> {
32
    const versions = this.listVersions(name);
33
    if (versions.length < 2) return null;
34
    const current = versions[0], previous = versions[1];
35
    current.tags.push("rolled-back");
36
    await this.activate(name, previous.id, 100);
37
    return previous;
38
  }
39

40
  private isInPercentage(percent: number, seed: string): boolean {
41
    const hash = crypto.createHash("md5").update(seed).digest("hex");
42
    return parseInt(hash.slice(0, 8), 16) % 100 < percent;
43
  }
44

45
  listVersions(name: string): PromptVersion[] {
46
    return Array.from(this.versions.values()).filter(v => v.name === name)
47
      .sort((a, b) => b.metadata.createdAt - a.metadata.createdAt);
48
  }
49
}

Usage:#

1
const v1 = await store.save({
2
  name: "system-prompt",
3
  content: "You are a helpful GitHub issue manager...",
4
  tags: ["initial"],
5
  metadata: { author: "ei", createdAt: Date.now(), parentId: null, model: "gpt-4o", description: "Initial" },
6
});
7

8
const v2 = await store.save({
9
  name: "system-prompt",
10
  content: "You are a precise GitHub issue manager. Always validate issue numbers...",
11
  tags: ["experiment"],
12
  metadata: { author: "ei", createdAt: Date.now(), parentId: v1.id, model: "gpt-4.1", description: "Add validation" },
13
});
14

15
// 10% traffic cho v2
16
await store.activate("system-prompt", v2.id, 10);
17

18
// Trong runtime — deterministic split theo sessionId
19
const { version, isTestVariant } = store.get("system-prompt", sessionId)!;

Step 2: Experiment Manager — A/B Traffic Splitting#

`src/experiments/experiment-manager.ts`#

1
export interface ExperimentConfig {
2
  name: string;
3
  promptName: string;
4
  variants: { label: string; versionId: string; weight: number }[];
5
  metrics: string[];
6
  startAt: number;
7
  durationMs: number;
8
  minSampleSize: number;
9
  significanceLevel: number;
10
}
11

12
export class ExperimentManager {
13
  private experiments = new Map<string, ExperimentResult>();
14

15
  async start(config: ExperimentConfig): Promise<void> {
16
    const totalWeight = config.variants.reduce((s, v) => s + v.weight, 0);
17
    if (totalWeight !== 100) throw new Error(`Weights must sum to 100, got ${totalWeight}`);
18

19
    for (const variant of config.variants) {
20
      await store.activate(config.promptName, variant.versionId, variant.weight);
21
    }
22

23
    this.experiments.set(config.name, {
24
      name: config.name, status: "running",
25
      samplesPerVariant: Object.fromEntries(config.variants.map(v => [v.label, 0])),
26
      metricsPerVariant: Object.fromEntries(config.variants.map(v => [v.label, Object.fromEntries(config.metrics.map(m => [m, 0]))])),
27
      winner: null, confidence: null, startedAt: Date.now(), completedAt: null,
28
    });
29
  }
30

31
  record(name: string, variant: string, metrics: Record<string, number>): void {
32
    const exp = this.experiments.get(name);
33
    if (!exp || exp.status !== "running") return;
34
    exp.samplesPerVariant[variant]++;
35
    for (const [k, v] of Object.entries(metrics)) {
36
      const n = exp.samplesPerVariant[variant];
37
      exp.metricsPerVariant[variant][k] += (v - exp.metricsPerVariant[variant][k]) / n;
38
    }
39
  }
40
}

Step 3: Gradual Rollout#

Canary deployment từng bước:

1
export class GradualRollout {
2
  async start(plan: RolloutPlan): Promise<void> {
3
    // plan.steps = [{ percent: 1, durationMs, evaluationCriteria }, { percent: 5, ... }, ...]
4
  }
5

6
  async advance(name: string): Promise<boolean> {
7
    // Đánh giá step hiện tại — nếu metrics vượt threshold → next step
8
    // Nếu fail → auto-rollback
9
  }
10
}
11

12
// Ví dụ rollout plan:
13
await rollout.start({
14
  name: "system-prompt-v2",
15
  promptName: "system-prompt",
16
  versionId: v2.id,
17
  steps: [
18
    { percent: 1, durationMs: 3600_000, evaluationCriteria: { metric: "error_rate", threshold: 0.01 } },
19
    { percent: 5, durationMs: 7200_000, evaluationCriteria: { metric: "error_rate", threshold: 0.02 } },
20
    { percent: 25, durationMs: 86400_000, evaluationCriteria: { metric: "tool_accuracy", threshold: 0.85 } },
21
    { percent: 50, durationMs: 86400_000 },
22
    { percent: 100, durationMs: 0 },
23
  ],
24
});

Step 4: Evaluation Pipeline#

1
export interface EvalCase {
2
  input: string;                // User query
3
  expectedTool: string;         // Expected tool
4
  expectedArgs?: Record<string, unknown>;
5
  category: string;             // "happy-path" | "edge-case" | "error-case"
6
}
7

8
export class PromptEvaluator {
9
  private cases: EvalCase[] = [];
10

11
  async evaluate(promptVersion: string, executor: any): Promise<{ results: EvalResult[]; summary: EvalSummary }> {
12
    // Chạy từng test case → so sánh actual vs expected
13
    // Output: accuracy, latency, tokens, per-category breakdown
14
  }
15
}

Step 5: Admin API Endpoints#

1
GET  /experiments/prompts?name=system-prompt     → list versions
2
POST /experiments/prompts                          → save new version
3
POST /experiments/prompts/:name/activate           → activate with rollout %
4
POST /experiments/prompts/:name/rollback           → rollback
5
POST /experiments/start                            → start A/B experiment
6
GET  /experiments/:name                            → get result
7
POST /experiments/:name/cancel                     → cancel + restore control
8
POST /experiments/rollout                          → start gradual rollout
9
POST /experiments/rollout/:name/advance            → next step
10
POST /experiments/rollout/:name/rollback           → emergency rollback

Lưu ý Production#

Deterministic traffic splitting#

Luôn dùng seed ổn định (session ID, user ID). Cùng user phải thấy cùng variant.

1
// ✅ Đúng
2
const variant = hash(seed) % 100 < weight ? "treatment" : "control";
3

4
// ❌ Sai
5
const variant = Math.random() < 0.5 ? "treatment" : "control";

Auto-rollback thresholds#

1
const AUTO_ROLLBACK = {
2
  error_rate: { maxIncrease: 0.02 },
3
  latency_p95: { maxIncrease: 0.10 },
4
  tool_accuracy: { maxDecrease: 0.03 },
5
};

Prompt storage#

Database (không filesystem) cho multi-server
Redis pub/sub để notify tất cả replicas
Mỗi thay đổi = version mới — không edit in place

Checklist#

Prompts lưu versioned, không hardcode
Traffic split deterministic (session ID)
Experiment có success metrics rõ ràng
Auto-rollback thresholds configured
Evaluation test cases happy path + edge case
Rollback tested trước khi promote

Day	Chủ đề
1	Observability & Telemetry ✅
2	Caching Strategies ✅
3	Error Handling & Resilience ✅
4	A/B Testing Prompts & Configs ✅
5	Multi-Region & High Availability
6	Building an Internal Agent Platform

Series: AI Agents trong Production. Day 4: A/B testing platform với versioned prompt store, experiment manager, gradual rollout, và evaluator pipeline.