AI Agents in Production — Day 4: A/B Testing Prompts & Configs

You wouldn’t ship a code change without testing it. Why ship a prompt change untested?

A single word in a system prompt can flip agent behavior from “helpful” to “hallucinating.” A model upgrade (GPT-4o → GPT-4.1) can shift tool-calling accuracy by 5-15%. Without A/B testing, you’re flying blind.

This post builds an experimentation platform for AI agents:

┌────────────────────────────────────┐
│      Agent A/B Platform            │
│                                    │
│  ┌──────────────┐  ┌────────────┐  │
│  │ Prompt Store │  │ Experiment │  │
│  │ - Versioned  │  │ - Traffic  │  │
│  │ - Tagged     │  │ - Split    │  │
│  │ - Metadata   │  │ - Variants │  │
│  └──────────────┘  └────────────┘  │
│                                    │
│  ┌──────────────┐  ┌────────────┐  │
│  │ Evaluation   │  │ Rollout    │  │
│  │ - Score      │  │ - Gradual  │  │
│  │ - Compare    │  │ - Canary   │  │
│  │ - Auto-decide│  │ - Rollback │  │
│  └──────────────┘  └────────────┘  │
└────────────────────────────────────┘

Step 1: Why A/B Test Prompts?

Prompt engineering isn’t a one-time activity. It’s a continuous cycle:

Write → Test → Measure → Iterate → Ship

What can go wrong:

Changing one example in a few-shot prompt breaks edge cases silently
A new system instruction improves average quality but crashes on specific inputs
Upgrading from GPT-4o to GPT-4.1 changes response format subtly
A seemingly neutral tone tweak makes the agent less trustworthy

What we’re measuring:

Metric	What it tells you	Source
Tool call accuracy	Agent chooses the right tool?	Day 1 tracer
Latency	Response time impact	Day 1 metrics
Token consumption	Cost per request	Day 1 metrics
Cache hit rate	Prompt change affects repeat queries?	Day 2 cache
Error rate	Prompt causing more failures?	Day 3 error handler
User satisfaction	Subjective quality (human eval)	External

Step 2: The Prompt Store - Versioned Config Management

Instead of hardcoding prompts, store them as versioned configs.

`src/experiments/prompt-store.ts`

// src/experiments/prompt-store.ts: Versioned prompt and configuration store

import fs from "fs/promises";
import path from "path";
import crypto from "crypto";

// Routing state is saved in its own file so a restart keeps the rollout percentage
const CONFIG_FILE = "_configs.json";

export interface PromptVersion {
  id: string;              // First 16 hex chars of sha256(content + timestamp)
  name: string;            // "system-prompt" | "issue-extractor" | etc.
  content: string;         // The actual prompt text
  tags: string[];          // ["production", "canary", "rolled-back"]
  metadata: {
    author: string;
    createdAt: number;
    parentId: string | null; // Previous version for diff tracking
    model: string;          // Target LLM model
    description: string;
  };
}

export interface PromptConfig {
  name: string;
  activeVersion: string;          // Version served to traffic inside the rollout percentage
  previousVersion: string | null; // Version served to everyone else during a partial rollout
  rolloutPercent: number;         // 0-100, share of traffic that gets activeVersion
}

export class PromptStore {
  private versions: Map<string, PromptVersion> = new Map();
  private configs: Map<string, PromptConfig> = new Map();
  private storeDir: string;

  constructor(storeDir = "./prompts") {
    this.storeDir = storeDir;
  }

  async init(): Promise<void> {
    await fs.mkdir(this.storeDir, { recursive: true });
    await this.load();
  }

  /**
   * Save a new prompt version.
   */
  async save(version: Omit<PromptVersion, "id">): Promise<PromptVersion> {
    const id = crypto
      .createHash("sha256")
      .update(version.content + Date.now())
      .digest("hex")
      .slice(0, 16);

    const full: PromptVersion = { ...version, id };
    this.versions.set(id, full);
    await this.persistVersion(full);

    return full;
  }

  /**
   * Activate a version for production traffic.
   * With rolloutPercent < 100, the previously active version keeps serving
   * the remaining traffic.
   */
  async activate(name: string, versionId: string, rolloutPercent = 100): Promise<void> {
    const version = this.versions.get(versionId);
    if (!version) {
      throw new Error(`Version ${versionId} not found`);
    }
    if (rolloutPercent < 0 || rolloutPercent > 100) {
      throw new Error(`rolloutPercent must be between 0 and 100, got ${rolloutPercent}`);
    }

    // Remember the version being replaced; re-activating the same version
    // (e.g. raising its percentage) keeps the existing fallback.
    const existing = this.configs.get(name);
    const previousVersion =
      existing && existing.activeVersion !== versionId
        ? existing.activeVersion
        : existing?.previousVersion ?? null;

    this.configs.set(name, { name, activeVersion: versionId, previousVersion, rolloutPercent });

    // Tag the version
    if (!version.tags.includes("active")) version.tags.push("active");
    await this.persistVersion(version);
    await this.persistConfigs();
  }

  /**
   * Get the prompt content for a given name.
   * If rollout < 100%, the seed decides which version the caller gets.
   */
  get(name: string, seed?: string): { version: PromptVersion; isTestVariant: boolean } | null {
    const config = this.configs.get(name);
    if (!config) return null;

    const active = this.versions.get(config.activeVersion);
    if (!active) return null;

    const previous = config.previousVersion
      ? this.versions.get(config.previousVersion)
      : undefined;

    // Full rollout, or nothing to fall back to: everyone gets the active version
    if (config.rolloutPercent >= 100 || !previous) {
      return { version: active, isTestVariant: false };
    }

    // Partial rollout: deterministic based on seed. Without a seed the caller
    // cannot be bucketed, so it gets the stable (previous) version.
    const isInRollout = seed ? this.isInPercentage(config.rolloutPercent, seed) : false;

    return {
      version: isInRollout ? active : previous,
      isTestVariant: isInRollout,
    };
  }

  /**
   * List all versions for a prompt, newest first.
   */
  listVersions(name: string): PromptVersion[] {
    return Array.from(this.versions.values())
      .filter(v => v.name === name)
      .sort((a, b) => b.metadata.createdAt - a.metadata.createdAt);
  }

  /**
   * Roll back from the currently active version to the one before it.
   */
  async rollback(name: string): Promise<PromptVersion | null> {
    const config = this.configs.get(name);
    if (!config) return null;

    const current = this.versions.get(config.activeVersion);
    if (!current) return null;

    // Prefer the version that was live before; otherwise use the newest older version
    const previous =
      (config.previousVersion ? this.versions.get(config.previousVersion) : undefined) ??
      this.listVersions(name).find(v => v.metadata.createdAt < current.metadata.createdAt);
    if (!previous) return null;

    // Tag current as rolled-back
    current.tags = current.tags.filter(t => t !== "active");
    if (!current.tags.includes("rolled-back")) current.tags.push("rolled-back");
    await this.persistVersion(current);

    // Activate previous
    await this.activate(name, previous.id, 100);
    return previous;
  }

  // ──── Private ────

  private async load(): Promise<void> {
    const files = await fs.readdir(this.storeDir).catch(() => [] as string[]);
    for (const file of files) {
      if (!file.endsWith(".json") || file === CONFIG_FILE) continue;
      const data = await fs.readFile(path.join(this.storeDir, file), "utf-8");
      const version: PromptVersion = JSON.parse(data);
      this.versions.set(version.id, version);
    }

    // Restore routing state (active version, fallback, rollout percent) if saved
    const raw = await fs
      .readFile(path.join(this.storeDir, CONFIG_FILE), "utf-8")
      .catch(() => null);
    if (raw) {
      for (const config of JSON.parse(raw) as PromptConfig[]) {
        if (this.versions.has(config.activeVersion)) this.configs.set(config.name, config);
      }
    }
  }

  private async persistVersion(version: PromptVersion): Promise<void> {
    const filePath = path.join(this.storeDir, `${version.name}-${version.id}.json`);
    await fs.writeFile(filePath, JSON.stringify(version, null, 2));
  }

  private async persistConfigs(): Promise<void> {
    const filePath = path.join(this.storeDir, CONFIG_FILE);
    await fs.writeFile(filePath, JSON.stringify(Array.from(this.configs.values()), null, 2));
  }

  /**
   * Deterministic percentage check using a seed string.
   * Same seed + same percentage = same result every time.
   */
  private isInPercentage(percent: number, seed: string): boolean {
    const hash = crypto.createHash("md5").update(seed).digest("hex");
    const num = parseInt(hash.slice(0, 8), 16) % 100;
    return num < percent;
  }
}

Usage:

const store = new PromptStore("./prompts");
await store.init();

// Save production prompt
const v1 = await store.save({
  name: "system-prompt",
  content: "You are a helpful GitHub issue manager...",
  tags: ["initial"],
  metadata: {
    author: "ei",
    createdAt: Date.now(),
    parentId: null,
    model: "gpt-4o",
    description: "Initial system prompt",
  },
});

// Serve v1 in production first so there is a stable version to fall back to
await store.activate("system-prompt", v1.id);

// Save experimental variant
const v2 = await store.save({
  name: "system-prompt",
  content: "You are a precise GitHub issue manager. Always validate issue numbers exist before referencing them...",
  tags: ["experiment"],
  metadata: {
    author: "ei",
    createdAt: Date.now(),
    parentId: v1.id,
    model: "gpt-4.1",
    description: "Add validation instructions",
  },
});

// Activate v2 for 10% of traffic
await store.activate("system-prompt", v2.id, 10);

// In agent runtime: deterministic split by session ID
const sessionId = "session-123"; // Example value; use the real session ID
const { version, isTestVariant } = store.get("system-prompt", sessionId)!;
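
To make the "deterministic split" concrete, here is a minimal standalone sketch of the same hashing idea that `isInPercentage` uses. The helper names `bucketFor`, `assignVariant`, and `treatmentShare` are illustrative and not part of the store. It shows that a seed always lands in the same bucket and that the treatment share roughly tracks the configured percentage.

```typescript
import { createHash } from "node:crypto";

// Map a seed (e.g. a session ID) to a stable bucket in [0, 100).
// Mirrors the md5-prefix approach in PromptStore.isInPercentage.
export function bucketFor(seed: string): number {
  const hash = createHash("md5").update(seed).digest("hex");
  return parseInt(hash.slice(0, 8), 16) % 100;
}

// Hypothetical helper: a seed is in the treatment group when its bucket
// falls below the rollout percentage.
export function assignVariant(
  seed: string,
  rolloutPercent: number
): "treatment" | "control" {
  return bucketFor(seed) < rolloutPercent ? "treatment" : "control";
}

// Fraction of `n` synthetic session IDs that land in treatment.
export function treatmentShare(n: number, rolloutPercent: number): number {
  let hits = 0;
  for (let i = 0; i < n; i++) {
    if (assignVariant(`session-${i}`, rolloutPercent) === "treatment") hits++;
  }
  return hits / n;
}
```

Because the bucket depends only on the seed, raising the percentage only adds sessions to the treatment group: anyone who saw the new prompt at 5% still sees it at 25%.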

Step 3: Experiment Manager - A/B Traffic Splitting

`src/experiments/experiment-manager.ts`

// src/experiments/experiment-manager.ts: A/B experiment lifecycle

import { PromptStore } from "./prompt-store.js";

export interface ExperimentConfig {
  name: string;                    // "system-prompt-v2-vs-v1"
  description: string;
  promptName: string;              // Which prompt to experiment on
  variants: {                      // Exactly two: control first, then treatment
    label: string;                 // "control" | "treatment"
    versionId: string;
    weight: number;                // Traffic share (must sum to 100)
  }[];
  metrics: string[];               // ["tool_accuracy", "latency_p50", "error_rate"]
  startAt: number;
  durationMs: number;              // Auto-stop after this duration
  minSampleSize: number;           // Minimum requests before declaring result
  significanceLevel: number;       // 0.05 = 95% confidence
}

export interface ExperimentResult {
  name: string;
  status: "running" | "completed" | "cancelled";
  samplesPerVariant: Record<string, number>;
  metricsPerVariant: Record<string, Record<string, number>>;
  winner: string | null;          // Winning variant label, or null if inconclusive
  confidence: number | null;
  startedAt: number;
  completedAt: number | null;
}

export class ExperimentManager {
  private experiments: Map<string, ExperimentResult> = new Map();
  private configs: Map<string, ExperimentConfig> = new Map();
  private store: PromptStore;

  constructor(store: PromptStore) {
    this.store = store;
  }

  /**
   * Start a new A/B experiment.
   */
  async start(config: ExperimentConfig): Promise<void> {
    // The store routes traffic between one stable and one candidate version,
    // so this manager handles exactly two variants.
    if (config.variants.length !== 2) {
      throw new Error(`Expected 2 variants (control, treatment), got ${config.variants.length}`);
    }

    // Validate weights sum to 100
    const totalWeight = config.variants.reduce((s, v) => s + v.weight, 0);
    if (totalWeight !== 100) {
      throw new Error(`Variant weights must sum to 100, got ${totalWeight}`);
    }

    // Verify all version IDs exist for this prompt
    const known = new Set(this.store.listVersions(config.promptName).map(v => v.id));
    for (const variant of config.variants) {
      if (!known.has(variant.versionId)) {
        throw new Error(`Version ${variant.versionId} not found for prompt "${config.promptName}"`);
      }
    }

    // Control serves all traffic first, then treatment takes its share
    const [control, treatment] = config.variants;
    await this.store.activate(config.promptName, control.versionId, 100);
    await this.store.activate(config.promptName, treatment.versionId, treatment.weight);

    // Track experiment state
    this.configs.set(config.name, config);
    this.experiments.set(config.name, {
      name: config.name,
      status: "running",
      samplesPerVariant: Object.fromEntries(config.variants.map(v => [v.label, 0])),
      metricsPerVariant: Object.fromEntries(
        config.variants.map(v => [v.label, Object.fromEntries(config.metrics.map(m => [m, 0]))])
      ),
      winner: null,
      confidence: null,
      startedAt: Date.now(),
      completedAt: null,
    });
  }

  /**
   * Record a data point for an experiment.
   */
  record(
    experimentName: string,
    variantLabel: string,
    metrics: Record<string, number>
  ): void {
    const exp = this.experiments.get(experimentName);
    if (!exp || exp.status !== "running") return;
    if (!(variantLabel in exp.samplesPerVariant)) return; // Unknown variant label

    exp.samplesPerVariant[variantLabel]++;

    for (const [key, value] of Object.entries(metrics)) {
      if (key in exp.metricsPerVariant[variantLabel]) {
        // Running average
        const n = exp.samplesPerVariant[variantLabel];
        const current = exp.metricsPerVariant[variantLabel][key];
        exp.metricsPerVariant[variantLabel][key] = current + (value - current) / n;
      }
    }
  }

  /**
   * Check if experiment has enough data and stop.
   */
  evaluate(experimentName: string): ExperimentResult | null {
    const exp = this.experiments.get(experimentName);
    if (!exp) return null;
    if (exp.status !== "running") return exp;

    // Check if the configured minimum sample size is reached
    const required = this.configs.get(experimentName)?.minSampleSize ?? 100;
    const minSamples = Math.min(...Object.values(exp.samplesPerVariant));
    if (minSamples < required) return exp; // Not enough data yet

    // Stub: winner detection via metric comparison.
    // Summing raw means only makes sense when every tracked metric is
    // "higher is better" and on a comparable scale.
    // In production, use proper statistical tests (chi-square, t-test).
    const [control, treatment] = Object.keys(exp.metricsPerVariant);
    const controlScore = Object.values(exp.metricsPerVariant[control]).reduce((a, b) => a + b, 0);
    const treatmentScore = Object.values(exp.metricsPerVariant[treatment]).reduce((a, b) => a + b, 0);

    if (Math.abs(controlScore - treatmentScore) > 0.05) {
      exp.winner = controlScore > treatmentScore ? control : treatment;
      // confidence stays null: the stub runs no statistical test, so it claims none
      exp.status = "completed";
      exp.completedAt = Date.now();
    }

    return exp;
  }

  /**
   * Cancel experiment and restore control to 100%.
   */
  async cancel(experimentName: string, promptName: string): Promise<void> {
    const exp = this.experiments.get(experimentName);
    if (!exp) return;

    exp.status = "cancelled";
    exp.completedAt = Date.now();

    // Restore the control version recorded when the experiment started
    const controlId = this.configs.get(experimentName)?.variants[0]?.versionId;
    if (controlId) {
      await this.store.activate(promptName, controlId, 100);
    }
  }

  getExperiment(name: string): ExperimentResult | null {
    return this.experiments.get(name) ?? null;
  }

  listExperiments(): ExperimentResult[] {
    return Array.from(this.experiments.values());
  }
}
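
The stub above leaves the "proper statistical test" open. As one possible way to fill that gap for a success-rate metric such as tool call accuracy, here is a sketch of a two-sided two-proportion z-test. It is a swapped-in technique, not the author's method; the names `normalCdf` and `twoProportionZTest` are illustrative. It assumes you keep raw success counts per variant rather than only running averages.

```typescript
// Standard normal CDF via the Abramowitz-Stegun erf approximation (formula 7.1.26).
// Its error is around 1e-7, which is enough for a go/no-go decision.
export function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

export interface ProportionTest {
  z: number;            // Positive when variant B has the higher success rate
  pValue: number;       // Two-sided p-value
  significant: boolean; // pValue < alpha
}

// Compare success rates of control (A) and treatment (B).
export function twoProportionZTest(
  successA: number,
  totalA: number,
  successB: number,
  totalB: number,
  alpha = 0.05
): ProportionTest {
  if (totalA <= 0 || totalB <= 0) {
    throw new Error("Both variants need at least one sample");
  }
  const pA = successA / totalA;
  const pB = successB / totalB;
  // Pooled rate under the null hypothesis that both variants perform the same
  const pooled = (successA + successB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  if (se === 0) return { z: 0, pValue: 1, significant: false };

  const z = (pB - pA) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue, significant: pValue < alpha };
}
```

For example, 850 correct tool calls out of 1,000 for control against 890 out of 1,000 for treatment gives a z of roughly 2.66, which clears the 0.05 significance level. Latency and token counts are continuous, so they would need a different test (for instance a t-test).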

Step 4: Gradual Rollout Strategy

Instead of flipping a switch, roll out changes gradually:

// src/experiments/rollout.ts: Gradual rollout (canary deployment)

import { PromptStore } from "./prompt-store.js";

export interface RolloutPlan {
  name: string;
  promptName: string;
  versionId: string;
  steps: {
    percent: number;
    durationMs: number; // Advisory: the caller's scheduler decides when to call advance()
    evaluationCriteria?: { metric: string; threshold: number };
  }[];
}

type ActiveRollout = RolloutPlan & {
  currentStep: number;
  startedAt: number;
  stableVersionId: string | null; // Version that was live before the rollout began
};

export class GradualRollout {
  private store: PromptStore;
  private activeRollouts: Map<string, ActiveRollout> = new Map();

  constructor(store: PromptStore) {
    this.store = store;
  }

  async start(plan: RolloutPlan): Promise<void> {
    if (plan.steps.length === 0) {
      throw new Error(`Rollout "${plan.name}" has no steps`);
    }

    // Capture the live version before step 0 changes routing
    const stableVersionId = this.store.get(plan.promptName)?.version.id ?? null;
    this.activeRollouts.set(plan.name, {
      ...plan,
      currentStep: 0,
      startedAt: Date.now(),
      stableVersionId,
    });
    await this.applyStep(plan, 0);
  }

  /**
   * Advance to next step. Called by a scheduler or on agent startup.
   * Returns true once the version is fully rolled out.
   */
  async advance(name: string): Promise<boolean> {
    const rollout = this.activeRollouts.get(name);
    if (!rollout) return false;

    // Check evaluation criteria for the current step before moving on,
    // including the last step before full rollout
    const criteria = rollout.steps[rollout.currentStep].evaluationCriteria;
    if (criteria) {
      // Fetch metrics and decide whether to proceed
      const passed = await this.evaluateStep(rollout, criteria);
      if (!passed) {
        // Auto-rollback
        await this.rollback(name);
        return false;
      }
    }

    const nextStep = rollout.currentStep + 1;
    if (nextStep >= rollout.steps.length) {
      await this.store.activate(rollout.promptName, rollout.versionId, 100);
      this.activeRollouts.delete(name);
      return true; // Fully rolled out
    }

    await this.applyStep(rollout, nextStep);
    rollout.currentStep = nextStep;
    return false;
  }

  private async applyStep(rollout: RolloutPlan, stepIndex: number): Promise<void> {
    const step = rollout.steps[stepIndex];
    await this.store.activate(rollout.promptName, rollout.versionId, step.percent);
    console.log(`[Rollout] ${rollout.name}: ${step.percent}% (step ${stepIndex + 1}/${rollout.steps.length})`);
  }

  private async evaluateStep(
    rollout: RolloutPlan,
    criteria: { metric: string; threshold: number }
  ): Promise<boolean> {
    // Stub: fetch the metric from your metrics endpoint and compare it with the threshold
    console.log(`[Rollout] Evaluating ${rollout.name}: ${criteria.metric} against threshold ${criteria.threshold}`);
    return true;
  }

  async rollback(name: string): Promise<void> {
    const rollout = this.activeRollouts.get(name);
    if (!rollout) return;

    console.warn(`[Rollout] ${name}: ROLLING BACK`);
    // Find the previous stable version: the one captured at start, else any
    // other version tagged active, else the oldest other version
    const candidates = this.store
      .listVersions(rollout.promptName)
      .filter(v => v.id !== rollout.versionId);
    const stable =
      candidates.find(v => v.id === rollout.stableVersionId) ??
      candidates.find(v => v.tags.includes("active")) ??
      candidates[candidates.length - 1];

    if (stable) {
      await this.store.activate(rollout.promptName, stable.id, 100);
    }

    this.activeRollouts.delete(name);
  }
}

Rollout plan example:

const rollout = new GradualRollout(store);

await rollout.start({
  name: "system-prompt-v2",
  promptName: "system-prompt",
  versionId: v2.id,
  steps: [
    { percent: 1, durationMs: 3600_000, evaluationCriteria: { metric: "error_rate", threshold: 0.01 } },
    { percent: 5, durationMs: 7200_000, evaluationCriteria: { metric: "error_rate", threshold: 0.02 } },
    { percent: 25, durationMs: 86400_000, evaluationCriteria: { metric: "tool_accuracy", threshold: 0.85 } },
    { percent: 50, durationMs: 86400_000 },
    { percent: 100, durationMs: 0 },
  ],
});
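
The `GradualRollout` class never reads `durationMs` itself, so something outside it has to decide when to call `advance`. As a rough sketch of that scheduling step, the helpers below (hypothetical names `stepStartTimes` and `dueStepIndex`) turn a plan's durations into a timeline. This assumes each step starts exactly when the previous step's duration has elapsed.

```typescript
interface StepTiming {
  percent: number;
  durationMs: number;
}

// Earliest start time (epoch ms) of each step: step 0 starts at `startedAt`,
// every later step starts once the previous step's duration has elapsed.
export function stepStartTimes(steps: StepTiming[], startedAt: number): number[] {
  const starts: number[] = [];
  let t = startedAt;
  for (const step of steps) {
    starts.push(t);
    t += step.durationMs;
  }
  return starts;
}

// Index of the step that should be active at time `now`.
// Before the rollout start this is 0; after the last start time it is the last step.
export function dueStepIndex(steps: StepTiming[], startedAt: number, now: number): number {
  const starts = stepStartTimes(steps, startedAt);
  let index = 0;
  for (let i = 0; i < starts.length; i++) {
    if (now >= starts[i]) index = i;
  }
  return index;
}

// The step timings from the example plan above
export const examplePlanSteps: StepTiming[] = [
  { percent: 1, durationMs: 3600_000 },
  { percent: 5, durationMs: 7200_000 },
  { percent: 25, durationMs: 86400_000 },
  { percent: 50, durationMs: 86400_000 },
  { percent: 100, durationMs: 0 },
];
```

A scheduler could poll `dueStepIndex` and call `advance` whenever it is ahead of the rollout's current step. With the example plan, the 100% step becomes due 51 hours after the start (1 h + 2 h + 24 h + 24 h).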

Step 5: Evaluation Pipeline

`src/experiments/evaluator.ts`

// src/experiments/evaluator.ts: Automated prompt evaluation

export interface EvalCase {
  input: string;          // User query
  expectedTool: string;   // Expected tool name
  expectedArgs?: Record<string, unknown>;
  expectedResponse?: string;
  category: string;       // "edge-case" | "happy-path" | "error-case"
}

export interface EvalResult {
  case: EvalCase;
  actualTool: string;
  matchedTool: boolean;
  latencyMs: number;
  tokensUsed: number;
  error: string | null;
}

export class PromptEvaluator {
  private cases: EvalCase[] = [];

  addCase(testCase: EvalCase): void {
    this.cases.push(testCase);
  }

  addBatch(testCases: EvalCase[]): void {
    this.cases.push(...testCases);
  }

  /**
   * Run every stored case through `executor`.
   * `promptVersion` only labels the run; the executor decides which prompt is used.
   */
  async evaluate(
    promptVersion: string,
    executor: (input: string) => Promise<{ tool: string; latencyMs: number; tokens: number }>
  ): Promise<{ results: EvalResult[]; summary: EvalSummary }> {
    const results: EvalResult[] = [];

    for (const testCase of this.cases) {
      const start = Date.now();
      try {
        const response = await executor(testCase.input);
        results.push({
          case: testCase,
          actualTool: response.tool,
          matchedTool: response.tool === testCase.expectedTool,
          latencyMs: response.latencyMs,
          tokensUsed: response.tokens,
          error: null,
        });
      } catch (error) {
        // A failed call counts as a miss; latency is measured locally
        results.push({
          case: testCase,
          actualTool: "error",
          matchedTool: false,
          latencyMs: Date.now() - start,
          tokensUsed: 0,
          error: error instanceof Error ? error.message : String(error),
        });
      }
    }

    return {
      results,
      summary: this.aggregate(results),
    };
  }

  private aggregate(results: EvalResult[]): EvalSummary {
    const total = results.length;
    const correct = results.filter(r => r.matchedTool).length;
    const byCategory = this.groupByCategory(results);

    // Guard every average against an empty result set to avoid NaN
    return {
      totalCases: total,
      accuracy: total > 0 ? correct / total : 0,
      totalErrors: results.filter(r => r.error).length,
      avgLatencyMs: total > 0 ? results.reduce((s, r) => s + r.latencyMs, 0) / total : 0,
      avgTokens: total > 0 ? results.reduce((s, r) => s + r.tokensUsed, 0) / total : 0,
      byCategory: Object.fromEntries(
        Array.from(byCategory.entries()).map(([cat, items]) => [
          cat,
          {
            accuracy: items.filter(r => r.matchedTool).length / items.length,
            count: items.length,
          },
        ])
      ),
    };
  }

  private groupByCategory(results: EvalResult[]): Map<string, EvalResult[]> {
    const map = new Map<string, EvalResult[]>();
    for (const r of results) {
      const list = map.get(r.case.category) || [];
      list.push(r);
      map.set(r.case.category, list);
    }
    return map;
  }
}

export interface EvalSummary {
  totalCases: number;
  accuracy: number;
  totalErrors: number;
  avgLatencyMs: number;
  avgTokens: number;
  byCategory: Record<string, { accuracy: number; count: number }>;
}

Step 6: Admin API Endpoints

// Assumes an Express app. If `app` already exists from earlier in the series, reuse it.
import express from "express";
import { PromptStore } from "./experiments/prompt-store.js";
import { ExperimentManager } from "./experiments/experiment-manager.js";
import { GradualRollout } from "./experiments/rollout.js";
import { PromptEvaluator } from "./experiments/evaluator.js";

// Hypothetical hook into your agent runtime: runs one input against a given prompt version
declare function runAgentWithPrompt(
  promptVersionId: string,
  input: string
): Promise<{ tool: string; latencyMs: number; tokens: number }>;

const app = express();
app.use(express.json()); // Needed so req.body is parsed

const store = new PromptStore();
await store.init(); // Load saved versions before serving requests
const experiments = new ExperimentManager(store);
const rollouts = new GradualRollout(store);
const evaluator = new PromptEvaluator();

// Store endpoints
app.get("/experiments/prompts", (req, res) => {
  const name = req.query.name as string;
  res.json({ versions: store.listVersions(name) });
});

app.post("/experiments/prompts", async (req, res) => {
  const version = await store.save({
    name: req.body.name,
    content: req.body.content,
    tags: req.body.tags || [],
    metadata: req.body.metadata,
  });
  res.json(version);
});

app.post("/experiments/prompts/:name/activate", async (req, res) => {
  const { versionId, rolloutPercent } = req.body;
  // `??` keeps an explicit 0; `||` would turn 0% into 100%
  await store.activate(req.params.name, versionId, rolloutPercent ?? 100);
  res.json({ status: "activated" });
});

app.post("/experiments/prompts/:name/rollback", async (req, res) => {
  const prev = await store.rollback(req.params.name);
  res.json({ status: "rolled-back", version: prev });
});

// Experiment endpoints
app.post("/experiments/start", async (req, res) => {
  await experiments.start(req.body);
  res.json({ status: "started" });
});

app.get("/experiments/:name", (req, res) => {
  res.json(experiments.evaluate(req.params.name));
});

app.post("/experiments/:name/cancel", async (req, res) => {
  await experiments.cancel(req.params.name, req.body.promptName);
  res.json({ status: "cancelled" });
});

// Rollout endpoints
app.post("/experiments/rollout", async (req, res) => {
  await rollouts.start(req.body);
  res.json({ status: "rollout-started" });
});

app.post("/experiments/rollout/:name/advance", async (req, res) => {
  const completed = await rollouts.advance(req.params.name);
  res.json({ status: completed ? "completed" : "advanced" });
});

app.post("/experiments/rollout/:name/rollback", async (req, res) => {
  await rollouts.rollback(req.params.name);
  res.json({ status: "rolled-back" });
});

// Evaluation. A function cannot travel in a JSON body, so the executor is
// built on the server from the requested prompt version.
app.post("/experiments/evaluate", async (req, res) => {
  const promptVersion: string = req.body.promptVersion;
  const { results, summary } = await evaluator.evaluate(
    promptVersion,
    (input) => runAgentWithPrompt(promptVersion, input)
  );
  res.json({ results, summary });
});

Step 7: Integration with Agent Runtime

// In agent runtime: resolve prompt at request time
function buildAgentSystemPrompt(sessionId: string): string {
  // Traffic-split by session ID for deterministic routing
  const result = store.get("system-prompt", sessionId);
  const content = result?.version.content || DEFAULT_SYSTEM_PROMPT;

  // Record which variant this session saw
  logger.info("prompt_resolved", {
    sessionId,
    promptVersion: result?.version.id,
    variant: result?.isTestVariant ? "test" : "control",
  });

  return content;
}

What a Good Evaluation Report Looks Like

=== Prompt Evaluation: system-prompt-v2 (hypothetical) ===

Overall Accuracy: 91.4%  (+4.2% vs baseline ✅)

By Category:
  happy-path:  96.8% (+2.1%)
  edge-case:   82.5% (+8.9% ✅ major improvement)
  error-case:  88.9% (+1.0%)

Latency: Avg 1,432ms (+120ms, p95 within limits)
Tokens:  Avg 412 (-18 tokens ✅ cheaper)

Rollout Decision: PROCEED (meeting all thresholds)
Next Step: Increase to 25% (currently at 5%)
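
A report like this is only trustworthy if the per-category numbers are checked too, since a better overall average can hide a regression in one category. Here is a small sketch of such a gate. The `findRegressions` helper and its 3-point default are assumptions for illustration, and the `Summary` type is a trimmed-down stand-in for `EvalSummary`.

```typescript
interface CategoryStats {
  accuracy: number;
  count: number;
}

// Trimmed-down stand-in for EvalSummary: only the fields this check needs
interface Summary {
  accuracy: number;
  byCategory: Record<string, CategoryStats>;
}

export interface Regression {
  category: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline, negative for a drop
}

// List every category whose accuracy dropped by more than `maxDrop`
// (absolute, e.g. 0.03 = 3 percentage points) compared with the baseline.
export function findRegressions(
  baseline: Summary,
  candidate: Summary,
  maxDrop = 0.03
): Regression[] {
  const regressions: Regression[] = [];
  for (const [category, base] of Object.entries(baseline.byCategory)) {
    const cand = candidate.byCategory[category];
    if (!cand) continue; // Category missing from the candidate run: nothing to compare
    const delta = cand.accuracy - base.accuracy;
    if (delta < -maxDrop) {
      regressions.push({
        category,
        baseline: base.accuracy,
        candidate: cand.accuracy,
        delta,
      });
    }
  }
  return regressions;
}
```

An empty result means no category fell by more than the allowed margin; anything else is a reason to hold the rollout even when the headline accuracy went up.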

Comparison: A/B Testing Approaches

Approach	What changes	Traffic split	Risk	Time to result
Manual edit	Direct prompt edit	100% instantly	High	Instant
Version rollback	Switch between versions	100% flip	Medium	Seconds
Canary rollout	Gradual % increase	1→5→25→50→100%	Low	Days
A/B experiment	Random split 50/50	50% control, 50% test	Low	Hours
Shadow testing	Run both, compare offline	0% user-facing	Minimal	Days

Production Considerations

Deterministic traffic splitting

Always use a stable seed (session ID, user ID) for variant assignment. The same user should see the same variant across requests — otherwise you get inconsistent experiences.

// Good: deterministic
const variant = hash(seed) % 100 < experimentWeight ? "treatment" : "control";

// Bad: random
const variant = Math.random() < 0.5 ? "treatment" : "control";

Metrics fatigue

Don’t track 20 metrics per experiment. Pick 3-5 that matter. More metrics = higher chance of false positives.
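
To put a number on that, here is the standard back-of-the-envelope calculation. It assumes the k metrics are independent and each is tested at the same significance level α, which real metrics only approximate.

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k}

\alpha = 0.05,\; k = 5:  \quad 1 - 0.95^{5}  \approx 0.23
\alpha = 0.05,\; k = 20: \quad 1 - 0.95^{20} \approx 0.64
```

With 20 metrics, an experiment with no real effect still has roughly a two-in-three chance of showing at least one "significant" change. A common, conservative remedy is the Bonferroni correction: test each metric at α / k instead of α.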

Auto-rollback thresholds

Set maximum acceptable degradation for each metric:

const AUTO_ROLLBACK_CONFIG = {
  error_rate: { maxIncrease: 0.02 },      // +2% max
  latency_p95: { maxIncrease: 0.10 },      // +10% max
  token_cost: { maxIncrease: 0.20 },       // +20% max
  tool_accuracy: { maxDecrease: 0.03 },    // -3% max
};
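
The config alone does not say how the limits are applied. One way to check them against live metrics is sketched below. The `relative` flag and the `violatedLimits` helper are assumptions added for illustration: the sketch reads the rate metrics (`error_rate`, `tool_accuracy`) as absolute changes and the latency and cost metrics as relative changes, which is only one possible interpretation of the percentages above.

```typescript
interface Limit {
  maxIncrease?: number;
  maxDecrease?: number;
  relative: boolean; // Assumption: true = fraction of baseline, false = absolute difference
}

export const LIMITS: Record<string, Limit> = {
  error_rate: { maxIncrease: 0.02, relative: false },    // +2 points max
  latency_p95: { maxIncrease: 0.10, relative: true },    // +10% of baseline max
  token_cost: { maxIncrease: 0.20, relative: true },     // +20% of baseline max
  tool_accuracy: { maxDecrease: 0.03, relative: false }, // -3 points max
};

// Return the names of all metrics whose change from baseline exceeds its limit.
export function violatedLimits(
  baseline: Record<string, number>,
  current: Record<string, number>,
  limits: Record<string, Limit> = LIMITS
): string[] {
  const violated: string[] = [];
  for (const [metric, limit] of Object.entries(limits)) {
    const base = baseline[metric];
    const now = current[metric];
    if (base === undefined || now === undefined) continue; // Metric not reported
    if (limit.relative && base === 0) continue; // Relative change is undefined at zero

    const change = limit.relative ? (now - base) / base : now - base;
    if (limit.maxIncrease !== undefined && change > limit.maxIncrease) {
      violated.push(metric);
    } else if (limit.maxDecrease !== undefined && -change > limit.maxDecrease) {
      violated.push(metric);
    }
  }
  return violated;
}
```

A rollout step could trigger a rollback whenever `violatedLimits` returns a non-empty list, and log the returned names so the reason is visible afterwards.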

Prompt storage

Store prompts in a database, not the filesystem, in multi-server setups
Use Redis pub/sub to notify all replicas of version changes
Every prompt change is a new version — never edit in place
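
Because versions are never edited in place, the `parentId` field forms a chain you can walk to see how a prompt evolved. A minimal sketch, with a hypothetical `lineage` helper over a trimmed-down version record:

```typescript
// Only the fields needed to follow the chain; PromptVersion keeps parentId under metadata
interface VersionNode {
  id: string;
  parentId: string | null;
}

// Walk parentId links from `startId` back to the root, newest first.
// Stops at an unknown ID and guards against accidental cycles.
export function lineage(versions: Map<string, VersionNode>, startId: string): string[] {
  const chain: string[] = [];
  const seen = new Set<string>();
  let current: string | null = startId;
  while (current !== null && versions.has(current) && !seen.has(current)) {
    seen.add(current);
    chain.push(current);
    current = versions.get(current)!.parentId;
  }
  return chain;
}
```

Pairing each ID in the chain with its stored content gives you a diff history for audits or for deciding how far back a rollback should go.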

Summary

Concept	Implementation	Benefit
Versioned prompt store	`PromptStore` with tags + metadata	Every change is traceable
Traffic splitting	Deterministic hash-based assignment	Consistent user experience
A/B experiments	`ExperimentManager` with weights + auto-evaluation	Data-driven decisions
Gradual rollout	`GradualRollout` with canary steps + auto-rollback	Low-risk deployment
Eval pipeline	`PromptEvaluator` with test cases + summary	Quantified quality
Admin API	Full CRUD for prompts, experiments, rollouts	Human-in-the-loop

Checklist:

Prompts stored as versioned configs (not hardcoded)
Traffic splitting uses deterministic seed (session ID)
Experiment has clear success metrics
Minimum sample size defined before declaring winner
Auto-rollback thresholds configured
Evaluation test cases cover happy path + edge cases
Rollback tested before promotion
Fallback to stable version on error

Day	Topic
1	Observability & Telemetry ✅
2	Caching Strategies ✅
3	Error Handling & Resilience ✅
4	A/B Testing Prompts & Configs ✅
5	Multi-Region & High Availability
6	Building an Internal Agent Platform

Series: AI Agents in Production. Day 4: A/B testing platform with versioned prompt store, experiment manager, gradual rollout, and automated evaluation pipeline. Full TypeScript source code included.