AI Agents in Production — Day 1: Observability & Telemetry

Your agent runs. It calls tools. It talks to the LLM. But what actually happens inside? How many tokens did the last task cost? Did it get stuck in a loop? Did it call delete_repository accidentally?

Without observability, your agent is a black box. You only notice problems when something breaks — or when the bill arrives.

In this first post of the “AI Agents in Production” series, we instrument an MCP-based agent from scratch and build a complete observability layer.

What We’re Building

┌──────────────────────────────────────┐
│            Agent Runtime             │
│  ┌─────────┐  ┌───────────┐          │
│  │ LLM     │  │ MCP Tools │          │
│  │ Calls   │  │ Calls     │          │
│  └────┬────┘  └─────┬─────┘          │
│       │             │                │
│  ┌────▼─────────────▼────┐           │
│  │  Telemetry Middleware │           │
│  └────┬──────────────────┘           │
│       │                              │
│  ┌────▼────┐                         │
│  │ Logs    │ → stdout / file         │
│  │ Traces  │ → OpenTelemetry         │
│  │ Metrics │ → Prometheus            │
│  └─────────┘                         │
└──────────────────────────────────────┘

Three observability signals:

| Signal | What | Where it goes |
| --- | --- | --- |
| Logs | Structured JSON per event | stdout, file, OpenObserve |
| Traces | Tool call chains with duration | OpenTelemetry collector |
| Metrics | Token counts, costs, error rates | Prometheus + Grafana |

Step 1: The Problem (an Uninstrumented Agent)

Here’s a typical MCP agent interaction. It looks simple:

// Uninstrumented: no visibility
const result = await tool.call("search_issues", {
  query: "repo:owner/name is:open bug"
});

But there are questions you can’t answer:

- How long did this tool call take?
- How many tokens did the LLM spend deciding which tool to call?
- Did it retry? Was the first attempt a failure?
- What’s the running cost of this session?

Let’s fix this.

Step 2: Structured Logging for Tool Calls

The foundation of observability is structured logs — every event is a JSON object, not a text string.

Create `src/telemetry/logger.ts`

// src/telemetry/logger.ts: Structured logger for agent events

import { randomUUID } from "node:crypto";

type LogLevel = "debug" | "info" | "warn" | "error";

export interface LogEvent {
  timestamp: string;
  level: LogLevel;
  sessionId: string;
  event: string;
  data: Record<string, unknown>;
  duration_ms?: number;
  error?: string;
}

export class AgentLogger {
  private sessionId: string;
  private startTime: number;
  // In-memory buffer for session replay (unbounded: cap or flush it for long sessions)
  private events: LogEvent[] = [];

  constructor(sessionId?: string) {
    this.sessionId = sessionId || randomUUID();
    this.startTime = Date.now();
  }

  private log(level: LogLevel, event: string, data: Record<string, unknown>, error?: string) {
    const entry: LogEvent = {
      timestamp: new Date().toISOString(),
      level,
      sessionId: this.sessionId,
      event,
      data,
      // Time elapsed since the session started, not the duration of this event
      duration_ms: Date.now() - this.startTime,
    };
    if (error) entry.error = error;

    // One JSON object per line on stdout
    console.log(JSON.stringify(entry));

    // Also keep in memory for the session
    this.events.push(entry);
  }

  info(event: string, data: Record<string, unknown> = {}) { this.log("info", event, data); }
  warn(event: string, data: Record<string, unknown> = {}) { this.log("warn", event, data); }
  error(event: string, data: Record<string, unknown> = {}, error?: string) { this.log("error", event, data, error); }
  debug(event: string, data: Record<string, unknown> = {}) { this.log("debug", event, data); }

  getEvents(): LogEvent[] { return this.events; }

  getSessionStats() {
    return {
      sessionId: this.sessionId,
      totalEvents: this.events.length,
      errors: this.events.filter((e) => e.level === "error").length,
      duration_ms: Date.now() - this.startTime,
    };
  }
}

Why structured logs matter:

# Text log: useless for analysis
[INFO] Tool called: search_issues

# Structured log: queryable, filterable, aggregatable
{"timestamp":"2026-06-13T01:23:45Z","level":"info","sessionId":"sess_abc",
 "event":"tool_call","data":{"tool":"search_issues","input":{"query":"..."},"duration_ms":340}}

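With JSON logs, you can pipe them into any log aggregator (OpenObserve, Loki, Datadog) and query them. In an SQL-style query language, for example (the field names for nested `data.*` keys depend on how your aggregator flattens them):

SELECT tool, avg(duration_ms), count(*) FROM logs
WHERE event = 'tool_call' AND level = 'error'
GROUP BY tool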
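
The same kind of aggregation can be sketched in-process, for instance as a quick check on a captured log file before an aggregator is set up. This is a minimal sketch, assuming each line is one JSON object shaped like the structured example above, with `tool` and `duration_ms` inside `data`. `summarizeToolCalls` is a hypothetical helper name, not part of any library.

```typescript
// Hypothetical helper: aggregate tool_call events from newline-delimited JSON logs.
interface ToolSummary {
  count: number;
  avgDurationMs: number;
}

function summarizeToolCalls(ndjson: string): Map<string, ToolSummary> {
  const totals = new Map<string, { count: number; totalMs: number }>();

  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;

    let entry: any;
    try {
      entry = JSON.parse(trimmed);
    } catch {
      continue; // Skip lines that are not valid JSON (e.g. stray text output)
    }

    if (entry === null || typeof entry !== "object") continue;
    if (entry.event !== "tool_call") continue;
    const tool = entry.data?.tool;
    const duration = entry.data?.duration_ms;
    if (typeof tool !== "string" || typeof duration !== "number") continue;

    const t = totals.get(tool) ?? { count: 0, totalMs: 0 };
    t.count += 1;
    t.totalMs += duration;
    totals.set(tool, t);
  }

  const summary = new Map<string, ToolSummary>();
  for (const [tool, t] of totals) {
    summary.set(tool, { count: t.count, avgDurationMs: t.totalMs / t.count });
  }
  return summary;
}

// Example input: two search_issues calls, one get_issue call, one non-JSON line.
const sampleLogs = [
  '{"level":"info","event":"tool_call","data":{"tool":"search_issues","duration_ms":300}}',
  '{"level":"info","event":"tool_call","data":{"tool":"search_issues","duration_ms":500}}',
  '{"level":"info","event":"tool_call","data":{"tool":"get_issue","duration_ms":200}}',
  "[INFO] plain text line that is ignored",
].join("\n");

const toolSummary = summarizeToolCalls(sampleLogs);
```

Lines that fail to parse are skipped rather than treated as errors, which keeps the summary usable when plain-text output is mixed into the same stream.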

Step 3: Tool Call Wrapper with Full Telemetry

This is the core — a wrapper that intercepts every tool call, tracks everything, and handles errors.

Create `src/telemetry/tool-tracer.ts`

// src/telemetry/tool-tracer.ts: Instrumented tool call wrapper

import { randomUUID } from "node:crypto";
import { AgentLogger } from "./logger.js";

/**
 * Wraps an MCP tool call with full telemetry:
 * - Input/output logging
 * - Duration tracking
 * - Token approximation
 * - Error capture
 * - Cost estimation
 */
export class ToolTracer {
  private logger: AgentLogger;

  // Approximate USD prices per 1K tokens (as of mid-2026); verify against current provider pricing
  private static readonly MODEL_PRICES: Record<string, { input: number; output: number }> = {
    "claude-sonnet-4-20250514": { input: 0.003, output: 0.015 },
    "claude-3-haiku-20240307": { input: 0.00025, output: 0.00125 },
    "gpt-4o-mini": { input: 0.00015, output: 0.0006 },
    "gpt-4o": { input: 0.0025, output: 0.01 },
    "deepseek-chat": { input: 0.00027, output: 0.0011 },
  };

  constructor(logger: AgentLogger) {
    this.logger = logger;
  }

  /**
   * Roughly estimate tokens from text by averaging two heuristics:
   * the word count and ~4 characters per token. Tuned for English;
   * CJK text is closer to ~2 characters per token and will be underestimated.
   */
  static estimateTokens(text: string): number {
    if (text.length === 0) return 0;
    const words = text.trim().split(/\s+/).length;
    const chars = text.length;
    return Math.ceil((words + chars / 4) / 2);
  }

  /**
   * Calculate estimated cost for a model call.
   */
  static estimateCost(
    model: string,
    inputTokens: number,
    outputTokens: number
  ): number {
    const prices = ToolTracer.MODEL_PRICES[model];
    if (!prices) return 0; // Unknown model: can't estimate
    return (inputTokens / 1000) * prices.input + (outputTokens / 1000) * prices.output;
  }

  /**
   * Wrap a tool call with telemetry.
   * Returns the result or throws the last error after logging full context.
   */
  async traceToolCall<T>(
    toolName: string,
    input: Record<string, unknown>,
    callFn: () => Promise<T>,
    options: {
      model?: string;
      llmInputTokens?: number;
      llmOutputTokens?: number;
      maxRetries?: number; // Retries after the first attempt (default 0)
    } = {}
  ): Promise<{ result: T; traceId: string; durationMs: number }> {
    const traceId = randomUUID().slice(0, 8);
    const startTime = Date.now();
    const serializedInput = JSON.stringify(input);

    // Log the call start
    this.logger.info("tool_call_start", {
      tool: toolName,
      traceId,
      input: serializedInput.slice(0, 1000), // Truncate for logs
      inputSize: serializedInput.length,
    });

    // Total attempts = first try + retries. Only retry tools that are safe to repeat.
    const maxAttempts = Math.max(1, (options.maxRetries ?? 0) + 1);
    let lastError: Error | null = null;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      let result: T;
      try {
        result = await callFn();
      } catch (error) {
        lastError = error instanceof Error ? error : new Error(String(error));

        if (attempt >= maxAttempts) {
          // All attempts exhausted
          this.logger.error("tool_call_failed", {
            tool: toolName,
            traceId,
            attempt,
            duration_ms: Date.now() - startTime,
          }, lastError.message);

          throw lastError;
        }

        this.logger.warn("tool_call_retry", {
          tool: toolName,
          traceId,
          attempt,
          maxAttempts,
          error: lastError.message,
        });

        // Wait before retry (exponential backoff with jitter, capped at 16 s)
        const delay = Math.min(1000 * Math.pow(2, attempt - 1) + Math.random() * 500, 16000);
        await new Promise((r) => setTimeout(r, delay));
        continue;
      }

      // Telemetry runs outside the try block so a logging failure never triggers a retry
      const durationMs = Date.now() - startTime; // Includes any backoff waits

      // Use real token counts when provided (0 is a valid count); otherwise estimate
      const inputTokens = options.llmInputTokens ?? ToolTracer.estimateTokens(serializedInput);
      const outputTokens = options.llmOutputTokens ?? (
        typeof result === "string"
          ? ToolTracer.estimateTokens(result)
          : 0
      );
      const cost = options.model
        ? ToolTracer.estimateCost(options.model, inputTokens, outputTokens)
        : 0;

      // Log success
      this.logger.info("tool_call_complete", {
        tool: toolName,
        traceId,
        attempt,
        duration_ms: durationMs,
        inputTokens,
        outputTokens,
        costUsd: parseFloat(cost.toFixed(6)),
        // JSON.stringify returns undefined for an undefined result
        resultSize: (JSON.stringify(result) ?? "").length,
      });

      return { result, traceId, durationMs };
    }

    // Unreachable: the loop either returns or throws
    throw lastError ?? new Error(`Tool call failed: ${toolName}`);
  }
}
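
To make the cost arithmetic concrete, here is a small worked example. It is a sketch that copies two entries from the price table above purely for illustration. `estimateCostUsd` is a hypothetical standalone function, and returning `undefined` for unknown models (instead of `0`) is an assumption of this sketch rather than the tracer's behavior.

```typescript
// Prices per 1K tokens, copied from the table above for illustration only.
const PRICES_PER_1K: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 0.0025, output: 0.01 },
  "gpt-4o-mini": { input: 0.00015, output: 0.0006 },
};

function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number
): number | undefined {
  const prices = PRICES_PER_1K[model];
  if (!prices) return undefined; // Unknown model: report "unknown" rather than a misleading 0
  return (inputTokens / 1000) * prices.input + (outputTokens / 1000) * prices.output;
}

// 2,000 input tokens and 500 output tokens on gpt-4o:
//   (2000 / 1000) * 0.0025 + (500 / 1000) * 0.01 = 0.005 + 0.005 = 0.01 USD
const exampleCost = estimateCostUsd("gpt-4o", 2000, 500);

// A model missing from the table yields undefined instead of 0
const unknownCost = estimateCostUsd("some-unlisted-model", 2000, 500);
```

Distinguishing "unknown" from "free" can matter on a dashboard: a cost of 0 for an unlisted model would silently understate the session total.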

Usage in an agent loop:

// Assumes `sessionId` and an initialized `mcpClient` exist in the surrounding code
const logger = new AgentLogger(sessionId);
const tracer = new ToolTracer(logger);

async function runAgent() {
  try {
    logger.info("session_start", { model: "claude-sonnet-4-20250514" });

    // Build the input once so the logged input matches what is actually sent
    const input = { query: "repo:owner/name bug", limit: 10 };
    const { result, traceId, durationMs } = await tracer.traceToolCall(
      "search_issues",
      input,
      () => mcpClient.callTool("search_issues", input),
      { model: "claude-sonnet-4-20250514", maxRetries: 3 }
    );

    logger.info("session_end", { traceId, finalDuration: durationMs });
    return result;
  } catch (error) {
    logger.error("session_failed", {}, String(error));
    throw error;
  }
}

// Run the agent, then print the session summary whether it succeeded or failed
runAgent()
  .catch(() => { process.exitCode = 1; }) // Already logged as session_failed
  .finally(() => {
    console.log("\n📊 Session Summary:");
    console.table(logger.getSessionStats());
  });
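
When retries are enabled, the tracer waits between attempts using exponential backoff with random jitter and a cap. The sketch below isolates that formula so the schedule is easy to inspect. `backoffDelayMs` is a hypothetical helper, and the injectable `jitterMs` parameter is an assumption added here so the output can be made deterministic.

```typescript
// Sketch of the backoff formula: the base delay doubles per attempt, plus jitter, capped.
function backoffDelayMs(
  attempt: number, // 1-based number of the attempt that just failed
  jitterMs: number = Math.random() * 500,
  baseMs: number = 1000,
  capMs: number = 16000
): number {
  return Math.min(baseMs * Math.pow(2, attempt - 1) + jitterMs, capMs);
}

// Deterministic schedule with jitter fixed at 0 for illustration.
const schedule = [1, 2, 3, 4, 5, 6].map((attempt) => backoffDelayMs(attempt, 0));
// schedule is [1000, 2000, 4000, 8000, 16000, 16000]
```

The jitter spreads out retries from many concurrent sessions so they do not all hit a recovering API at the same instant, and the cap keeps a long retry chain from stalling the agent for minutes.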

Step 4: Detecting Infinite Loops

Agents get stuck. It’s a fact of life. Here’s how to detect and break out.

Create `src/telemetry/loop-detector.ts`

// src/telemetry/loop-detector.ts: Detect infinite loops in agent behavior

import { createHash } from "node:crypto";

interface ToolPattern {
  tool: string;
  inputHash: string; // SHA-256 of the serialized input
  timestamp: number;
}

export class LoopDetector {
  private history: ToolPattern[] = [];
  private windowSize: number;
  private maxRepeats: number;

  /**
   * @param windowSize Number of recent calls to check for repeats
   * @param maxRepeats Number of identical calls within the window that triggers detection
   */
  constructor(windowSize: number = 10, maxRepeats: number = 3) {
    this.windowSize = windowSize;
    this.maxRepeats = maxRepeats;
  }

  private hashInput(input: Record<string, unknown>): string {
    // Note: JSON.stringify is sensitive to key order, so {a, b} and {b, a} hash differently
    return createHash("sha256").update(JSON.stringify(input)).digest("hex");
  }

  /**
   * Record a tool call and check if we're in a loop.
   * Returns true if a loop is detected.
   */
  record(tool: string, input: Record<string, unknown>): boolean {
    const pattern: ToolPattern = {
      tool,
      inputHash: this.hashInput(input),
      timestamp: Date.now(),
    };

    this.history.push(pattern);

    // Keep history to window size
    if (this.history.length > this.windowSize) {
      this.history.shift();
    }

    return this.checkLoop();
  }

  /**
   * Check if the most recent tool+input pair occurs at least maxRepeats times in the window.
   */
  private checkLoop(): boolean {
    if (this.history.length < 2) return false;

    const last = this.history[this.history.length - 1];
    const repeats = this.history.filter(
      (h) => h.tool === last.tool && h.inputHash === last.inputHash
    ).length;

    return repeats >= this.maxRepeats;
  }

  /**
   * Get a suggested action for the detected loop.
   */
  getSuggestion(): string {
    if (this.history.length === 0) return "No calls yet.";

    // Count occurrences of each tool+input pattern
    const patternCounts = new Map<string, number>();
    for (const h of this.history) {
      const key = `${h.tool}:${h.inputHash}`;
      patternCounts.set(key, (patternCounts.get(key) || 0) + 1);
    }

    // Sort by frequency, most repeated first
    const sorted = [...patternCounts.entries()].sort((a, b) => b[1] - a[1]);
    const [key, count] = sorted[0];

    if (count >= this.maxRepeats) {
      // The hash is hex, so the last ":" separates the tool name from the hash
      const tool = key.slice(0, key.lastIndexOf(":"));
      return `⚠️ Loop detected! Tool "${tool}" called ${count} times with same input. Consider: different search terms, breaking the task into smaller steps, or escalating to a human.`;
    }

    return "No loop detected.";
  }
}
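
One caveat with hashing `JSON.stringify(input)` directly is that the result depends on key order, so two logically identical inputs can produce different hashes and slip past the detector. A possible mitigation, sketched below under the assumption that inputs are plain JSON-like values (no `Date`, `Map`, or class instances), is to serialize with sorted keys before hashing. `stableStringify` is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: serialize with object keys sorted so that key order does not matter.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") {
    // JSON.stringify returns undefined for undefined and functions; normalize to "null"
    return JSON.stringify(value) ?? "null";
  }
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  const obj = value as Record<string, unknown>;
  const parts = Object.keys(obj)
    .sort()
    .filter((key) => obj[key] !== undefined) // Match JSON.stringify: drop undefined properties
    .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
  return `{${parts.join(",")}}`;
}

const serializedA = stableStringify({ query: "is:open bug", limit: 10 });
const serializedB = stableStringify({ limit: 10, query: "is:open bug" });
// serializedA and serializedB are both '{"limit":10,"query":"is:open bug"}'
```

Feeding the output of such a function into the SHA-256 hash, instead of the raw `JSON.stringify` result, would make the detector treat reordered arguments as the same call.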

Integration into the agent loop:

// Assumes `logger` is the AgentLogger for the current session
class AgentLoopError extends Error {}

const loopDetector = new LoopDetector(15, 4);

async function safeAgentStep<T>(tool: string, input: Record<string, unknown>, callFn: () => Promise<T>): Promise<T> {
  if (loopDetector.record(tool, input)) {
    logger.warn("loop_detected", { tool, input, suggestion: loopDetector.getSuggestion() });
    throw new AgentLoopError(`Infinite loop detected on tool: ${tool}`);
  }
  return callFn();
}

Step 5: Exporting Metrics to Prometheus

Create `src/telemetry/metrics.ts`

// src/telemetry/metrics.ts: Prometheus metrics for agent runtime

export class AgentMetrics {
  // In-memory counters, exposed via the /metrics endpoint
  private counters: Map<string, number> = new Map();
  private histograms: Map<string, number[]> = new Map();

  increment(counter: string, value: number = 1) {
    this.counters.set(counter, (this.counters.get(counter) || 0) + value);
  }

  recordDuration(metric: string, durationMs: number) {
    let samples = this.histograms.get(metric);
    if (!samples) {
      samples = [];
      this.histograms.set(metric, samples);
    }
    samples.push(durationMs);

    // Keep only the last 1000 samples
    if (samples.length > 1000) samples.shift();
  }

  // Escape backslash, double quote and newline in label values
  private static escapeLabel(value: string): string {
    return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
  }

  /**
   * Expose as Prometheus-format text.
   */
  toPrometheus(): string {
    const lines: string[] = [];

    lines.push("# HELP agent_tool_calls_total Total tool calls");
    lines.push("# TYPE agent_tool_calls_total counter");
    let totalCalls = 0;
    for (const [key, value] of this.counters) {
      if (!key.startsWith("tool_call:")) continue;
      totalCalls += value;
      const tool = AgentMetrics.escapeLabel(key.slice("tool_call:".length));
      lines.push(`agent_tool_calls_total{tool="${tool}"} ${value}`);
    }

    // Error rate (a convenience gauge computed from the in-memory counters)
    const totalErrors = this.counters.get("errors") || 0;
    lines.push("# HELP agent_error_rate Error rate (0-1)");
    lines.push("# TYPE agent_error_rate gauge");
    lines.push(`agent_error_rate ${totalCalls > 0 ? (totalErrors / totalCalls).toFixed(4) : 0}`);

    // Duration percentiles: the sample value is the duration, the quantile is a label
    lines.push("# HELP agent_tool_duration_ms Tool call duration percentiles in milliseconds");
    lines.push("# TYPE agent_tool_duration_ms gauge");
    for (const [key, values] of this.histograms) {
      if (values.length === 0) continue;
      const sorted = [...values].sort((a, b) => a - b);
      const tool = AgentMetrics.escapeLabel(key.replace(/^tool:/, ""));
      for (const q of [0.5, 0.95, 0.99]) {
        const value = sorted[Math.floor(sorted.length * q)];
        lines.push(`agent_tool_duration_ms{tool="${tool}",quantile="${q}"} ${value}`);
      }
    }

    // LLM cost (stays 0 until something calls increment("total_cost_usd", cost))
    const totalCost = this.counters.get("total_cost_usd") || 0;
    lines.push("# HELP agent_total_cost_usd Total LLM cost in USD");
    lines.push("# TYPE agent_total_cost_usd counter");
    lines.push(`agent_total_cost_usd ${totalCost.toFixed(6)}`);

    // The exposition format expects a trailing newline
    return lines.join("\n") + "\n";
  }
}
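
The percentile lookup used for the duration metrics can be checked in isolation. The sketch below uses the same floor-index rule on a small, made-up sample set. `percentile` is a hypothetical standalone function and the durations are illustrative values, not measurements.

```typescript
// Floor-index percentile over a sorted copy of the samples (the input is not mutated).
function percentile(samples: number[], q: number): number | undefined {
  if (samples.length === 0) return undefined;
  const sorted = [...samples].sort((x, y) => x - y);
  // Index floor(n * q), clamped to the last element so q = 1 stays in bounds
  const index = Math.min(Math.floor(sorted.length * q), sorted.length - 1);
  return sorted[index];
}

// Ten illustrative durations in milliseconds.
const durationsMs = [120, 340, 95, 210, 890, 150, 300, 180, 260, 400];

// Sorted: 95, 120, 150, 180, 210, 260, 300, 340, 400, 890
const p50 = percentile(durationsMs, 0.5);  // index floor(10 * 0.5) = 5, value 260
const p95 = percentile(durationsMs, 0.95); // index floor(10 * 0.95) = 9, value 890
```

With only a handful of samples, high percentiles collapse onto the maximum, so p95 and p99 become meaningful only once a tool has been called many times.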

Add a metrics endpoint to your SSE server:

// Assumes `app` is the Express app and `metrics` is the shared AgentMetrics instance
app.get("/metrics", (req, res) => {
  res.set("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
  res.send(metrics.toPrometheus());
});

Step 6: Session Replay (Debug After the Fact)

When something goes wrong, you need to replay what happened. The in-memory event buffer lets you do exactly that.

Create `src/telemetry/session-store.ts`

// src/telemetry/session-store.ts: Session event storage with TTL

import { randomUUID } from "node:crypto";
import { AgentLogger } from "./logger.js";

interface SessionRecord {
  id: string;
  logger: AgentLogger;
  createdAt: number;
  lastActivity: number;
}

export class SessionStore {
  private sessions: Map<string, SessionRecord> = new Map();
  private ttlMs: number;
  private cleanupInterval: ReturnType<typeof setInterval>;

  constructor(ttlMs: number = 30 * 60 * 1000) { // 30 min TTL
    this.ttlMs = ttlMs;
    // Clean stale sessions every 5 minutes
    this.cleanupInterval = setInterval(() => this.cleanup(), 5 * 60 * 1000);
    // Don't let the cleanup timer keep the Node.js process alive on its own
    this.cleanupInterval.unref();
  }

  create(sessionId?: string): AgentLogger {
    // Generate the ID here so we don't need to reach into the logger's private field
    const id = sessionId ?? randomUUID();
    const logger = new AgentLogger(id);
    const now = Date.now();
    this.sessions.set(id, { id, logger, createdAt: now, lastActivity: now });
    return logger;
  }

  get(sessionId: string): AgentLogger | undefined {
    const record = this.sessions.get(sessionId);
    if (record) {
      record.lastActivity = Date.now();
      return record.logger;
    }
    return undefined;
  }

  getSessionEvents(sessionId: string) {
    const logger = this.get(sessionId);
    return logger?.getEvents() || [];
  }

  getActiveSessions(): number {
    return this.sessions.size;
  }

  /**
   * Expose session data for admin dashboard.
   */
  getDashboardData() {
    const sessions = [...this.sessions.values()];
    return {
      activeSessions: sessions.length,
      sessions: sessions.map((s) => ({
        id: s.id,
        createdAt: new Date(s.createdAt).toISOString(),
        lastActivity: new Date(s.lastActivity).toISOString(),
        ageMin: ((Date.now() - s.createdAt) / 60000).toFixed(1),
        events: s.logger.getEvents().length,
        errors: s.logger.getEvents().filter((e) => e.level === "error").length,
        stats: s.logger.getSessionStats(),
      })),
    };
  }

  /**
   * Remove sessions with no activity for longer than the TTL.
   */
  private cleanup() {
    const now = Date.now();
    for (const [id, record] of this.sessions) {
      if (now - record.lastActivity > this.ttlMs) {
        this.sessions.delete(id);
      }
    }
  }

  stop() {
    clearInterval(this.cleanupInterval);
  }
}
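
Once the events for a session are available, a replay can be as simple as rendering them as a timeline relative to the first event. The following is a minimal sketch. `ReplayEvent` mirrors only the fields of the logged events that the replay needs, and `formatTimeline` is a hypothetical helper, not part of the code above.

```typescript
// Minimal event shape, mirroring the logged fields that the replay needs.
interface ReplayEvent {
  timestamp: string; // ISO 8601
  level: "debug" | "info" | "warn" | "error";
  event: string;
  error?: string;
}

// Hypothetical helper: render events as "+<ms since first event> LEVEL event (error)".
function formatTimeline(events: ReplayEvent[]): string[] {
  const first = events[0];
  if (!first) return [];
  const t0 = Date.parse(first.timestamp);
  return events.map((e) => {
    const offsetMs = Date.parse(e.timestamp) - t0;
    const suffix = e.error ? ` (${e.error})` : "";
    return `+${offsetMs}ms ${e.level.toUpperCase()} ${e.event}${suffix}`;
  });
}

const timeline = formatTimeline([
  { timestamp: "2026-06-13T01:23:45.000Z", level: "info", event: "tool_call_start" },
  { timestamp: "2026-06-13T01:23:45.340Z", level: "warn", event: "tool_call_retry" },
  { timestamp: "2026-06-13T01:23:46.900Z", level: "error", event: "tool_call_failed", error: "timeout" },
]);
// timeline:
//   "+0ms INFO tool_call_start"
//   "+340ms WARN tool_call_retry"
//   "+1900ms ERROR tool_call_failed (timeout)"
```

Relative offsets make gaps stand out: a long pause between a retry and the final failure usually points at backoff waits or a slow upstream API.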

Step 7: Putting It All Together

Here’s the complete integration into an SSE MCP server:

`src/server-with-telemetry.ts`

import { ToolTracer } from "./telemetry/tool-tracer.js";
import { LoopDetector } from "./telemetry/loop-detector.js";
import { AgentMetrics } from "./telemetry/metrics.js";
import { SessionStore } from "./telemetry/session-store.js";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { createSSEServer } from "./sse-server.js";
import { loadConfig } from "./env.js";

const config = loadConfig();
const server = new McpServer({ name: "github-issue-manager", version: "1.0.2" });
const metrics = new AgentMetrics();
const sessions = new SessionStore();

// Single shared session for simplicity; key these by real session ID in a multi-client server
const SESSION_ID = "current-session";
// The detector must outlive individual calls, otherwise its history never holds more than one entry
const loopDetector = new LoopDetector();

// ──── Instrumented tool registration helper ────

function instrumentedTool(
  name: string,
  description: string,
  schema: any,
  handler: (args: any) => Promise<any>
) {
  server.tool(name, description, schema, async (args) => {
    // Create the session under the same ID we look it up with
    const logger = sessions.get(SESSION_ID) || sessions.create(SESSION_ID);
    const tracer = new ToolTracer(logger);

    metrics.increment(`tool_call:${name}`);

    // Loop detection
    if (loopDetector.record(name, args)) {
      metrics.increment("errors");
      logger.warn("loop_detected", { tool: name });
      return {
        content: [{ type: "text", text: `⚠️ Loop detected: "${name}" was called repeatedly with the same inputs. Try a different query or break this into smaller steps.` }],
        isError: true,
      };
    }

    try {
      const { result, durationMs } = await tracer.traceToolCall(
        name, args, () => handler(args), { maxRetries: 2 }
      );
      metrics.recordDuration(`tool:${name}`, durationMs);
      return result;
    } catch (error) {
      metrics.increment("errors");
      metrics.increment(`error:${name}`);
      return {
        content: [{ type: "text", text: `Error calling ${name}: ${error}` }],
        isError: true,
      };
    }
  });
}

// Register all tools with instrumentation
instrumentedTool("list_issues", "...", { /*schema*/ }, async (args) => { /*handler*/ });
instrumentedTool("get_issue", "...", { /*schema*/ }, async (args) => { /*handler*/ });
// ... etc for all tools

// ──── Metrics + Admin endpoints ────
// Protect /admin/* with authentication before exposing it beyond localhost

const instance = createSSEServer(server, config.port, (app) => {
  app.get("/metrics", (req, res) => {
    res.set("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
    res.send(metrics.toPrometheus());
  });

  app.get("/admin/sessions", (req, res) => {
    res.json(sessions.getDashboardData());
  });

  app.get("/admin/sessions/:id", (req, res) => {
    const events = sessions.getSessionEvents(req.params.id);
    if (events.length === 0) {
      return res.status(404).json({ error: "Session not found" });
    }
    res.json(events);
  });
});

process.on("SIGTERM", async () => { sessions.stop(); await instance.shutdown(); });
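
Loop detection only works if the detector's history survives across calls within a session, and in a server with several clients each session needs its own history. One way to arrange that, sketched here under assumptions, is a small registry that lazily creates one state object per session ID. `PerSessionRegistry` is a hypothetical helper, and `CallCounter` is a stand-in for `LoopDetector` so the example stays self-contained.

```typescript
// Hypothetical helper: lazily create and reuse one state object per session ID.
class PerSessionRegistry<T> {
  private items = new Map<string, T>();
  private factory: () => T;

  constructor(factory: () => T) {
    this.factory = factory;
  }

  get(sessionId: string): T {
    let item = this.items.get(sessionId);
    if (item === undefined) {
      item = this.factory();
      this.items.set(sessionId, item);
    }
    return item;
  }

  // Call this when a session ends or expires so state does not accumulate
  delete(sessionId: string): boolean {
    return this.items.delete(sessionId);
  }
}

// Stand-in for LoopDetector: counts calls so that state retention is visible.
class CallCounter {
  calls = 0;
  record(): number {
    this.calls += 1;
    return this.calls;
  }
}

const detectors = new PerSessionRegistry(() => new CallCounter());
detectors.get("session-a").record();
detectors.get("session-a").record();
const countA = detectors.get("session-a").record(); // 3: the same instance is reused
const countB = detectors.get("session-b").record(); // 1: separate state per session
```

In the server above, the factory would return a new `LoopDetector`, and the registry entry would be deleted when the session store expires the session.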

Step 8: Running and Visualizing

Start the instrumented server:

npm run build
export GITHUB_TOKEN="ghp_..."
node build/server-with-telemetry.js

Check live metrics:

curl http://localhost:3001/metrics

# HELP agent_tool_calls_total Total tool calls
# TYPE agent_tool_calls_total counter
agent_tool_calls_total{tool="search_issues"} 12
agent_tool_calls_total{tool="get_issue"} 5
agent_tool_calls_total{tool="create_issue"} 2
# HELP agent_error_rate Error rate (0-1)
# TYPE agent_error_rate gauge
agent_error_rate 0.0526
# HELP agent_total_cost_usd Total LLM cost in USD
# TYPE agent_total_cost_usd counter
agent_total_cost_usd 0.042310
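
For a quick smoke test without a full Prometheus setup, the scraped text can be parsed and checked against a threshold directly. This is a rough sketch, assuming simple `name value` or `name{labels} value` lines with no timestamps. `parseSamples` is a hypothetical helper, and the 10% threshold mirrors the one in the production checklist at the end of this post.

```typescript
// Hypothetical helper: parse simple Prometheus text-format samples into a map.
// Handles lines of the form `name value` or `name{labels} value`; no timestamps.
function parseSamples(text: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // Skip blanks, HELP and TYPE lines
    const lastSpace = line.lastIndexOf(" ");
    if (lastSpace === -1) continue;
    const key = line.slice(0, lastSpace).trim();
    const value = Number(line.slice(lastSpace + 1));
    if (!Number.isNaN(value)) samples.set(key, value);
  }
  return samples;
}

const scraped = parseSamples(
  [
    "# HELP agent_tool_calls_total Total tool calls",
    "# TYPE agent_tool_calls_total counter",
    'agent_tool_calls_total{tool="search_issues"} 12',
    "# TYPE agent_error_rate gauge",
    "agent_error_rate 0.0526",
  ].join("\n")
);

const errorRate = scraped.get("agent_error_rate") ?? 0;
const shouldAlert = errorRate > 0.1; // false here: 5.26% is below the 10% threshold
```

In a real deployment this check would more likely live in a Prometheus alerting rule; the sketch is only meant for local verification that the endpoint emits parseable output.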

Grafana dashboard:

┌─────────────────────────────────────┐
│  Active Sessions: 3                 │
│  Total Tool Calls: 47               │
│  Error Rate: 5.3%                   │
│  Total Cost: $0.14 today            │
├─────────────────────────────────────┤
│  Tool Call Duration (p95)           │
│  ┌─────────────────────────────┐    │
│  │ search_issues: 340ms █████  │    │
│  │ get_issue:     210ms ████   │    │
│  │ create_issue:  890ms ██████ │    │
│  └─────────────────────────────┘    │
├─────────────────────────────────────┤
│  Recent Errors                      │
│  - 12:34:56 search_issues timeout   │
│  - 12:35:10 create_issue 403        │
│  - 12:36:02 loop detected (!)       │
└─────────────────────────────────────┘

Summary

| Concept | Implementation | Purpose |
| --- | --- | --- |
| Structured logs | `AgentLogger` → JSON per event | Queryable history |
| Tool tracing | `ToolTracer` → duration + cost | Performance + cost |
| Loop detection | `LoopDetector` → pattern matching | Prevent infinite loops |
| Metrics | `AgentMetrics` → Prometheus format | Real-time monitoring |
| Session store | `SessionStore` → TTL-based | Debug after the fact |
| Admin API | `/metrics`, `/admin/sessions` | Dashboard integration |

Production checklist:

- All tool calls wrapped with `ToolTracer`
- Loop detection enabled (max 3-5 repeats)
- Prometheus metrics exposed on `/metrics`
- Session store TTL configured (default 30 min)
- Logs forwarded to a centralized aggregator
- Error rate alert configured (target: error rate below 10%)
- Cost tracking per session

| Day | Topic |
| --- | --- |
| 1 | Observability & Telemetry ✅ |
| 2 | Caching Strategies |
| 3 | Error Handling & Resilience |
| 4 | A/B Testing Prompts & Configs |
| 5 | Multi-Region & High Availability |
| 6 | Building an Internal Agent Platform |

Series: AI Agents in Production. Day 1: Instrument every tool call, track costs, detect loops, and export metrics to Prometheus. Full TypeScript source code included.