Observability for AI-Powered Products

The three pillars of observability — metrics, traces, and logs — were designed for deterministic systems. You send a request, you get a response, you measure latency, error rate, and throughput. The system either worked or it didn't.

AI-powered products introduce a fourth dimension: quality. The system can be fast, available, and error-free while still producing outputs that are wrong, inconsistent, or subtly degraded. Traditional observability instruments don't capture this.

Here's how to instrument AI systems properly.

The Standard Layer Still Applies

Before the AI-specific instrumentation, get the basics right:

// Trace every LLM call as a span
const span = tracer.startSpan("llm.completion", {
  attributes: {
    "llm.provider": "openai",
    "llm.model": "gpt-4o",
    "llm.feature": "chat-response",
    "tenant.id": tenantId,
  },
});

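try {
  const response = await openai.chat.completions.create(params);

  // usage is optional on the response type, so read it defensively
  span.setAttributes({
    "llm.input_tokens": response.usage?.prompt_tokens,
    "llm.output_tokens": response.usage?.completion_tokens,
    "llm.total_tokens": response.usage?.total_tokens,
    "llm.finish_reason": response.choices[0]?.finish_reason,
  });

  return response;
} catch (err) {
  // Assumes an OpenTelemetry-style span; without this, failed calls look like successes in the trace
  span.recordException(err);
  throw err;
} finally {
  span.end();
}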

This gives you latency, token usage, and finish reason for every call. This is the floor, not the ceiling.

Token Budget Telemetry

Token usage is simultaneously a cost signal, a quality signal, and a security signal.

// Track token distributions over time
metrics.histogram("llm.tokens.input", {
  value: inputTokens,
  labels: { model, feature, tenantId },
});

metrics.histogram("llm.tokens.output", {
  value: outputTokens,
  labels: { model, feature, tenantId },
});

// Warn when a single request exceeds a fixed input-token budget
if (inputTokens > BUDGET_WARNING_THRESHOLD) {
  logger.warn("Token budget warning", {
    inputTokens,
    threshold: BUDGET_WARNING_THRESHOLD,
    userId,
    requestId,
  });
}

Anomalous token usage can indicate prompt injection attempts that inflate the context, agentic loops with runaway tool calls, or a shift in user behavior such as longer queries.
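
A fixed threshold only catches the largest requests. One way to flag spikes relative to normal traffic is a rolling z-score over recent token counts. The sketch below is a minimal illustration under assumptions: the TokenAnomalyDetector class and its parameters (windowSize, zThreshold, minSamples) are hypothetical names invented for this example, and the defaults are not tuned values.

```typescript
// Illustrative rolling z-score detector for per-request token counts.
// TokenAnomalyDetector and its defaults are hypothetical; tune them to your traffic.
class TokenAnomalyDetector {
  private readonly window: number[] = [];
  private readonly windowSize: number;
  private readonly zThreshold: number;
  private readonly minSamples: number;

  constructor(windowSize = 500, zThreshold = 3, minSamples = 30) {
    this.windowSize = windowSize; // number of recent requests that form the baseline
    this.zThreshold = zThreshold; // standard deviations above the mean that count as a spike
    this.minSamples = minSamples; // stay quiet until the baseline is meaningful
  }

  // Scores `tokens` against the current baseline, then adds it to the window.
  observe(tokens: number): boolean {
    const anomalous = this.isAnomalous(tokens);
    this.window.push(tokens);
    if (this.window.length > this.windowSize) this.window.shift();
    return anomalous;
  }

  private isAnomalous(tokens: number): boolean {
    const n = this.window.length;
    if (n < this.minSamples) return false;
    const mean = this.window.reduce((sum, v) => sum + v, 0) / n;
    const variance = this.window.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
    // Floor the deviation at 1 token so a perfectly flat baseline cannot divide by zero.
    const stdDev = Math.max(Math.sqrt(variance), 1);
    return (tokens - mean) / stdDev > this.zThreshold;
  }
}
```

Keeping one detector per feature avoids comparing a summarization endpoint against a short chat endpoint. Because each observation enters the window, a sustained shift gradually becomes the new baseline, which may or may not be what you want.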

Semantic Quality Signals

This is where AI observability diverges from standard observability.

For tasks with ground truth, track accuracy directly. For tasks without ground truth (open-ended generation), you need proxy signals:

User feedback signals:

Thumbs up/down ratings
Correction rate (user edited AI output)
Regeneration rate (user asked for a different response)
Escalation rate (user abandoned AI to contact human support)

Structural signals:

Response length distribution (are responses getting shorter/longer?)
Refusal rate (how often does the model decline to answer?)
Confidence scores (for example token log probabilities, where the provider exposes them)
Finish reason distribution (stop vs length vs content_filter)
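
The structural signals can be computed from the same per-call records you already collect. The following is a rough sketch, not a prescribed schema: the CompletionRecord shape and the summarizeStructure function are hypothetical, and the refused flag is assumed to come from whatever refusal classifier you run upstream.

```typescript
// Record shape is hypothetical; `refused` is assumed to be set by your own refusal classifier.
interface CompletionRecord {
  finishReason: string; // e.g. "stop", "length", "content_filter"
  outputTokens: number;
  refused: boolean;
}

interface StructuralSummary {
  total: number;
  finishReasonShare: Record<string, number>; // fraction of responses per finish reason
  refusalRate: number;
  meanOutputTokens: number;
}

function summarizeStructure(records: CompletionRecord[]): StructuralSummary {
  const total = records.length;
  const counts: Record<string, number> = {};
  let refusals = 0;
  let outputTokens = 0;

  for (const r of records) {
    counts[r.finishReason] = (counts[r.finishReason] ?? 0) + 1;
    if (r.refused) refusals += 1;
    outputTokens += r.outputTokens;
  }

  // Convert raw counts to shares so windows of different sizes are comparable.
  const finishReasonShare: Record<string, number> = {};
  for (const [reason, count] of Object.entries(counts)) {
    finishReasonShare[reason] = count / total;
  }

  return {
    total,
    finishReasonShare,
    refusalRate: total > 0 ? refusals / total : 0,
    meanOutputTokens: total > 0 ? outputTokens / total : 0,
  };
}
```

Run it over fixed time windows (hourly or daily) and chart the results; a rising share of "length" finishes, for instance, suggests responses are being truncated.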

// Record user feedback events
await analytics.track("ai.feedback", {
  requestId,
  featureId,
  tenantId,
  sentiment: "negative", // 'positive' | 'negative' | 'neutral'
  feedbackType: "regeneration", // 'thumbs_down' | 'regeneration' | 'correction' | 'escalation'
  modelVersion: currentModelVersion,
  promptHash: hashPromptTemplate(systemPrompt),
});
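
Raw feedback events only become a signal once they are turned into rates. Here is one possible aggregation, assuming you know how many AI responses were served in the same window. The feedbackRates function and the FeedbackEvent shape are illustrative names for this sketch; only the feedback type values come from the event above.

```typescript
type FeedbackType = "thumbs_down" | "regeneration" | "correction" | "escalation";

// Hypothetical minimal event shape; real events carry more fields.
interface FeedbackEvent {
  requestId: string;
  feedbackType: FeedbackType;
}

// Returns, per feedback type, the fraction of served responses that received it.
function feedbackRates(
  events: FeedbackEvent[],
  totalResponses: number,
): Record<FeedbackType, number> {
  // Count each request at most once per type so repeated clicks do not inflate the rate.
  const seen: Record<FeedbackType, Set<string>> = {
    thumbs_down: new Set<string>(),
    regeneration: new Set<string>(),
    correction: new Set<string>(),
    escalation: new Set<string>(),
  };
  for (const e of events) seen[e.feedbackType].add(e.requestId);

  const rate = (count: number): number =>
    totalResponses > 0 ? count / totalResponses : 0;

  return {
    thumbs_down: rate(seen.thumbs_down.size),
    regeneration: rate(seen.regeneration.size),
    correction: rate(seen.correction.size),
    escalation: rate(seen.escalation.size),
  };
}
```

Dividing by responses served, rather than by feedback events received, keeps the rates comparable across features where users give feedback more or less often.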

Prompt Template Versioning

When a provider updates a model, your prompts may behave differently. When you update a prompt, outputs change. You need to track both.

// Version your prompt templates
const PROMPT_VERSION = "chat-v2.3.1";

// Include version in telemetry
span.setAttribute("llm.prompt_template_version", PROMPT_VERSION);
span.setAttribute("llm.model_version", modelVersion); // if the provider reports a pinned version, e.g. the model field in the response

Correlating prompt version + model version with quality metrics lets you:

Detect when a silent model update degraded outputs
A/B test prompt changes with measurable quality impact
Roll back prompt changes when metrics regress
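
To decide whether a difference between two prompt versions is more than noise, one common approach is a two-proportion z-test on the negative-feedback rate. The sketch below assumes you have already counted responses and negative feedback per version; VariantStats, twoProportionZ, and hasRegressed are hypothetical names, and the default critical value of 1.645 corresponds to a one-sided test at the 5% level.

```typescript
// Hypothetical per-variant counts, e.g. grouped by prompt version and model version.
interface VariantStats {
  label: string;
  responses: number;
  negativeFeedback: number;
}

// z-score for "candidate has a higher negative-feedback rate than baseline".
function twoProportionZ(baseline: VariantStats, candidate: VariantStats): number {
  if (baseline.responses === 0 || candidate.responses === 0) return 0;
  const p1 = baseline.negativeFeedback / baseline.responses;
  const p2 = candidate.negativeFeedback / candidate.responses;
  const pooled =
    (baseline.negativeFeedback + candidate.negativeFeedback) /
    (baseline.responses + candidate.responses);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / baseline.responses + 1 / candidate.responses),
  );
  return standardError === 0 ? 0 : (p2 - p1) / standardError;
}

// True when the candidate is worse than the baseline beyond the chosen critical value.
function hasRegressed(
  baseline: VariantStats,
  candidate: VariantStats,
  zCritical = 1.645,
): boolean {
  return twoProportionZ(baseline, candidate) > zCritical;
}
```

The same comparison works for a silent model update: hold the prompt version fixed and treat the old and new model versions as the two variants.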

Distributed Tracing Across the AI Pipeline

Agentic workflows span multiple LLM calls, tool invocations, and external system calls. Trace them end-to-end.

// Parent span for the full agentic workflow
const workflowSpan = tracer.startSpan("agent.workflow", {
  attributes: { "workflow.type": "document-analysis", "user.id": userId },
});

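// Child spans for each step. How the parent is passed depends on your tracing SDK;
// OpenTelemetry JS, for example, takes a Context as the third argument to startSpan.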
const retrievalSpan = tracer.startSpan("rag.retrieval", {
  parent: workflowSpan,
});
// ... vector search ...
retrievalSpan.end();

const llmSpan = tracer.startSpan("llm.completion", { parent: workflowSpan });
// ... LLM call with retrieved context ...
llmSpan.end();

const toolSpan = tracer.startSpan("agent.tool.execution", {
  parent: workflowSpan,
});
// ... tool call ...
toolSpan.setAttribute("tool.name", "web_search");
toolSpan.setAttribute("tool.safe", true);
toolSpan.end();

workflowSpan.end();

This creates a complete trace of the AI decision chain, which is essential for debugging unexpected behavior and auditing AI actions.
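
Manually calling end() on every span is easy to get wrong when a step throws. A small wrapper can guarantee that each span is closed and that failures are recorded. The sketch below uses a hypothetical InMemoryTracer as a stand-in for a real tracing SDK so that it runs on its own; the withSpan helper and the RecordedSpan shape are assumptions for this example, not any library's API.

```typescript
// Minimal in-memory stand-in for a tracing SDK (hypothetical, for illustration only).
interface RecordedSpan {
  id: number;
  name: string;
  parentId: number | null;
  attributes: Record<string, string | number | boolean>;
  error?: string;
  ended: boolean;
}

class InMemoryTracer {
  readonly spans: RecordedSpan[] = [];

  startSpan(name: string, parent?: RecordedSpan): RecordedSpan {
    const span: RecordedSpan = {
      id: this.spans.length + 1,
      name,
      parentId: parent ? parent.id : null,
      attributes: {},
      ended: false,
    };
    this.spans.push(span);
    return span;
  }
}

// Runs `fn` inside a span, records any error on it, and always ends the span.
async function withSpan<T>(
  tracer: InMemoryTracer,
  name: string,
  parent: RecordedSpan | undefined,
  fn: (span: RecordedSpan) => Promise<T>,
): Promise<T> {
  const span = tracer.startSpan(name, parent);
  try {
    return await fn(span);
  } catch (err) {
    span.error = err instanceof Error ? err.message : String(err);
    throw err; // rethrow so the caller still sees the failure
  } finally {
    span.ended = true;
  }
}
```

Each workflow step then becomes a nested withSpan call that passes the workflow span as the parent, so a failed tool call leaves a closed, error-tagged span instead of a dangling one.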

Cost Attribution

For multi-tenant products, you need per-tenant cost attribution:

// Track costs per tenant per feature per model
await costTracker.record({
  tenantId,
  featureId: "ai-chat",
  model: "gpt-4o",
  inputTokens,
  outputTokens,
  estimatedCost: calculateCost(inputTokens, outputTokens, "gpt-4o"),
  timestamp: Date.now(),
});
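
The calculateCost helper is not shown above. A minimal version might look like the following, assuming pricing is expressed in USD per million tokens. The prices and the "example-model" name are placeholders invented for this sketch, not real provider rates; load actual values from your provider's current price list.

```typescript
interface ModelPricing {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Placeholder prices for illustration only; these are NOT real provider rates.
const PRICING: Record<string, ModelPricing> = {
  "example-model": { inputPerMillion: 2, outputPerMillion: 8 },
};

function calculateCost(inputTokens: number, outputTokens: number, model: string): number {
  const pricing = PRICING[model];
  if (!pricing) {
    // Fail loudly: silently recording zero cost would hide unpriced models.
    throw new Error(`No pricing configured for model "${model}"`);
  }
  return (
    (inputTokens * pricing.inputPerMillion + outputTokens * pricing.outputPerMillion) /
    1_000_000
  );
}
```

Throwing on an unknown model is a deliberate choice here: a missing price entry surfaces immediately rather than showing up later as a tenant with suspiciously low spend.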

Store this in ClickHouse or a time-series database. Query it for:

Per-tenant monthly cost reports
Feature profitability analysis
Budget alerts
Abuse detection (unusually high cost per user)
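
In production these queries would normally run in the database itself. To make the logic concrete, here is an in-memory sketch of the monthly rollup and the budget check. The CostRecord shape mirrors the fields recorded above; monthlyCostByTenant, tenantsOverBudget, and the budgets map are hypothetical names for this example.

```typescript
interface CostRecord {
  tenantId: string;
  featureId: string;
  estimatedCost: number;
  timestamp: number; // epoch milliseconds, as produced by Date.now()
}

type MonthlyTotals = Record<string, Record<string, number>>; // tenantId -> "YYYY-MM" -> cost

// Equivalent of GROUP BY tenant, month with SUM(estimatedCost).
function monthlyCostByTenant(records: CostRecord[]): MonthlyTotals {
  const totals: MonthlyTotals = {};
  for (const r of records) {
    const month = new Date(r.timestamp).toISOString().slice(0, 7); // "YYYY-MM" in UTC
    const byMonth = (totals[r.tenantId] ??= {});
    byMonth[month] = (byMonth[month] ?? 0) + r.estimatedCost;
  }
  return totals;
}

// Tenants whose spend in `month` exceeds their configured budget, sorted by id.
function tenantsOverBudget(
  totals: MonthlyTotals,
  month: string,
  budgets: Record<string, number>,
): string[] {
  return Object.keys(totals)
    .filter((tenantId) => {
      const spent = totals[tenantId]?.[month] ?? 0;
      const budget = budgets[tenantId];
      return budget !== undefined && spent > budget;
    })
    .sort();
}
```

Tenants without a configured budget are skipped rather than flagged, which keeps the alert quiet for accounts that have not been given a limit yet.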

The Dashboard You Actually Need

A useful AI observability dashboard has:

Latency percentiles (P50, P95, P99) by model and feature
Token usage over time with anomaly bands
Cost per feature with trend lines
Quality proxy metrics (feedback rates, regeneration rates)
Error rate and finish reason breakdown
Prompt version performance comparison
Per-tenant cost heatmap
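
Most metrics backends compute percentiles for you, but it helps to know what the numbers mean. As a reference point, here is the nearest-rank method over raw latency samples; the percentile function name and its error handling are choices made for this sketch, and other percentile definitions (such as linear interpolation) will give slightly different values on small samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty sample is undefined");
  if (p <= 0 || p > 100) throw new RangeError("p must be in the range (0, 100]");
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, input left untouched
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}
```

Calling percentile(latenciesMs, 95) on the samples for one model and feature gives the P95 for that slice; group the samples first, then compute the percentile per group, since percentiles cannot be averaged across groups.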

Without this, you are running AI in production blind. With it, you have the feedback loops needed to improve quality, control costs, and catch issues before they affect users.

Build the instrumentation first. You'll need it.