Why Metadata-First Observability is the Right Approach — FORG Blog

The False Premise of Payload Logging

When most engineering teams first think about observing their AI usage, the instinct is to log everything: prompts, completions, tool calls, the whole conversation. The reasoning is intuitive — if something goes wrong, you want a full replay. If costs spike, you want to know why. If a developer says something they shouldn't, you want a record.

This reasoning has a fundamental problem: the things you actually need to know for operations, cost management, and compliance don't require payload content. And the things that payload content enables — full replays, content auditing — create legal, privacy, and security liabilities that far outweigh their operational value.

What Metadata Actually Tells You

Let's be concrete. Here's what's in a FORG signal:

{
  "ts": 1716840000000,          // When
  "session_id": "sess_01hwx",   // Which session
  "adapter": "claude-code",     // Which tool
  "model": "claude-sonnet-4-5", // Which model
  "tokens": {
    "input": 2847,              // How much context
    "output": 412,              // How much output
    "cache_read": 1200,         // How much was cached
    "cache_write": 0
  },
  "cost_usd": 0.00892,         // What it cost
  "latency_ms": {
    "ttft": 312,                // How fast first token
    "total": 1847               // How long total
  },
  "dimensions": {
    "user": "alice@co.com",     // Who
    "project": "backend-api",   // What project
    "environment": "dev"        // What environment
  }
}

With just this data — no prompt, no completion — you can answer every operationally relevant question:

How much is our AI tooling costing us this month? (sum of cost_usd)
Which developer is spending the most? (group by user)
Which model are we using most? (group by model)
Are we using prompt caching effectively? (cache_read ratio)
Is latency degrading? (latency_ms over time)
Which projects are AI-intensive? (group by project)
Is a developer hammering the API? (call frequency per session and per user)
Are budget rules working? (cost_usd by user vs. rule limit)
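
To make the questions above concrete, here is a minimal Python sketch of three of them: total spend, spend grouped by a dimension, and cache effectiveness. The sample signals and the function names are hypothetical, and the cache ratio assumes that `input` counts only uncached tokens, which the signal format does not spell out.

```python
from collections import defaultdict

# Hypothetical sample signals, trimmed to the fields used below.
signals = [
    {"model": "claude-sonnet-4-5", "cost_usd": 0.50,
     "tokens": {"input": 3000, "cache_read": 1000},
     "dimensions": {"user": "alice@co.com", "project": "backend-api"}},
    {"model": "claude-sonnet-4-5", "cost_usd": 0.25,
     "tokens": {"input": 1000, "cache_read": 3000},
     "dimensions": {"user": "bob@co.com", "project": "backend-api"}},
    {"model": "claude-sonnet-4-5", "cost_usd": 0.25,
     "tokens": {"input": 2000, "cache_read": 0},
     "dimensions": {"user": "alice@co.com", "project": "web-app"}},
]


def total_cost(signals):
    """Total spend is the sum of per-call costs."""
    return sum(s["cost_usd"] for s in signals)


def cost_by_dimension(signals, dimension):
    """Spend grouped by one entry of `dimensions`, e.g. 'user' or 'project'."""
    totals = defaultdict(float)
    for s in signals:
        totals[s["dimensions"][dimension]] += s["cost_usd"]
    return dict(totals)


def cache_read_ratio(signals):
    """Share of prompt tokens served from cache.

    Assumption: `input` excludes cached tokens, so the denominator
    is input + cache_read.
    """
    cached = sum(s["tokens"]["cache_read"] for s in signals)
    uncached = sum(s["tokens"]["input"] for s in signals)
    total = cached + uncached
    return cached / total if total else 0.0


spend = total_cost(signals)
spend_by_user = cost_by_dimension(signals, "user")
cache_ratio = cache_read_ratio(signals)
```

None of these functions touches a prompt or a completion; they read only counts, costs, and dimensions.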

You can also answer the compliance questions that enterprise customers care about:

Can you show that no AI calls were made from production, where sensitive data lives? (environment dimension)
Do you have an audit trail of all AI usage? (session_id + ts + user)
Can you generate a GDPR data export for a departing employee? (filter by user)
Can you demonstrate policy compliance to your SOC 2 auditor? (rule enforcement log)
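
Two of these compliance questions reduce to plain filters over the signal stream. The sketch below is illustrative only: the function names and sample data are hypothetical, and producing an export like this does not by itself satisfy any regulation.

```python
import json

# Hypothetical sample signals, trimmed to the fields used below.
signals = [
    {"ts": 1716840000000, "session_id": "sess_a", "cost_usd": 0.01,
     "dimensions": {"user": "alice@co.com", "environment": "dev"}},
    {"ts": 1716840060000, "session_id": "sess_b", "cost_usd": 0.02,
     "dimensions": {"user": "bob@co.com", "environment": "dev"}},
    {"ts": 1716839940000, "session_id": "sess_c", "cost_usd": 0.03,
     "dimensions": {"user": "alice@co.com", "environment": "staging"}},
]


def export_for_user(signals, user):
    """Every signal attributed to `user`, oldest first."""
    rows = [s for s in signals if s["dimensions"].get("user") == user]
    return sorted(rows, key=lambda s: s["ts"])


def calls_in_environment(signals, environment):
    """Every signal recorded with the given environment dimension."""
    return [s for s in signals
            if s["dimensions"].get("environment") == environment]


# A per-user export, serialised for hand-off.
alice_export = json.dumps(export_for_user(signals, "alice@co.com"), indent=2)

# An empty list here is the evidence that nothing ran in production.
production_calls = calls_in_environment(signals, "production")
```

The second check is only as trustworthy as the `environment` dimension itself, so it matters how that value is set on the client.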

What Payload Storage Gets Wrong

The moment you store prompt content, you've created a liability. Consider what developers actually put in prompts when working on code:

API keys and secrets that are in context (even accidentally)
Internal architecture decisions and system designs
Customer data that ended up in a debugging session
PII from database queries run as context
Proprietary algorithms and business logic

If you're storing all of that, you've now created a high-value exfiltration target. A breach of your observability system becomes a breach of your entire codebase, your architecture, and potentially your customer data. The blast radius is enormous.

Beyond the security risk, there's the privacy risk. Under GDPR, storing personal data in conversation logs creates obligations around retention, right-to-deletion, and data minimization that are genuinely difficult to satisfy. Under HIPAA, if any PHI ever appears in a prompt — even if it wasn't supposed to — your payload store is now a covered system with BAA requirements.
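
One way to avoid building such a store in the first place is to drop payload fields before an event ever leaves the developer's machine. The sketch below uses an allowlist of the top-level signal fields shown earlier. The shape of `raw_event`, the `to_signal` name, and the idea that filtering happens at this point are all assumptions for illustration, not a description of FORG's agent.

```python
# Top-level fields from the example signal. Anything else is dropped.
ALLOWED_FIELDS = frozenset({
    "ts", "session_id", "adapter", "model",
    "tokens", "cost_usd", "latency_ms", "dimensions",
})


def to_signal(raw_event):
    """Keep only allowlisted top-level fields of a raw event.

    This check is shallow: it does not inspect nested values, so a real
    implementation would also need to validate what sits inside
    `tokens`, `latency_ms`, and `dimensions`.
    """
    return {key: value for key, value in raw_event.items()
            if key in ALLOWED_FIELDS}


# Hypothetical raw event that still carries payload content.
raw_event = {
    "ts": 1716840000000,
    "model": "claude-sonnet-4-5",
    "tokens": {"input": 2847, "output": 412},
    "prompt": "why does this query return duplicate rows?",
    "completion": "The join is missing a condition on tenant_id.",
}

signal = to_signal(raw_event)
```

An allowlist fails closed: a new payload-bearing field added upstream is dropped by default, whereas a blocklist would let it through until someone noticed.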

The k-Anonymity Guarantee

FORG enforces k-anonymity ≥ 5 across all usage reports. This means no individual developer's usage is ever surfaced unless they're part of a group of at least 5 users with similar behavior patterns. Usage that would identify a specific individual is automatically aggregated until it meets the threshold.

In practice: if you have 3 developers, FORG won't show per-developer breakdowns. It will show team-level aggregates. Once you have 5+ developers, individual breakdowns become available — but any pattern that only appears in one or two users' data is suppressed.

This has a practical consequence: you can't use FORG to surveil individual developers. You can see if the team is over budget; you can't see that Alice is making 200 API calls a day unless there are at least 4 other developers doing the same. For teams under 5, all data is aggregated at the org level.
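
As a rough illustration, a minimum-group-size rule can be sketched in a few lines of Python. This is one simple way to approximate the k ≥ 5 behaviour described above, not FORG's actual algorithm, which this post does not specify. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def k_anonymous_costs(signals, group_key, k=5):
    """Cost per group, reported only for groups with at least k distinct users."""
    users = defaultdict(set)
    costs = defaultdict(float)
    for s in signals:
        dims = s["dimensions"]
        group = dims[group_key]
        users[group].add(dims["user"])
        costs[group] += s["cost_usd"]
    # Groups below the threshold are left out of the report entirely.
    return {g: costs[g] for g in costs if len(users[g]) >= k}


# Hypothetical data: five developers on one project, two on another.
signals = (
    [{"cost_usd": 0.5,
      "dimensions": {"user": f"dev{i}@co.com", "project": "backend-api"}}
     for i in range(5)]
    + [{"cost_usd": 0.5,
        "dimensions": {"user": f"dev{i}@co.com", "project": "ml-research"}}
       for i in range(2)]
)

report = k_anonymous_costs(signals, "project")
```

Here `report` contains `backend-api` but not `ml-research`, because only two distinct users contributed to the latter. A production system would also have to guard against differencing, where a suppressed group can be recovered by subtracting the published groups from an org-wide total.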

The Architecture: Where Does Data Live?

FORG's data architecture is designed around minimization. Signals flow from the agent binary to the Rule Engine Worker over HTTPS with mutual TLS. The Rule Engine writes to Supabase with pgvector, using RLS (Row Level Security) to isolate tenants.

Encryption at rest uses AES-256-GCM (Supabase default). Encryption in transit is TLS 1.3. No plaintext ever touches disk on our infrastructure.

For Business+ customers, you can choose data residency: US (default) or EU. The EU residency option routes your signals to EU-region Cloudflare Workers and an EU Supabase instance, with no cross-region data transfer.

What About Debugging?

The most common objection to metadata-only is debugging: "How do I know what caused a cost spike if I don't have the actual prompt?"

In practice, metadata is sufficient for 95% of debugging scenarios:

Cost spike? The signal shows you which model, which user, which project, and the token counts. You know it was a 4,000-token context from the backend-api project, probably the CI pipeline. You don't need the prompt.
Latency regression? TTFT and total latency by model by hour. You can see exactly when it started degrading and correlate with deployments.
Unexpected API usage? The session count and call frequency show you who is making calls and when. You can ask them.
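
The cost-spike case can be sketched with metadata alone: bucket signals by hour and by project, then flag buckets far above the median. The threshold factor, function names, and sample data below are hypothetical, and a median baseline is only one of many reasonable choices.

```python
from collections import defaultdict
from statistics import median

HOUR_MS = 3_600_000  # `ts` is in milliseconds since the epoch


def hourly_cost(signals, dimension="project"):
    """Total cost per (hour bucket, dimension value)."""
    buckets = defaultdict(float)
    for s in signals:
        hour = s["ts"] // HOUR_MS
        buckets[(hour, s["dimensions"][dimension])] += s["cost_usd"]
    return dict(buckets)


def spikes(buckets, factor=3.0):
    """Bucket keys whose cost exceeds `factor` times the median bucket cost."""
    if not buckets:
        return []
    baseline = median(buckets.values())
    return sorted(key for key, cost in buckets.items()
                  if cost > factor * baseline)


# Hypothetical data: four quiet hours, then one expensive hour.
BASE_TS = 1716840000000
signals = [
    {"ts": BASE_TS + h * HOUR_MS, "cost_usd": 0.01,
     "dimensions": {"project": "backend-api"}}
    for h in range(4)
] + [
    {"ts": BASE_TS + 4 * HOUR_MS, "cost_usd": 0.5,
     "dimensions": {"project": "backend-api"}}
]

spike_hours = spikes(hourly_cost(signals))
```

The result names the hour and the project. Swapping `dimension` for `"user"` narrows it to a person you can go and ask, and the same bucketing works for `latency_ms` if you replace the sum with a percentile.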

For the 5% of cases where you genuinely need the prompt for debugging, the developer who made the call has it locally — in their Claude Code history, in their IDE logs, on their machine. It never needed to be on your servers.

The Design Principle

The principle behind metadata-first observability is that a monitoring system should be able to answer every question that's relevant to operating the system it monitors — without containing enough information to become a liability if breached.

Your observability platform shouldn't be a backup copy of your entire codebase and conversation history. It should be a compact, structured record of what happened, when, at what cost, and with what result — enough to understand and govern the system, not enough to recreate it.

That's what FORG stores. That's all FORG stores. And we think it's enough to build genuinely useful AI governance on top of.