PostHog × AutoResearch

// analytics for the three overnight experimentation loops — autoagent · autoresearch · gtm-autoresearch

Overview

Three autonomous loops — AutoAgent, AutoResearch, GTM-AutoResearch — run unattended, produce hundreds of rounds per weekend, and need a single pane of glass for scores, costs, and escalations. PostHog is that pane. Same events across all three; the project property tells them apart.

Treat an agent as a PostHog "user" and a round as a "session." Funnels, retention, cohorts, and experiments fall out of that framing for free.

Projects covered

| project | role | guide |
|---|---|---|
| autoagent | General experimentation harness — propose, deploy, measure, keep or revert. The archetype the other two loops specialize from. | autoagent-autoresearch-guide |
| autoresearch | Karpathy's autonomous ML experimentation loop as reusable substrate — rounds, scorers, stop conditions, fine-tune pipeline. | organized-ai-docs |
| gtm-autoresearch | AutoResearch applied to Google Tag Manager containers. 9-dimension scorer. Nightly → staging workspace + R2 versioned config + fine-tune corpus. | gtm-autoresearch-guide |

One schema, three loops

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  autoagent  │   │autoresearch │   │   gtm-ar    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       ▼                 ▼                 ▼
       ┌─────────────────────────────────────┐
       │  observability/posthog_client.py    │   ← identical file in all three repos
       │  track(event, agent, round_id, …)   │
       └─────────────────┬───────────────────┘
                         ▼
                   ┌───────────┐
                   │  PostHog  │
                   └─────┬─────┘
                         ▼
    funnels · trends · cohorts · experiments · annotations

Setup

1. Install

# Python — the autoresearch orchestrator
pip install posthog

# Node — web dashboards, CF Workers
npm install posthog-node
npm install posthog-js     # browser only

2. Environment

# .env — shared across all three projects
POSTHOG_API_KEY=phc_xxx...
POSTHOG_HOST=https://us.i.posthog.com
POSTHOG_PROJECT_ID=12345

AUTORESEARCH_PROJECT=gtm-autoresearch   # or autoagent | autoresearch
AUTORESEARCH_CLIENT=acme-corp           # optional

3. Drop-in client

Identical file in all three repos at observability/posthog_client.py:

import os, time
from posthog import Posthog

ph = Posthog(os.environ["POSTHOG_API_KEY"],
             host=os.environ.get("POSTHOG_HOST", "https://us.i.posthog.com"))

PROJECT = os.environ["AUTORESEARCH_PROJECT"]
CLIENT  = os.environ.get("AUTORESEARCH_CLIENT", "internal")

def agent_id(name: str) -> str:
    return f"{PROJECT}:{CLIENT}:{name}"

def track(event, *, agent, round_id=None, **props):
    ph.capture(
        distinct_id=agent_id(agent),
        event=event,
        properties={
            "project": PROJECT, "client": CLIENT,
            "round_id": round_id, "ts": time.time(),
            **props,
        },
    )

4. Wire into the loop

from observability.posthog_client import track

def run_round(agent, round_id, proposal):
    track("round_started", agent=agent, round_id=round_id,
          model=proposal.model, diff_lines=len(proposal.diff))
    result = apply_and_score(proposal)
    track("round_scored", agent=agent, round_id=round_id,
          score=result.score, **result.dimensions)
    if result.score > baseline:   # baseline: best kept score so far, maintained by the loop
        track("round_kept", agent=agent, round_id=round_id,
              delta=result.score - baseline, cost_usd=result.cost_usd)
    else:
        track("round_reverted", agent=agent, round_id=round_id,
              reason=result.failure_mode, cost_usd=result.cost_usd)

Privacy. These events describe the agent's behavior, not end-user behavior. If a client's GTM container carries identifiers, redact in PostHog's before_send hook before it leaves your box.
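As a sketch, that redaction can live in a `before_send`-style hook. `scrub` below is a hypothetical scrubber — the email regex is illustrative, so match whatever identifiers your containers actually carry — and the wiring line assumes a posthog-python version that accepts `before_send`.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(event: dict) -> dict:
    # Redact anything that looks like an end-user identifier
    # before the event leaves your box.
    props = event.get("properties", {})
    for key, value in props.items():
        if isinstance(value, str):
            props[key] = EMAIL.sub("[redacted]", value)
    return event

# Wiring (assumes an SDK version that supports before_send):
# ph = Posthog(api_key, host=host, before_send=scrub)
```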

Core events

Nine events. Every loop emits the same set. Per-project differences live in properties, not event names.

| event | what it means | key properties |
|---|---|---|
| round_started | new round begins, before any model call | model · diff_lines · baseline_score · n_prior_rounds |
| proposal_generated | agent produced a candidate change | prompt_tokens · completion_tokens · cost_usd · latency_ms |
| proposal_applied | diff landed in staging (GTM workspace, branch, etc.) | target · diff_bytes · apply_latency_ms |
| round_scored | scorer returned a verdict — GTM includes 9 dim_* props | score · dim_coverage · dim_correctness · dim_resilience · … |
| round_kept | score beat baseline; change promoted | delta · new_best · cost_usd |
| round_reverted | score missed; change rolled back | failure_mode · cost_usd · regressed_dimensions[] |
| model_escalated | cheap model stalled; bumped tier | from_model · to_model · stall_rounds · reason |
| stop_triggered | max rounds, budget, plateau, kill switch | reason · rounds_completed · total_cost_usd · final_best_score |
| finetune_batch_published | kept rounds rolled into training batch | batch_id · n_examples · r2_key · target_model |

Property envelope (always present)

| property | purpose |
|---|---|
| project | separates autoagent · autoresearch · gtm-autoresearch |
| client | multi-tenant filter |
| run_id | ULID per nightly run — group rounds into batches |
| round_id | ULID per round — join events across a round's lifecycle |
| agent | which agent role emitted it (planner · critic · coder · …) |
| git_sha | short SHA of the loop code — correlate score drops to regressions |
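As an illustrative sketch (not part of the shipped client), the envelope can be built by a single helper so no event forgets a field. `GIT_SHA` as an env var is an assumption — read the SHA however your deploy pipeline exposes it.

```python
import os

def envelope(agent: str, run_id: str, round_id: str, **props) -> dict:
    # Every event carries the same envelope; per-event fields ride in **props.
    return {
        "project": os.environ.get("AUTORESEARCH_PROJECT", "autoresearch"),
        "client": os.environ.get("AUTORESEARCH_CLIENT", "internal"),
        "run_id": run_id,
        "round_id": round_id,
        "agent": agent,
        "git_sha": os.environ.get("GIT_SHA", "dev"),  # assumption: set at deploy time
        **props,
    }
```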

Dashboards

Six boards that answer the six questions you'll ask every morning.

| board | insight type | answers |
|---|---|---|
| Round Funnel | Funnel: started → generated → applied → scored → kept | where rounds drop off |
| Score Trend | Line: max(round_scored.score) per day, by project | is each night improving on the last? |
| Dimension Heatmap (GTM) | Stickiness: avg of each dim_* on round_scored | which dims chronically gate the score? |
| Cost / Kept | Formula: sum(cost_usd) / count(round_kept) per day | dollars per kept config — trending? |
| Escalation Chart | Line: model_escalated count by to_model | escalating too eagerly? too late? |
| Stop-Reason Mix | Breakdown: stop_triggered.reason over time | budget-capped · plateau · round cap? |

HogQL — score trend sketch

SELECT
  toDate(timestamp) AS day,
  properties.project AS project,
  max(toFloat(properties.score)) AS best_score,
  count() AS n_rounds
FROM events
WHERE event = 'round_scored'
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY day, project
ORDER BY day DESC

LLM queries — every prompt & response

Wrap the model client once. Every call becomes a $ai_generation event with prompt, completion, token counts, cost in USD, latency, and any properties you attach — correlated to the parent round_id automatically.

from posthog.ai.anthropic import Anthropic

client = Anthropic(posthog_client=ph)   # drop-in replacement

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{"role": "user", "content": proposal_prompt}],
    posthog_distinct_id=agent_id("planner"),
    posthog_properties={
        "round_id": round_id,
        "project": PROJECT,
        "phase": "proposal",          # proposal | critique | score | refine
        "task_type": "gtm.rewrite",   # matches token-machine taxonomy
    },
)

What you get for free

| captured property | shape | why it matters |
|---|---|---|
| $ai_input | full prompt messages array | inspect any bad round: what did the planner actually see? |
| $ai_output_choices | completion text(s) | sanity-check refusals, JSON corruption, truncation |
| $ai_input_tokens / $ai_output_tokens | integers | per-request cost math and prompt-bloat detection |
| $ai_total_cost_usd | float | roll up to any cohort (client, task_type, phase) |
| $ai_latency | float seconds | find the p95 offenders — usually long-context prompts |
| $ai_model / $ai_provider | strings | compare models apples-to-apples on the same task |
| $ai_is_error | boolean | rate limits, timeouts, and schema mismatches surface as first-class events |

Redaction at the boundary. Pass posthog_privacy_mode=True to capture token counts + cost without the prompt text itself. For selective redaction, strip fields in your own wrapper before handing the client to the loop — PostHog never sees what you don't send.
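A minimal sketch of that selective redaction, assuming you own the wrapper layer (`redact_messages` is a hypothetical helper, not a PostHog API): keep role and length metadata, drop the text.

```python
def redact_messages(messages: list[dict]) -> list[dict]:
    # Keep role + length metadata; drop the prompt text itself.
    return [
        {"role": m["role"], "content": f"<redacted {len(m['content'])} chars>"}
        for m in messages
    ]
```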

Multi-provider? Same pattern.

from posthog.ai.openai    import OpenAI
from posthog.ai.anthropic import Anthropic
from posthog.ai.gemini    import Client as Gemini
from posthog.ai.langchain import CallbackHandler   # for chains/agents

All of them emit $ai_generation with the same envelope — so a model swap is a one-line change and the dashboards don't move.

Logs — structured agent activity

A round isn't just "generate → score." It's a sequence of steps the agent takes: plan, tool call, observation, reflection, retry. Each step is a log event. When you can replay a round step-by-step in PostHog's activity view, debugging becomes a scroll instead of a stare at a stdout dump.

The six step events

| event | when fired | key properties |
|---|---|---|
| step.plan | agent decides what to do next | plan · next_action · confidence |
| step.tool_called | tool / function invocation | tool · args_json · timeout_ms |
| step.tool_returned | tool result back | tool · ok · latency_ms · result_bytes |
| step.observation | external state sampled (metric, log tail) | source · value · delta_from_last |
| step.reflection | agent critiques own output | verdict · issues[] · retry |
| step.retry | agent retries after failure | cause · attempt · prior_error |

import json, time
from observability.posthog_client import track

# ulid() and TOOLS (a name → callable registry) are loop-level helpers
# assumed by the snippets below.
def step_plan(agent, round_id, plan, next_action):
    track("step.plan", agent=agent, round_id=round_id,
          step_id=ulid(), plan=plan[:500], next_action=next_action)

def step_tool(agent, round_id, tool, args, timeout_ms):
    step_id = ulid()
    track("step.tool_called", agent=agent, round_id=round_id,
          step_id=step_id, tool=tool, args_json=json.dumps(args)[:2000],
          timeout_ms=timeout_ms)
    t0 = time.time()
    try:
        result = TOOLS[tool](**args)
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=True,
              latency_ms=int((time.time()-t0)*1000),
              result_bytes=len(str(result)))
        return result
    except Exception as e:
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=False, error=str(e)[:400],
              latency_ms=int((time.time()-t0)*1000))
        raise

Step ULIDs matter. Every step gets a step_id ULID. Pair step.tool_called with its matching step.tool_returned via that id — otherwise a retried or re-ordered tool call will corrupt your funnel.
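A sketch of that pairing logic (hypothetical helper — in practice you would run the equivalent as a HogQL join): match each step.tool_called to its step.tool_returned by step_id, tolerating interleaved calls and missing returns.

```python
def pair_tool_steps(events: list[dict]) -> list[tuple[dict, dict]]:
    # events: time-ordered dicts with at least "event" and "step_id".
    pending: dict[str, dict] = {}
    pairs: list[tuple[dict, dict]] = []
    for e in events:
        if e["event"] == "step.tool_called":
            pending[e["step_id"]] = e
        elif e["event"] == "step.tool_returned":
            call = pending.pop(e["step_id"], None)
            if call is not None:
                pairs.append((call, e))  # unmatched returns are dropped
    return pairs
```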

Evals — capture the scoring process, not just the score

The GTM 9-dimension scorer (and whatever AutoAgent/AutoResearch grow into) is itself an LLM-powered program. If you only capture the final score, you lose the why — and the why is what you use to improve the scorer.

Three eval events

| event | meaning | properties |
|---|---|---|
| eval.run | a full eval suite started on a candidate | suite · n_cases · scorer_version |
| eval.case | one case within the suite — per-dimension detail | case_id · dim · score · rationale · passed |
| eval.regression | a previously-passing case now fails | case_id · dim · prior_score · new_score · delta |

# Assumes loop-level helpers: ulid(), llm_judge(), GTM_DIMENSIONS (dimension
# specs with .key / .rubric / .threshold) and prior_best (best per-dim scores).
def run_eval_suite(agent, round_id, candidate):
    suite_id = ulid()
    track("eval.run", agent=agent, round_id=round_id,
          suite_id=suite_id, suite="gtm-9dim", scorer_version="v2.3",
          n_cases=9)

    scores = {}
    for dim in GTM_DIMENSIONS:
        verdict = llm_judge(candidate, rubric=dim.rubric)   # also traced as $ai_generation
        scores[dim.key] = verdict.score
        track("eval.case", agent=agent, round_id=round_id,
              suite_id=suite_id, case_id=dim.key, dim=dim.key,
              score=verdict.score, rationale=verdict.rationale[:800],
              passed=verdict.score >= dim.threshold)

        if dim.key in prior_best and verdict.score < prior_best[dim.key] - 0.05:
            track("eval.regression", agent=agent, round_id=round_id,
                  case_id=dim.key, dim=dim.key,
                  prior_score=prior_best[dim.key], new_score=verdict.score,
                  delta=verdict.score - prior_best[dim.key])

    return scores

Insights this unlocks

Traces — one round, end to end

A trace is the story of a round from round_started to round_kept (or round_reverted), with every sub-step and every LLM call threaded under it. PostHog's LLM obs auto-creates traces when events share a $ai_trace_id; we use round_id as that trace id so everything in a round lives under one timeline.

round_id = 01HXYZ...              ←── trace root
│
├─ round_started                   (t=0ms)
├─ step.plan                       (t=12ms)
├─ $ai_generation  planner         (t=20ms,  842ms, $0.004)
├─ step.tool_called  apply_diff    (t=880ms)
├─ step.tool_returned  apply_diff  (t=1.2s)
├─ eval.run                        (t=1.2s, suite=gtm-9dim)
│   ├─ $ai_generation  judge:dim0  (t=1.3s,  410ms)
│   ├─ eval.case  dim0             (t=1.7s, passed)
│   ├─ $ai_generation  judge:dim1  (t=1.7s,  395ms)
│   └─ eval.case  dim1             (t=2.1s, passed)
├─ round_scored                    (t=8.4s, score=0.78)
└─ round_kept                      (t=8.5s, delta=+0.06, cost=$0.042)
# Set once per round — every generation + event captured inside inherits it.
# (Context tagging needs a recent posthog-python; the shape is roughly:)
with posthog.new_context():
    posthog.tag("$ai_trace_id", round_id)
    ...  # run the round's steps inside this context

# or pass explicitly on each LLM call
client.messages.create(
    model="claude-sonnet-4-5",
    messages=[...],
    posthog_distinct_id=agent_id("judge"),
    posthog_properties={"$ai_trace_id": round_id, "$ai_span_id": "judge.dim0"},
)

In PostHog's LLM Observability → Traces view, click the trace for round_id 01HXYZ... and you see the full tree above, with every prompt, completion, tool call, and eval-case side-by-side, ordered by timestamp. That is what replayability looks like.

Trace durability. A trace works even if events arrive out of order or across multiple processes — the orchestrator on claw, the worker on mbp, and the browser on jordan can all stamp the same $ai_trace_id and PostHog stitches them. Use ULIDs for round_id: time-sortable, unique across machines, no coordination needed.
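If you'd rather not take a dependency, a minimal ULID sketch (48-bit millisecond timestamp + 80 random bits, Crockford base32) is enough for time-sortable ids; the python-ulid package is the more battle-tested route.

```python
import secrets
import time

_B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32 — sorted, so ids sort by time

def ulid() -> str:
    # 48-bit ms timestamp + 80 random bits → 26 base32 chars, most significant first.
    n = (int(time.time() * 1000) << 80) | secrets.randbits(80)
    return "".join(_B32[(n >> (5 * i)) & 31] for i in reversed(range(26)))
```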

Bridge — Token Machine

Token Machine sits in front of every model call the team makes, routes by task-type, grades users, and already emits three PostHog events of its own. The autoresearch loops push into the same PostHog project, so Token Machine's efficiency view and the autoresearch dashboards correlate through shared properties.

Token Machine emits

posthog.capture('token_machine.request', {
  user_id, task_type, model_used,
  input_tokens, output_tokens, cost_usd,
  quality_score, efficiency_grade, latency_ms,
})

posthog.capture('token_machine.team_summary', {
  total_requests, total_cost, avg_quality,
  worst_performer, best_performer,
  escalation_candidates: ['task_type_a', 'task_type_b'],
})

posthog.capture('token_machine.anomaly', {
  type, user_id, task_type, recommendation,
})

Shared properties — the correlation seam

| property | autoresearch uses | token-machine uses |
|---|---|---|
| task_type | gtm.rewrite · ar.propose · aa.refine | same taxonomy — don't fork |
| user_id / distinct_id | {project}:{client}:{agent} | same id — agents grade the same as humans |
| model_used / $ai_model | PostHog LLM obs auto-populates $ai_model | TM writes model_used; alias them in a dashboard |
| cost_usd / $ai_total_cost_usd | from $ai_generation | from token_machine.request |
| quality_score | from eval.run avg | from TM's grader — feed TM's score back as a property on round_scored |

Three joined boards

  1. Model-fit by task_type — avg quality_score broken down by task_type × model_used, across both token_machine.request and round_scored. Tells you which local OpenClaw endpoint is genuinely replacing Claude and which is faking it.
  2. Cost compression trend — line: weekly sum(cost_usd) by task_type, with an annotation every time TM publishes a new routing rule. Is the autoresearch-driven routing cutting cost over time?
  3. Anomaly → round regression — join token_machine.anomaly.type='wrong_model' to round_reverted.failure_mode within the same hour. Does a TM misrouting cause downstream revert cascades?

Don't double-count cost. PostHog LLM obs and Token Machine both record cost_usd for the same call. Pick one source of truth per board and filter the other out — usually TM for routed calls, LLM obs for direct calls.

Feature flags & experiments

Flags turn autoresearch behaviors into controllable knobs. Which scorer version tonight? Escalate after 3 stalls or 5? Meta-experiments on the experimenter — perfect for flags that the loop reads at round start.

use_v2 = ph.feature_enabled("scorer_v2", distinct_id=agent_id("planner"))
score = score_v2(config) if use_v2 else score_v1(config)

track("round_scored", agent="planner", round_id=round_id,
      score=score, scorer_variant="v2" if use_v2 else "v1")

Launch a formal experiment with scorer_v2 as the flag and max(round_scored.score) as the metric. After a week PostHog tells you which variant won with confidence intervals — same stats engine SaaS products use to pick a pricing page, applied to which scorer your agent uses.

Useful flags

Annotations

Every deploy, prompt rewrite, and config change gets a vertical line on every chart. Wire a git post-commit hook to PostHog's annotations API and "did that prompt rewrite on Thursday hurt the score?" becomes obvious at a glance.

# post-commit hook → PostHog annotation
curl -X POST https://us.i.posthog.com/api/projects/$POSTHOG_PROJECT_ID/annotations/ \
  -H "Authorization: Bearer $POSTHOG_PERSONAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "planner prompt: tightened role spec, added scoring rubric in-context",
    "date_marker": "2026-04-19T03:00:00Z",
    "scope": "project"
  }'

Ship it

  1. One PostHog project. All three loops write to it; project property separates them.
  2. Drop observability/posthog_client.py into each repo. Identical file; zero drift.
  3. Wrap the model client — one line change, automatic generation tracking.
  4. Build Round Funnel + Score Trend first. These two are 90% of the daily value.
  5. Wire the git annotation hook so every prompt tweak shows up as a vertical line everywhere.

First-week win. Before anything fancy, just the Round Funnel + Score Trend boards will tell you (a) which loop step is leaking rounds and (b) whether last night improved on the night before. That alone is worth the 30 minutes.

Pitfalls