PostHog × AutoResearch

// analytics for the three overnight experimentation loops — autoagent · autoresearch · gtm-autoresearch

Overview

Three autonomous loops — AutoAgent, AutoResearch, GTM-AutoResearch — run unattended, produce hundreds of rounds per weekend, and need a single pane of glass for scores, costs, and escalations. PostHog is that pane. Same events across all three; the project property tells them apart.

Treat an agent as a PostHog "user" and a round as a "session." Funnels, retention, cohorts, and experiments fall out of that framing for free.

Projects covered

| project | role | guide |
|---|---|---|
| autoagent | General experimentation harness — propose, deploy, measure, keep or revert. The archetype the other two loops specialize from. | autoagent-autoresearch-guide |
| autoresearch | Karpathy's autonomous ML experimentation loop as reusable substrate — rounds, scorers, stop conditions, fine-tune pipeline. | organized-ai-docs |
| gtm-autoresearch | AutoResearch applied to Google Tag Manager containers. 9-dimension scorer. Nightly → staging workspace + R2 versioned config + fine-tune corpus. | gtm-autoresearch-guide |

One schema, three loops

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  autoagent  │   │autoresearch │   │   gtm-ar    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       ▼                 ▼                 ▼
       ┌─────────────────────────────────────┐
       │  observability/posthog_client.py    │   ← identical file in all three repos
       │  track(event, agent, round_id, …)   │
       └─────────────────┬───────────────────┘
                         ▼
                   ┌───────────┐
                   │  PostHog  │
                   └─────┬─────┘
                         ▼
    funnels · trends · cohorts · experiments · annotations

Setup

1. Install

# Python — the autoresearch orchestrator
pip install posthog

# Node — web dashboards, CF Workers
npm install posthog-node
npm install posthog-js     # browser only

2. Environment

# .env — shared across all three projects
POSTHOG_API_KEY=phc_xxx...
POSTHOG_HOST=https://us.i.posthog.com
POSTHOG_PROJECT_ID=12345

AUTORESEARCH_PROJECT=gtm-autoresearch   # or autoagent | autoresearch
AUTORESEARCH_CLIENT=acme-corp           # optional

3. Drop-in client

Identical file in all three repos at observability/posthog_client.py:

import os, time
from posthog import Posthog

ph = Posthog(os.environ["POSTHOG_API_KEY"],
             host=os.environ.get("POSTHOG_HOST", "https://us.i.posthog.com"))

PROJECT = os.environ["AUTORESEARCH_PROJECT"]
CLIENT  = os.environ.get("AUTORESEARCH_CLIENT", "internal")

def agent_id(name: str) -> str:
    return f"{PROJECT}:{CLIENT}:{name}"

def track(event, *, agent, round_id=None, **props):
    ph.capture(
        distinct_id=agent_id(agent),
        event=event,
        properties={
            "project": PROJECT, "client": CLIENT,
            "round_id": round_id, "ts": time.time(),
            **props,
        },
    )

4. Wire into the loop

from observability.posthog_client import track

def run_round(agent, round_id, proposal):
    track("round_started", agent=agent, round_id=round_id,
          model=proposal.model, diff_lines=len(proposal.diff))
    result = apply_and_score(proposal)
    track("round_scored", agent=agent, round_id=round_id,
          score=result.score, **result.dimensions)
    if result.score > baseline:   # baseline: best kept score so far, maintained by the loop
        track("round_kept", agent=agent, round_id=round_id,
              delta=result.score - baseline, cost_usd=result.cost_usd)
    else:
        track("round_reverted", agent=agent, round_id=round_id,
              reason=result.failure_mode, cost_usd=result.cost_usd)

Privacy. These events describe the agent's behavior, not end-user behavior. If a client's GTM container carries identifiers, redact in PostHog's before_send hook before it leaves your box.
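As a sketch, that redaction can live in a `before_send`-style hook. `scrub` below is a hypothetical scrubber — the email regex is illustrative, so match whatever identifiers your containers actually carry — and the wiring line assumes a posthog-python version that accepts `before_send`.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(event: dict) -> dict:
    # Redact anything that looks like an end-user identifier
    # before the event leaves your box.
    props = event.get("properties", {})
    for key, value in props.items():
        if isinstance(value, str):
            props[key] = EMAIL.sub("[redacted]", value)
    return event

# Wiring (assumes an SDK version that supports before_send):
# ph = Posthog(api_key, host=host, before_send=scrub)
```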

Core events

Nine events. Every loop emits the same set. Per-project differences live in properties, not event names.

| event | what it means | key properties |
|---|---|---|
| round_started | new round begins, before any model call | model · diff_lines · baseline_score · n_prior_rounds |
| proposal_generated | agent produced a candidate change | prompt_tokens · completion_tokens · cost_usd · latency_ms |
| proposal_applied | diff landed in staging (GTM workspace, branch, etc.) | target · diff_bytes · apply_latency_ms |
| round_scored | scorer returned a verdict — GTM includes 9 dim_* props | score · dim_coverage · dim_correctness · dim_resilience · … |
| round_kept | score beat baseline; change promoted | delta · new_best · cost_usd |
| round_reverted | score missed; change rolled back | failure_mode · cost_usd · regressed_dimensions[] |
| model_escalated | cheap model stalled; bumped tier | from_model · to_model · stall_rounds · reason |
| stop_triggered | max rounds, budget, plateau, kill switch | reason · rounds_completed · total_cost_usd · final_best_score |
| finetune_batch_published | kept rounds rolled into training batch | batch_id · n_examples · r2_key · target_model |

Property envelope (always present)

| property | purpose |
|---|---|
| project | separates autoagent · autoresearch · gtm-autoresearch |
| client | multi-tenant filter |
| run_id | ULID per nightly run — group rounds into batches |
| round_id | ULID per round — join events across a round's lifecycle |
| agent | which agent role emitted it (planner · critic · coder · …) |
| git_sha | short SHA of the loop code — correlate score drops to regressions |
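As an illustrative sketch (not part of the shipped client), the envelope can be built by a single helper so no event forgets a field. `GIT_SHA` as an env var is an assumption — read the SHA however your deploy pipeline exposes it.

```python
import os

def envelope(agent: str, run_id: str, round_id: str, **props) -> dict:
    # Every event carries the same envelope; per-event fields ride in **props.
    return {
        "project": os.environ.get("AUTORESEARCH_PROJECT", "autoresearch"),
        "client": os.environ.get("AUTORESEARCH_CLIENT", "internal"),
        "run_id": run_id,
        "round_id": round_id,
        "agent": agent,
        "git_sha": os.environ.get("GIT_SHA", "dev"),  # assumption: set at deploy time
        **props,
    }
```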

Dashboards

Six boards that answer the six questions you'll ask every morning.

| board | insight type | answers |
|---|---|---|
| Round Funnel | Funnel: started → generated → applied → scored → kept | where rounds drop off |
| Score Trend | Line: max(round_scored.score) per day, by project | is each night improving on the last? |
| Dimension Heatmap (GTM) | Stickiness: avg of each dim_* on round_scored | which dims chronically gate the score? |
| Cost / Kept | Formula: sum(cost_usd) / count(round_kept) per day | dollars per kept config — trending? |
| Escalation Chart | Line: model_escalated count by to_model | escalating too eagerly? too late? |
| Stop-Reason Mix | Breakdown: stop_triggered.reason over time | budget-capped · plateau · round cap? |

HogQL — score trend sketch

SELECT
  toDate(timestamp) AS day,
  properties.project AS project,
  max(toFloat(properties.score)) AS best_score,
  count() AS n_rounds
FROM events
WHERE event = 'round_scored'
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY day, project
ORDER BY day DESC

LLM queries — every prompt & response

Wrap the model client once. Every call becomes a $ai_generation event with prompt, completion, token counts, cost in USD, latency, and any properties you attach — correlated to the parent round_id automatically.

from posthog.ai.anthropic import Anthropic

client = Anthropic(posthog_client=ph)   # drop-in replacement

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{"role": "user", "content": proposal_prompt}],
    posthog_distinct_id=agent_id("planner"),
    posthog_properties={
        "round_id": round_id,
        "project": PROJECT,
        "phase": "proposal",          # proposal | critique | score | refine
        "task_type": "gtm.rewrite",   # matches token-machine taxonomy
    },
)

What you get for free

| captured property | shape | why it matters |
|---|---|---|
| $ai_input | full prompt messages array | inspect any bad round: what did the planner actually see? |
| $ai_output_choices | completion text(s) | sanity-check refusals, JSON corruption, truncation |
| $ai_input_tokens / $ai_output_tokens | integers | per-request cost math and prompt-bloat detection |
| $ai_total_cost_usd | float | roll up to any cohort (client, task_type, phase) |
| $ai_latency | float seconds | find the p95 offenders — usually long-context prompts |
| $ai_model / $ai_provider | strings | compare models apples-to-apples on the same task |
| $ai_is_error | boolean | rate limits, timeouts, and schema mismatches surface as first-class events |

Redaction at the boundary. Pass posthog_privacy_mode=True to capture token counts + cost without the prompt text itself. For selective redaction, strip fields in your own wrapper before handing the client to the loop — PostHog never sees what you don't send.
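A minimal sketch of that selective redaction, assuming you own the wrapper layer (`redact_messages` is a hypothetical helper, not a PostHog API): keep role and length metadata, drop the text.

```python
def redact_messages(messages: list[dict]) -> list[dict]:
    # Keep role + length metadata; drop the prompt text itself.
    return [
        {"role": m["role"], "content": f"<redacted {len(m['content'])} chars>"}
        for m in messages
    ]
```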

Multi-provider? Same pattern.

from posthog.ai.openai    import OpenAI
from posthog.ai.anthropic import Anthropic
from posthog.ai.gemini    import Client as Gemini
from posthog.ai.langchain import CallbackHandler   # for chains/agents

All of them emit $ai_generation with the same envelope — so a model swap is a one-line change and the dashboards don't move.

Logs — structured agent activity

A round isn't just "generate → score." It's a sequence of steps the agent takes: plan, tool call, observation, reflection, retry. Each step is a log event. When you can replay a round step-by-step in PostHog's activity view, debugging becomes a scroll instead of a stare at a stdout dump.

The six step events

| event | when fired | key properties |
|---|---|---|
| step.plan | agent decides what to do next | plan · next_action · confidence |
| step.tool_called | tool / function invocation | tool · args_json · timeout_ms |
| step.tool_returned | tool result back | tool · ok · latency_ms · result_bytes |
| step.observation | external state sampled (metric, log tail) | source · value · delta_from_last |
| step.reflection | agent critiques own output | verdict · issues[] · retry |
| step.retry | agent retries after failure | cause · attempt · prior_error |

import json, time
from observability.posthog_client import track

# ulid() and TOOLS (a name → callable registry) are loop-level helpers
# assumed by the snippets below.
def step_plan(agent, round_id, plan, next_action):
    track("step.plan", agent=agent, round_id=round_id,
          step_id=ulid(), plan=plan[:500], next_action=next_action)

def step_tool(agent, round_id, tool, args, timeout_ms):
    step_id = ulid()
    track("step.tool_called", agent=agent, round_id=round_id,
          step_id=step_id, tool=tool, args_json=json.dumps(args)[:2000],
          timeout_ms=timeout_ms)
    t0 = time.time()
    try:
        result = TOOLS[tool](**args)
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=True,
              latency_ms=int((time.time()-t0)*1000),
              result_bytes=len(str(result)))
        return result
    except Exception as e:
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=False, error=str(e)[:400],
              latency_ms=int((time.time()-t0)*1000))
        raise

Step ULIDs matter. Every step gets a step_id ULID. Pair step.tool_called with its matching step.tool_returned via that id — otherwise a retried or re-ordered tool call will corrupt your funnel.
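A sketch of that pairing logic (hypothetical helper — in practice you would run the equivalent as a HogQL join): match each step.tool_called to its step.tool_returned by step_id, tolerating interleaved calls and missing returns.

```python
def pair_tool_steps(events: list[dict]) -> list[tuple[dict, dict]]:
    # events: time-ordered dicts with at least "event" and "step_id".
    pending: dict[str, dict] = {}
    pairs: list[tuple[dict, dict]] = []
    for e in events:
        if e["event"] == "step.tool_called":
            pending[e["step_id"]] = e
        elif e["event"] == "step.tool_returned":
            call = pending.pop(e["step_id"], None)
            if call is not None:
                pairs.append((call, e))  # unmatched returns are dropped
    return pairs
```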

Evals — capture the scoring process, not just the score

The GTM 9-dimension scorer (and whatever AutoAgent/AutoResearch grow into) is itself an LLM-powered program. If you only capture the final score, you lose the why — and the why is what you use to improve the scorer.

Three eval events

| event | meaning | properties |
|---|---|---|
| eval.run | a full eval suite started on a candidate | suite · n_cases · scorer_version |
| eval.case | one case within the suite — per-dimension detail | case_id · dim · score · rationale · passed |
| eval.regression | a previously-passing case now fails | case_id · dim · prior_score · new_score · delta |

# Assumes loop-level helpers: ulid(), llm_judge(), GTM_DIMENSIONS (dimension
# specs with .key / .rubric / .threshold) and prior_best (best per-dim scores).
def run_eval_suite(agent, round_id, candidate):
    suite_id = ulid()
    track("eval.run", agent=agent, round_id=round_id,
          suite_id=suite_id, suite="gtm-9dim", scorer_version="v2.3",
          n_cases=9)

    scores = {}
    for dim in GTM_DIMENSIONS:
        verdict = llm_judge(candidate, rubric=dim.rubric)   # also traced as $ai_generation
        scores[dim.key] = verdict.score
        track("eval.case", agent=agent, round_id=round_id,
              suite_id=suite_id, case_id=dim.key, dim=dim.key,
              score=verdict.score, rationale=verdict.rationale[:800],
              passed=verdict.score >= dim.threshold)

        if dim.key in prior_best and verdict.score < prior_best[dim.key] - 0.05:
            track("eval.regression", agent=agent, round_id=round_id,
                  case_id=dim.key, dim=dim.key,
                  prior_score=prior_best[dim.key], new_score=verdict.score,
                  delta=verdict.score - prior_best[dim.key])

    return scores

Insights this unlocks

Traces — one round, end to end

A trace is the story of a round from round_started to round_kept (or round_reverted), with every sub-step and every LLM call threaded under it. PostHog's LLM obs auto-creates traces when events share a $ai_trace_id; we use round_id as that trace id so everything in a round lives under one timeline.

round_id = 01HXYZ...              ←── trace root
│
├─ round_started                   (t=0ms)
├─ step.plan                       (t=12ms)
├─ $ai_generation  planner         (t=20ms,  842ms, $0.004)
├─ step.tool_called  apply_diff    (t=880ms)
├─ step.tool_returned  apply_diff  (t=1.2s)
├─ eval.run                        (t=1.2s, suite=gtm-9dim)
│   ├─ $ai_generation  judge:dim0  (t=1.3s,  410ms)
│   ├─ eval.case  dim0             (t=1.7s, passed)
│   ├─ $ai_generation  judge:dim1  (t=1.7s,  395ms)
│   └─ eval.case  dim1             (t=2.1s, passed)
├─ round_scored                    (t=8.4s, score=0.78)
└─ round_kept                      (t=8.5s, delta=+0.06, cost=$0.042)
# Set once per round — every generation + event captured inside inherits it.
# (Context tagging needs a recent posthog-python; the shape is roughly:)
with posthog.new_context():
    posthog.tag("$ai_trace_id", round_id)
    ...  # run the round's steps inside this context

# or pass explicitly on each LLM call
client.messages.create(
    model="claude-sonnet-4-5",
    messages=[...],
    posthog_distinct_id=agent_id("judge"),
    posthog_properties={"$ai_trace_id": round_id, "$ai_span_id": "judge.dim0"},
)

In PostHog's LLM Observability → Traces view, click the trace for round_id 01HXYZ... and you see the full tree above, with every prompt, completion, tool call, and eval-case side-by-side, ordered by timestamp. That is what replayability looks like.

Trace durability. A trace works even if events arrive out of order or across multiple processes — the orchestrator on claw, the worker on mbp, and the browser on jordan can all stamp the same $ai_trace_id and PostHog stitches them. Use ULIDs for round_id: time-sortable, unique across machines, no coordination needed.
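If you'd rather not take a dependency, a minimal ULID sketch (48-bit millisecond timestamp + 80 random bits, Crockford base32) is enough for time-sortable ids; the python-ulid package is the more battle-tested route.

```python
import secrets
import time

_B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford base32 — sorted, so ids sort by time

def ulid() -> str:
    # 48-bit ms timestamp + 80 random bits → 26 base32 chars, most significant first.
    n = (int(time.time() * 1000) << 80) | secrets.randbits(80)
    return "".join(_B32[(n >> (5 * i)) & 31] for i in reversed(range(26)))
```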

Bridge — Token Machine

Token Machine sits in front of every model call the team makes, routes by task-type, grades users, and already emits three PostHog events of its own. The autoresearch loops push into the same PostHog project, so Token Machine's efficiency view and the autoresearch dashboards correlate through shared properties.

Token Machine emits

posthog.capture('token_machine.request', {
  user_id, task_type, model_used,
  input_tokens, output_tokens, cost_usd,
  quality_score, efficiency_grade, latency_ms,
})

posthog.capture('token_machine.team_summary', {
  total_requests, total_cost, avg_quality,
  worst_performer, best_performer,
  escalation_candidates: ['task_type_a', 'task_type_b'],
})

posthog.capture('token_machine.anomaly', {
  type, user_id, task_type, recommendation,
})

Shared properties — the correlation seam

| property | autoresearch uses | token-machine uses |
|---|---|---|
| task_type | gtm.rewrite · ar.propose · aa.refine | same taxonomy — don't fork |
| user_id / distinct_id | {project}:{client}:{agent} | same id — agents grade the same as humans |
| model_used / $ai_model | PostHog LLM obs auto-populates $ai_model | TM writes model_used; alias them in a dashboard |
| cost_usd / $ai_total_cost_usd | from $ai_generation | from token_machine.request |
| quality_score | from eval.run avg | from TM's grader — feed TM's score back as a property on round_scored |

Three joined boards

  1. Model-fit by task_type — avg quality_score broken down by task_type × model_used, across both token_machine.request and round_scored. Tells you which local OpenClaw endpoint is genuinely replacing Claude and which is faking it.
  2. Cost compression trend — line: weekly sum(cost_usd) by task_type, with an annotation every time TM publishes a new routing rule. Is the autoresearch-driven routing cutting cost over time?
  3. Anomaly → round regression — join token_machine.anomaly.type='wrong_model' to round_reverted.failure_mode within the same hour. Does a TM misrouting cause downstream revert cascades?

Don't double-count cost. PostHog LLM obs and Token Machine both record cost_usd for the same call. Pick one source of truth per board and filter the other out — usually TM for routed calls, LLM obs for direct calls.

Feature flags & experiments

Flags turn autoresearch behaviors into controllable knobs. Which scorer version tonight? Escalate after 3 stalls or 5? Meta-experiments on the experimenter — perfect for flags that the loop reads at round start.

use_v2 = ph.feature_enabled("scorer_v2", distinct_id=agent_id("planner"))
score = score_v2(config) if use_v2 else score_v1(config)

track("round_scored", agent="planner", round_id=round_id,
      score=score, scorer_variant="v2" if use_v2 else "v1")

Launch a formal experiment with scorer_v2 as the flag and max(round_scored.score) as the metric. After a week PostHog tells you which variant won with confidence intervals — same stats engine SaaS products use to pick a pricing page, applied to which scorer your agent uses.

Useful flags

Annotations

Every deploy, prompt rewrite, and config change gets a vertical line on every chart. Wire a git post-commit hook to PostHog's annotations API and "did that prompt rewrite on Thursday hurt the score?" becomes obvious at a glance.

# post-commit hook → PostHog annotation
curl -X POST https://us.i.posthog.com/api/projects/$POSTHOG_PROJECT_ID/annotations/ \
  -H "Authorization: Bearer $POSTHOG_PERSONAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "planner prompt: tightened role spec, added scoring rubric in-context",
    "date_marker": "2026-04-19T03:00:00Z",
    "scope": "project"
  }'

Ship it

  1. One PostHog project. All three loops write to it; project property separates them.
  2. Drop observability/posthog_client.py into each repo. Identical file; zero drift.
  3. Wrap the model client — one line change, automatic generation tracking.
  4. Build Round Funnel + Score Trend first. These two are 90% of the daily value.
  5. Wire the git annotation hook so every prompt tweak shows up as a vertical line everywhere.

First-week win. Before anything fancy, just the Round Funnel + Score Trend boards will tell you (a) which loop step is leaking rounds and (b) whether last night improved on the night before. That alone is worth the 30 minutes.

Pitfalls