# PostHog Analytics for the Three Overnight Experimentation Loops — AutoAgent · AutoResearch · GTM-AutoResearch
Three autonomous loops — AutoAgent, AutoResearch, GTM-AutoResearch — run unattended, produce hundreds of rounds per weekend, and need a single pane of glass for scores, costs, and escalations. PostHog is that pane. Same events across all three; the project property tells them apart.
Treat an agent as a PostHog "user" and a round as a "session." Funnels, retention, cohorts, and experiments fall out of that framing for free.
| project | role | guide |
|---|---|---|
| autoagent | General experimentation harness — propose, deploy, measure, keep or revert. The archetype the other two specialize. | autoagent-autoresearch-guide |
| autoresearch | Karpathy-style autonomous ML experimentation loop as reusable substrate — rounds, scorers, stop conditions, fine-tune pipeline. | organized-ai-docs |
| gtm-autoresearch | AutoResearch applied to Google Tag Manager containers. 9-dimension scorer. Nightly → staging workspace + R2 versioned config + fine-tune corpus. | gtm-autoresearch-guide |
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  autoagent  │   │autoresearch │   │   gtm-ar    │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       ▼                 ▼                 ▼
┌─────────────────────────────────────┐
│   observability/posthog_client.py   │  ← identical file in all three repos
│   track(event, agent, round_id, …)  │
└─────────────────┬───────────────────┘
                  ▼
            ┌───────────┐
            │  PostHog  │
            └─────┬─────┘
                  ▼
funnels · trends · cohorts · experiments · annotations
```
```bash
# Python — the autoresearch orchestrator
pip install posthog

# Node — web dashboards, CF Workers
npm install posthog-node
npm install posthog-js   # browser only
```
```ini
# .env — shared across all three projects
POSTHOG_API_KEY=phc_xxx...
POSTHOG_HOST=https://us.i.posthog.com
POSTHOG_PROJECT_ID=12345
AUTORESEARCH_PROJECT=gtm-autoresearch   # or autoagent | autoresearch
AUTORESEARCH_CLIENT=acme-corp           # optional
```
Identical file in all three repos at `observability/posthog_client.py`:
```python
import os
import time

from posthog import Posthog

ph = Posthog(
    os.environ["POSTHOG_API_KEY"],
    host=os.environ.get("POSTHOG_HOST", "https://us.i.posthog.com"),
)

PROJECT = os.environ["AUTORESEARCH_PROJECT"]
CLIENT = os.environ.get("AUTORESEARCH_CLIENT", "internal")


def agent_id(name: str) -> str:
    return f"{PROJECT}:{CLIENT}:{name}"


def track(event, *, agent, round_id=None, **props):
    ph.capture(
        distinct_id=agent_id(agent),
        event=event,
        properties={
            "project": PROJECT,
            "client": CLIENT,
            "round_id": round_id,
            "ts": time.time(),
            **props,
        },
    )
```
```python
from observability.posthog_client import track


def run_round(agent, round_id, proposal):
    track("round_started", agent=agent, round_id=round_id,
          model=proposal.model, diff_lines=len(proposal.diff))
    result = apply_and_score(proposal)
    track("round_scored", agent=agent, round_id=round_id,
          score=result.score, **result.dimensions)
    if result.score > baseline:
        track("round_kept", agent=agent, round_id=round_id,
              delta=result.score - baseline, cost_usd=result.cost_usd)
    else:
        track("round_reverted", agent=agent, round_id=round_id,
              reason=result.failure_mode, cost_usd=result.cost_usd)
```
**Privacy.** These events describe the agent's behavior, not end-user behavior. If a client's GTM container carries identifiers, redact them in PostHog's `before_send` hook before they leave your box.
**Nine events.** Every loop emits the same set. Per-project differences live in properties, not event names.
| event | what it means | key properties |
|---|---|---|
| round_started | new round begins, before any model call | model · diff_lines · baseline_score · n_prior_rounds |
| proposal_generated | agent produced a candidate change | prompt_tokens · completion_tokens · cost_usd · latency_ms |
| proposal_applied | diff landed in staging (GTM workspace, branch, etc.) | target · diff_bytes · apply_latency_ms |
| round_scored | scorer returned a verdict — GTM includes 9 dim_* props | score · dim_coverage · dim_correctness · dim_resilience · … |
| round_kept | score beat baseline; change promoted | delta · new_best · cost_usd |
| round_reverted | score missed; change rolled back | failure_mode · cost_usd · regressed_dimensions[] |
| model_escalated | cheap model stalled; bumped tier | from_model · to_model · stall_rounds · reason |
| stop_triggered | max rounds, budget, plateau, kill switch | reason · rounds_completed · total_cost_usd · final_best_score |
| finetune_batch_published | kept rounds rolled into training batch | batch_id · n_examples · r2_key · target_model |
| property | purpose |
|---|---|
| project | separates autoagent · autoresearch · gtm-autoresearch |
| client | multi-tenant filter |
| run_id | ULID per nightly run — group rounds into batches |
| round_id | ULID per round — join events across a round's lifecycle |
| agent | which agent role emitted it (planner · critic · coder · …) |
| git_sha | short SHA of the loop code — correlate score drops to regressions |
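In practice `run_id` and `round_id` come from a ULID library; as a dependency-free illustration of why ULIDs work here, a hypothetical `new_id` helper (not part of any of the three repos) that keeps the time-sortable-plus-unique property might look like:

```python
import os
import time


def new_id() -> str:
    # Stand-in for a real ULID (use a proper ULID package in production):
    # a zero-padded millisecond-timestamp prefix keeps ids lexicographically
    # time-sortable across machines; a random suffix keeps them unique
    # without any coordination between processes.
    return f"{int(time.time() * 1000):013d}-{os.urandom(8).hex()}"


run_id = new_id()    # one per nightly run
round_id = new_id()  # one per round; later doubles as the trace id
```

Because the timestamp prefix is fixed-width, plain string sort equals chronological sort — which is exactly what makes these ids safe join keys across processes.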
Six boards that answer the six questions you'll ask every morning.
| board | insight type | answers |
|---|---|---|
| Round Funnel | Funnel: started → generated → applied → scored → kept | where rounds drop off |
| Score Trend | Line: max(round_scored.score) per day, by project | is each night improving on the last? |
| Dimension Heatmap (GTM) | Stickiness: avg of each dim_* on round_scored | which dims chronically gate the score? |
| Cost / Kept | Formula: sum(cost_usd) / count(round_kept) per day | dollars per kept config — trending? |
| Escalation Chart | Line: model_escalated count by to_model | escalating too eagerly? too late? |
| Stop-Reason Mix | Breakdown: stop_triggered.reason over time | budget-capped · plateau · round cap? |
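The Cost / Kept board is just total spend divided by kept rounds; against a local export of events the same number falls out of a few lines (a sketch — the event dicts mirror the table above, and cost is assumed to be stamped on both terminal events):

```python
def cost_per_kept(events: list[dict]) -> float:
    # Every round ends in round_kept or round_reverted, each carrying
    # cost_usd, so summing over both terminal events captures total spend.
    spend = sum(e.get("cost_usd", 0.0) for e in events
                if e["event"] in ("round_kept", "round_reverted"))
    kept = sum(1 for e in events if e["event"] == "round_kept")
    return spend / kept if kept else float("inf")
```

A night of all-reverted rounds yields `inf` — spend with nothing to show for it, which is the signal the board exists to catch.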
```sql
SELECT
    toDate(timestamp) AS day,
    properties.project AS project,
    max(toFloat(properties.score)) AS best_score,
    count() AS n_rounds
FROM events
WHERE event = 'round_scored'
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY day, project
ORDER BY day DESC
```
Wrap the model client once. Every call becomes a $ai_generation event with prompt, completion, token counts, cost in USD, latency, and any properties you attach — correlated to the parent round_id automatically.
```python
from posthog.ai.anthropic import Anthropic

client = Anthropic(posthog_client=ph)  # drop-in replacement

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=[{"role": "user", "content": proposal_prompt}],
    posthog_distinct_id=agent_id("planner"),
    posthog_properties={
        "round_id": round_id,
        "project": PROJECT,
        "phase": "proposal",         # proposal | critique | score | refine
        "task_type": "gtm.rewrite",  # matches token-machine taxonomy
    },
)
```
| captured property | shape | why it matters |
|---|---|---|
| $ai_input | full prompt messages array | inspect any bad round: what did the planner actually see? |
| $ai_output_choices | completion text(s) | sanity-check refusals, JSON corruption, truncation |
| $ai_input_tokens / $ai_output_tokens | integers | per-request cost math and prompt-bloat detection |
| $ai_total_cost_usd | float | roll up to any cohort (client, task_type, phase) |
| $ai_latency | float seconds | find the p95 offenders — usually long-context prompts |
| $ai_model / $ai_provider | string | compare models apples-to-apples on the same task |
| $ai_is_error | boolean | rate-limit, timeout, and schema-mismatch surface as first-class events |
**Redaction at the boundary.** Pass `posthog_privacy_mode=True` to capture token counts + cost without the prompt text itself. For selective redaction, strip fields in your own wrapper before handing the client to the loop — PostHog never sees what you don't send.
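A minimal sketch of that wrapper-side scrub — the key names here are illustrative, not a list from any of the three repos; extend the set per client:

```python
REDACTED_KEYS = {"email", "user_id", "client_container_id"}  # illustrative


def redact(properties: dict) -> dict:
    # Mask anything that could identify an end user before it is attached
    # to a capture call. PostHog never sees what you don't send.
    return {k: ("[redacted]" if k in REDACTED_KEYS else v)
            for k, v in properties.items()}
```

Run every `properties` dict through `redact()` inside your `track()` helper so no call site can forget it.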
```python
from posthog.ai.openai import OpenAI
from posthog.ai.anthropic import Anthropic
from posthog.ai.gemini import Client as Gemini
from posthog.ai.langchain import CallbackHandler  # for chains/agents
```
All of them emit $ai_generation with the same envelope — so a model swap is a one-line change and the dashboards don't move.
A round isn't just "generate → score." It's a sequence of steps the agent takes: plan, tool call, observation, reflection, retry. Each step is a log event. When you can replay a round step-by-step in PostHog's activity view, debugging becomes a scroll instead of a stare at a stdout dump.
| event | when fired | key properties |
|---|---|---|
| step.plan | agent decides what to do next | plan · next_action · confidence |
| step.tool_called | tool / function invocation | tool · args_json · timeout_ms |
| step.tool_returned | tool result back | tool · ok · latency_ms · result_bytes |
| step.observation | external state sampled (metric, log tail) | source · value · delta_from_last |
| step.reflection | agent critiques own output | verdict · issues[] · retry |
| step.retry | agent retries after failure | cause · attempt · prior_error |
```python
import json
import time

from observability.posthog_client import track

# ulid() and TOOLS are loop-level helpers: a ULID generator and the
# tool-name → callable registry.


def step_plan(agent, round_id, plan, next_action):
    track("step.plan", agent=agent, round_id=round_id,
          step_id=ulid(), plan=plan[:500], next_action=next_action)


def step_tool(agent, round_id, tool, args, timeout_ms):
    step_id = ulid()
    track("step.tool_called", agent=agent, round_id=round_id,
          step_id=step_id, tool=tool, args_json=json.dumps(args)[:2000],
          timeout_ms=timeout_ms)
    t0 = time.time()
    try:
        result = TOOLS[tool](**args)
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=True,
              latency_ms=int((time.time() - t0) * 1000),
              result_bytes=len(str(result)))
        return result
    except Exception as e:
        track("step.tool_returned", agent=agent, round_id=round_id,
              step_id=step_id, tool=tool, ok=False, error=str(e)[:400],
              latency_ms=int((time.time() - t0) * 1000))
        raise
```
**Step ULIDs matter.** Every step gets a `step_id` ULID. Pair `step.tool_called` with its matching `step.tool_returned` via that id — otherwise a retried or re-ordered tool call will corrupt your funnel.
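The pairing itself is a small exercise; a hedged sketch of a `pair_tool_events` helper (hypothetical name, not part of the repos) that joins calls to returns by `step_id`, tolerant of out-of-order arrival:

```python
def pair_tool_events(events: list[dict]) -> list[tuple[dict, dict]]:
    # Index pending calls by step_id; a return closes exactly one call,
    # so retries and interleaved calls can never cross-match.
    calls: dict[str, dict] = {}
    pairs: list[tuple[dict, dict]] = []
    for e in events:
        if e["event"] == "step.tool_called":
            calls[e["step_id"]] = e
        elif e["event"] == "step.tool_returned":
            call = calls.pop(e["step_id"], None)
            if call is not None:
                pairs.append((call, e))
    return pairs  # anything left in `calls` never returned — worth alerting on
```

Matching by tool name alone would merge a failed attempt with its retry's result; matching by `step_id` cannot.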
The GTM 9-dimension scorer (and whatever AutoAgent/AutoResearch grow into) is itself an LLM-powered program. If you only capture the final score, you lose the why — and the why is what you use to improve the scorer.
| event | meaning | properties |
|---|---|---|
| eval.run | a full eval suite started on a candidate | suite · n_cases · scorer_version |
| eval.case | one case within the suite — per-dimension detail | case_id · dim · score · rationale · passed |
| eval.regression | a previously-passing case now fails | case_id · dim · prior_score · new_score · delta |
```python
def run_eval_suite(agent, round_id, candidate):
    suite_id = ulid()
    track("eval.run", agent=agent, round_id=round_id,
          suite_id=suite_id, suite="gtm-9dim", scorer_version="v2.3",
          n_cases=9)
    scores = {}
    for dim in GTM_DIMENSIONS:
        verdict = llm_judge(candidate, rubric=dim.rubric)  # also traced as $ai_generation
        scores[dim.key] = verdict.score
        track("eval.case", agent=agent, round_id=round_id,
              suite_id=suite_id, case_id=dim.key, dim=dim.key,
              score=verdict.score, rationale=verdict.rationale[:800],
              passed=verdict.score >= dim.threshold)
        if dim.key in prior_best and verdict.score < prior_best[dim.key] - 0.05:
            track("eval.regression", agent=agent, round_id=round_id,
                  case_id=dim.key, dim=dim.key,
                  prior_score=prior_best[dim.key], new_score=verdict.score,
                  delta=verdict.score - prior_best[dim.key])
    return scores
```
What the eval events buy you:

- Break down `eval.case` by dim: which dimension is lowest on average, by client?
- Roll out `scorer_version=v2.4` to half of rounds via feature flag; compare `round_scored.score` distributions.
- Alert on `eval.regression` count > 0 per run. Page yourself when a prompt tweak breaks a previously-passing case.
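The 50 % rollout works because the flag service hashes each id to a stable bucket. For local reasoning about that split, a sketch of the assignment logic — this mirrors, but does not reproduce, PostHog's percentage-rollout hash, and the flag/variant names are illustrative:

```python
import hashlib


def scorer_variant(distinct_id: str, flag: str = "scorer_v2_4",
                   rollout: float = 0.5) -> str:
    # Stable hash of (flag, id) → the same agent always lands in the same
    # bucket across rounds and restarts, so score distributions stay clean.
    h = int(hashlib.sha256(f"{flag}:{distinct_id}".encode()).hexdigest(), 16)
    return "v2.4" if (h % 10_000) / 10_000 < rollout else "v2.3"
```

Stamping the returned variant onto `round_scored` as a property is what lets you compare the two distributions afterwards.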
A trace is the story of a round from round_started to round_kept (or round_reverted), with every sub-step and every LLM call threaded under it. PostHog's LLM obs auto-creates traces when events share a $ai_trace_id; we use round_id as that trace id so everything in a round lives under one timeline.
```
round_id = 01HXYZ...                  ←── trace root
│
├─ round_started                  (t=0ms)
├─ step.plan                      (t=12ms)
├─ $ai_generation planner         (t=20ms, 842ms, $0.004)
├─ step.tool_called apply_diff    (t=880ms)
├─ step.tool_returned apply_diff  (t=1.2s)
├─ eval.run                       (t=1.2s, suite=gtm-9dim)
│  ├─ $ai_generation judge:dim0   (t=1.3s, 410ms)
│  ├─ eval.case dim0              (t=1.7s, passed)
│  ├─ $ai_generation judge:dim1   (t=1.7s, 395ms)
│  └─ eval.case dim1              (t=2.1s, passed)
├─ round_scored                   (t=8.4s, score=0.78)
└─ round_kept                     (t=8.5s, delta=+0.06, cost=$0.042)
```
```python
# Set once per round — every generation + event inside inherits it.
ph.set_context({"$ai_trace_id": round_id})

# or pass explicitly on each LLM call
client.messages.create(
    model="claude-sonnet-4-5",
    messages=[...],
    posthog_distinct_id=agent_id("judge"),
    posthog_properties={"$ai_trace_id": round_id, "$ai_span_id": "judge.dim0"},
)
```
In PostHog's LLM Observability → Traces view, click the trace for round_id 01HXYZ... and you see the full tree above, with every prompt, completion, tool call, and eval-case side-by-side, ordered by timestamp. That is what replayability looks like.
**Trace durability.** A trace works even if events arrive out of order or across multiple processes — the orchestrator on claw, the worker on mbp, and the browser on jordan can all stamp the same `$ai_trace_id` and PostHog stitches them. Use ULIDs for `round_id`: time-sortable, unique across machines, no coordination needed.
Token Machine sits in front of every model call the team makes, routes by task-type, grades users, and already emits three PostHog events of its own. The autoresearch loops push into the same PostHog project, so Token Machine's efficiency view and the autoresearch dashboards correlate through shared properties.
```javascript
posthog.capture('token_machine.request', {
  user_id, task_type, model_used,
  input_tokens, output_tokens, cost_usd,
  quality_score, efficiency_grade, latency_ms,
})

posthog.capture('token_machine.team_summary', {
  total_requests, total_cost, avg_quality,
  worst_performer, best_performer,
  escalation_candidates: ['task_type_a', 'task_type_b'],
})

posthog.capture('token_machine.anomaly', {
  type, user_id, task_type, recommendation,
})
```
| property | autoresearch uses | token-machine uses |
|---|---|---|
| task_type | gtm.rewrite · ar.propose · aa.refine | same taxonomy — don't fork |
| user_id / distinct_id | {project}:{client}:{agent} | same id — agents grade the same as humans |
| model_used / $ai_model | PostHog LLM obs auto-populates $ai_model | TM writes model_used; alias them in a dashboard |
| cost_usd / $ai_total_cost_usd | from $ai_generation | from token_machine.request |
| quality_score | from eval.run avg | from TM's grader — feed TM's score back as a property on round_scored |
Boards worth building on the shared properties:

- `quality_score` broken down by `task_type` × `model_used`, across both `token_machine.request` and `round_scored`. Tells you which local OpenClaw endpoint is genuinely replacing Claude and which is faking it.
- `sum(cost_usd)` by `task_type`, with an annotation every time TM publishes a new routing rule. Is the autoresearch-driven routing cutting cost over time?
- Correlate `token_machine.anomaly` with `type='wrong_model'` to `round_reverted.failure_mode` within the same hour. Does a TM misrouting cause downstream revert cascades?

**Don't double-count cost.** PostHog LLM obs and Token Machine both record `cost_usd` for the same call. Pick one source of truth per board and filter the other out — usually TM for routed calls, LLM obs for direct calls.
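One way to enforce that single source of truth when reconciling an export offline — a sketch assuming a shared `request_id` join key (hypothetical; neither event set is documented here as carrying one), preferring TM for any call it routed:

```python
def total_cost(events: list[dict]) -> float:
    # Token Machine wins for every request it recorded; LLM obs only
    # contributes cost for direct calls TM never saw.
    tm = {e["request_id"]: e["cost_usd"] for e in events
          if e["event"] == "token_machine.request"}
    direct = sum(e["cost_usd"] for e in events
                 if e["event"] == "$ai_generation"
                 and e.get("request_id") not in tm)
    return sum(tm.values()) + direct
```

The same precedence rule, expressed as a dashboard filter, keeps every board honest without editing either event stream.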
Flags turn autoresearch behaviors into controllable knobs. Which scorer version tonight? Escalate after 3 stalls or 5? Meta-experiments on the experimenter — perfect for flags that the loop reads at round start.
```python
use_v2 = ph.feature_enabled("scorer_v2", distinct_id=agent_id("planner"))
if use_v2:
    score = score_v2(config)
else:
    score = score_v1(config)

track("round_scored", agent="planner", round_id=round_id,
      score=score, scorer_variant="v2" if use_v2 else "v1")
```
Launch a formal experiment with scorer_v2 as the flag and max(round_scored.score) as the metric. After a week PostHog tells you which variant won with confidence intervals — same stats engine SaaS products use to pick a pricing page, applied to which scorer your agent uses.
Flags worth having from day one:

- `scorer_v2` — rubric upgrade
- `aggressive_escalation` — escalate after 2 stalls instead of 5
- `diff_size_cap_32` — reject proposals > 32 lines
- `nightly_budget_20usd` — hard spend cap per run
- `gtm_dim_weights_v3` — new dimension weighting

Every deploy, prompt rewrite, and config change gets a vertical line on every chart. Wire a git post-commit hook to PostHog's annotations API and "did that prompt rewrite on Thursday hurt the score?" becomes obvious at a glance.
```bash
# post-commit hook → PostHog annotation
curl -X POST "https://us.i.posthog.com/api/projects/$POSTHOG_PROJECT_ID/annotations/" \
  -H "Authorization: Bearer $POSTHOG_PERSONAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "planner prompt: tightened role spec, added scoring rubric in-context",
    "date_marker": "2026-04-19T03:00:00Z",
    "scope": "project"
  }'
```
- One PostHog project for all three loops; the `project` property separates them.
- Copy `observability/posthog_client.py` into each repo. Identical file; zero drift.

**First-week win.** Before anything fancy, just the Round Funnel + Score Trend boards will tell you (a) which loop step is leaking rounds and (b) whether last night improved on the night before. That alone is worth the 30 minutes.
- Never send `distinct_id` raw. Prefix with `{project}:{client}:` or every client's "planner" merges into one PostHog user.
- Send numbers as numbers: `cost_usd=0.042`, not `"$0.042"`.
- Call `ph.shutdown()` at the end of a nightly run — otherwise queued events evaporate on process exit.
- Send each `dim_*` as its own float property so breakdowns work.