Numbers you can run yourself.
Every Hipp0 benchmark is reproducible, versioned, and shipped with the raw outputs. No vendor-tuned demos, no cherry-picked datasets.
External Benchmark Roadmap
Five priority benchmarks covering long-term memory, multi-hop reasoning, long-context retrieval, and grounded answer quality.
LongMemEval
Running
Long-horizon memory evaluation across 500 sessions and 5 task types.
500 user sessions × 5 task types (single-hop, multi-hop, temporal, contradiction, abstention).
The most rigorous public test of long-term memory retrieval for agents. Tests whether a system actually remembers — or just retrieves.
BEIR
Planned
A heterogeneous retrieval benchmark covering 18 datasets across 9 retrieval task types.
18 datasets: MS MARCO, TREC-COVID, NFCorpus, FiQA, ArguAna, SCIDOCS, and more.
Industry-standard IR benchmark. Measures generalization across domains, query types, and corpus sizes.
HotpotQA
Planned
Multi-hop reasoning over Wikipedia with supporting-fact annotations.
113k question-answer pairs requiring reasoning over multiple documents.
Tests whether Hipp0's decision graph can chain multi-hop inferences the way a real agent team does.
RULER
Planned
Stress test for long-context retrieval at 4k → 128k tokens.
13 synthetic tasks measuring needle-in-a-haystack recall at increasing context lengths.
Agents work across huge histories. RULER measures whether Hipp0's compile stays accurate as corpora grow.
CRAG
Planned
Comprehensive RAG benchmark with 4,409 factual questions across 5 domains.
4.4k questions spanning finance, sports, music, movies, and open domains.
Tests grounded answer quality, not just retrieval. Matches how Hipp0's compiled context is actually used.
LongMemEval, end to end.
Hipp0 ships a full LongMemEval harness: loader, ingester, runner, scorer, and CLI. Clone the repo, bring your own API key, and reproduce every number we publish.
$ pnpm tsx benchmarks/external/longmemeval/cli.ts
Loading 500 sessions...
✔ Corpus loaded (500 sessions, 12,847 turns)
✔ Decisions ingested (2,341 decisions)
✔ Running 5 task types...
[single-hop] 410/500
[multi-hop] 384/500
[temporal] 395/500
[contradiction] 462/500
[abstention] 448/500
Writing results to ./benchmark-output/
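The per-task counts above reduce to accuracy scores in the obvious way. Here is a minimal TypeScript sketch of that reduction — the `TaskResult` shape and field names are illustrative assumptions, not the harness's real output schema:

```typescript
// Hypothetical shape of one task's result — illustrative only,
// not the actual schema written to ./benchmark-output/.
interface TaskResult {
  task: string;
  correct: number;
  total: number;
}

// Compute per-task accuracy plus a pooled overall accuracy.
function summarize(results: TaskResult[]): Record<string, number> {
  const summary: Record<string, number> = {};
  let correct = 0;
  let total = 0;
  for (const r of results) {
    summary[r.task] = r.correct / r.total;
    correct += r.correct;
    total += r.total;
  }
  summary["overall"] = correct / total;
  return summary;
}

// The counts printed by the run above.
const run: TaskResult[] = [
  { task: "single-hop", correct: 410, total: 500 },
  { task: "multi-hop", correct: 384, total: 500 },
  { task: "temporal", correct: 395, total: 500 },
  { task: "contradiction", correct: 462, total: 500 },
  { task: "abstention", correct: 448, total: 500 },
];

console.log(summarize(run)); // single-hop 0.82, ..., overall 2099/2500 ≈ 0.84
```

Pooling all 2,500 questions into one overall number hides the per-task spread, which is why the CLI reports each task type separately.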
Methodology
Four rules we hold ourselves to. Every single benchmark on this page follows all four.
Reproducibility first
Every benchmark ships as runnable code under benchmarks/external/ in the open-source repo. Clone, run, verify. No cherry-picking.
Same hardware, same LLM
All numbers are produced on a single VPS (8 vCPU, 32GB RAM) against a single configured LLM backend. No distributed scaling games.
Published raw outputs
We publish the raw score outputs, prompts, and retrieval traces alongside the final numbers. You can audit every claim.
Adversarial baselines
We pit Hipp0 against well-tuned baselines (BM25, Contriever, E5, naive RAG), not strawmen. Wins earned the hard way.
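For a sense of what a "well-tuned baseline" means here, this is a minimal BM25 scorer in TypeScript — a sketch of the kind of lexical baseline involved, using the common Okapi defaults (k1 = 1.2, b = 0.75). It is not the harness's actual baseline code:

```typescript
// Lowercase alphanumeric tokenizer — deliberately simple for the sketch.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

class BM25 {
  private docs: string[][];
  private avgLen: number;
  private df = new Map<string, number>(); // document frequency per term

  constructor(corpus: string[], private k1 = 1.2, private b = 0.75) {
    this.docs = corpus.map(tokenize);
    this.avgLen = this.docs.reduce((s, d) => s + d.length, 0) / this.docs.length;
    for (const doc of this.docs) {
      for (const term of new Set(doc)) {
        this.df.set(term, (this.df.get(term) ?? 0) + 1);
      }
    }
  }

  // Okapi BM25 score of one document for a query.
  score(query: string, docIndex: number): number {
    const doc = this.docs[docIndex];
    const tf = new Map<string, number>();
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);
    let total = 0;
    for (const term of tokenize(query)) {
      const n = this.df.get(term) ?? 0;
      if (n === 0) continue;
      const idf = Math.log(1 + (this.docs.length - n + 0.5) / (n + 0.5));
      const f = tf.get(term) ?? 0;
      total += (idf * f * (this.k1 + 1)) /
        (f + this.k1 * (1 - this.b + (this.b * doc.length) / this.avgLen));
    }
    return total;
  }

  // Document indices, best match first.
  rank(query: string): number[] {
    return this.docs
      .map((_, i) => i)
      .sort((a, b) => this.score(query, b) - this.score(query, a));
  }
}

const index = new BM25([
  "grounded answer quality",
  "long-context retrieval at scale",
  "multi-hop reasoning over documents",
]);
console.log(index.rank("retrieval at long context")[0]); // → 1 (the long-context doc)
```

A dense retriever only earns a win if it beats this kind of tuned lexical scorer on the same corpus, which is the point of the rule.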
Don't trust us. Run it.
Every benchmark on this page is runnable from the Hipp0 repo. Bring your own API key and verify every number.