Open Benchmarks

Numbers you can run yourself.

Every Hipp0 benchmark is reproducible, versioned, and shipped with the raw outputs. No vendor-tuned demos, no cherry-picked datasets.

78% Recall@5 (+39% over naive RAG; internal eval, 500-decision corpus)
0.92 Contradiction F1 (detection accuracy on 2k labeled decision pairs)
20–33× Token Compression (H0C Ultra format vs. raw JSON context)
25ms P95 Latency (compile at 500 decisions; single-node PostgreSQL + pgvector)
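Each headline number above has a precise definition. A minimal sketch of how you could recompute them from raw outputs — the function names and input shapes here are illustrative, not Hipp0's actual API:

```typescript
// Illustrative metric helpers; names and input shapes are assumptions,
// not Hipp0's real API. They show how each headline number is defined.

// Recall@k: fraction of queries whose gold decision appears in the top-k results.
function recallAtK(results: { retrieved: string[]; gold: string }[], k: number): number {
  const hits = results.filter(r => r.retrieved.slice(0, k).includes(r.gold)).length;
  return hits / results.length;
}

// F1 for contradiction detection, from true/false positives and false negatives
// over labeled decision pairs.
function f1(tp: number, fp: number, fn: number): number {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

// Compression ratio: raw-JSON context tokens divided by compiled-context tokens.
function compressionRatio(rawTokens: number, compiledTokens: number): number {
  return rawTokens / compiledTokens;
}

// P95 latency from a list of per-compile timings in milliseconds.
function p95(latenciesMs: number[]): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return sorted[Math.ceil(0.95 * sorted.length) - 1];
}
```

The published raw outputs contain the inputs these definitions need, so every card above can be recomputed independently.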

External Benchmark Roadmap

Five priority benchmarks covering long-term memory, heterogeneous retrieval, multi-hop reasoning, long-context retrieval, and grounded answer quality.

LongMemEval

Running

Long-horizon memory evaluation across 500 sessions and 5 task types.

Source
What it tests

500 user sessions × 5 task types (single-hop, multi-hop, temporal, contradiction, abstention).

Why it matters

The most rigorous public test of long-term memory retrieval for agents. Tests whether a system actually remembers — or just retrieves.

BEIR

Planned

A heterogeneous retrieval benchmark covering 18 datasets across 9 domains.

Source
What it tests

18 datasets: MS MARCO, TREC-COVID, NFCorpus, FiQA, ArguAna, SCIDOCS, and more.

Why it matters

Industry-standard IR benchmark. Measures generalization across domains, query types, and corpus sizes.

HotpotQA

Planned

Multi-hop reasoning over Wikipedia with supporting-fact annotations.

Source
What it tests

113k question-answer pairs requiring reasoning over multiple documents.

Why it matters

Tests whether Hipp0's decision graph can chain multi-hop inferences the way a real agent team does.

RULER

Planned

Stress test for long-context retrieval at 4k → 128k tokens.

Source
What it tests

13 synthetic tasks measuring needle-in-a-haystack recall at increasing context lengths.

Why it matters

Agents work across huge histories. RULER measures whether Hipp0's compile stays accurate as corpora grow.

CRAG

Planned

Comprehensive RAG benchmark with 4,409 factual questions across 5 domains.

Source
What it tests

4.4k questions spanning finance, sports, music, movies, and open domains.

Why it matters

Tests grounded answer quality, not just retrieval. Matches how Hipp0's compiled context is actually used.

First benchmark in flight

LongMemEval, end to end.

Hipp0 ships a full LongMemEval harness: loader, ingester, runner, scorer, and CLI. Clone the repo, bring your own API key, and reproduce every number we publish.

longmemeval harness

$ pnpm tsx benchmarks/external/longmemeval/cli.ts
Loading 500 sessions...
Corpus loaded (500 sessions, 12,847 turns)
Decisions ingested (2,341 decisions)
Running 5 task types...
  [single-hop]     410/500
  [multi-hop]      384/500
  [temporal]       395/500
  [contradiction]  462/500
  [abstention]     448/500
Writing results to ./benchmark-output/
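The per-task tallies the scorer prints reduce to plain accuracies. A sketch of that reduction — the `TaskResult` type here is illustrative, not the harness's actual output schema:

```typescript
// Illustrative shape for the scorer's per-task tallies; the real harness
// may use a different schema. Converts correct/total counts to accuracy.
type TaskResult = { task: string; correct: number; total: number };

function summarize(results: TaskResult[]): (TaskResult & { accuracy: number })[] {
  return results.map(r => ({ ...r, accuracy: r.correct / r.total }));
}
```

Applied to the run above, `[contradiction] 462/500` becomes an accuracy of 0.924, and so on for each task type.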

Methodology

Four rules we hold ourselves to. Every single benchmark on this page follows all four.

01

Reproducibility first

Every benchmark ships as runnable code under benchmarks/external/ in the open-source repo. Clone, run, verify. No cherry-picking.

02

Same hardware, same LLM

All numbers are produced on a single VPS (8 vCPU, 32GB RAM) against a single configured LLM backend. No distributed scaling games.

03

Published raw outputs

We publish the raw score outputs, prompts, and retrieval traces alongside the final numbers. You can audit every claim.

04

Adversarial baselines

We pit Hipp0 against well-tuned baselines (BM25, Contriever, E5, naive RAG), not strawmen. Wins earned the hard way.
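For reference, the lexical baseline named above scores documents with the standard Okapi BM25 formula. A minimal sketch with the usual default parameters — this is the textbook scorer, not Hipp0's tuned baseline configuration:

```typescript
// Minimal Okapi BM25 scorer: the classic lexical baseline, with the
// common defaults k1 = 1.2 and b = 0.75. Tokenization and corpus
// statistics are assumed to be computed elsewhere.
function bm25Score(
  queryTerms: string[],
  doc: string[],                 // tokenized document
  docFreq: Map<string, number>,  // number of docs containing each term
  numDocs: number,
  avgDocLen: number,
  k1 = 1.2,
  b = 0.75,
): number {
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.filter(t => t === term).length; // term frequency in this doc
    if (tf === 0) continue;
    const df = docFreq.get(term) ?? 0;
    const idf = Math.log(1 + (numDocs - df + 0.5) / (df + 0.5));
    score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc.length / avgDocLen)));
  }
  return score;
}
```

Beating a properly tuned version of this scorer across domains is a meaningfully harder bar than beating an untuned strawman.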

Don't trust us. Run it.

Every benchmark on this page is runnable from the Hipp0 repo. Bring your own API key and verify every number.