Karpathy-style LLM Wikis for the Hermes Agent — persistent, compounding knowledge bases curated by AI agents

Quality Audit & Improvement Roadmap

TL;DR — Hermes Wiki’s deterministic core (pipeline, projection, lint) is the strongest part of the system: well-tested, rebuildable, and attributed. The quality bottleneck is agent output — the pages agents actually write — which today is entirely unmeasured. The single highest-leverage investment is an eval harness that judges agent-produced wikis (golden corpus + retrieval relevance sets + LLM-judge rubrics), paired with a skill upgrade that gives agents an explicit synthesis/dedup/contradiction protocol.

Scope & method: audited at commit 0f2d1d4 by static code review of the pipeline, lint, search, projection, skills, fixtures, and CI — not runtime profiling. Four dimensions: content quality, retrieval quality, structural integrity, test/CI infrastructure. Every finding carries a stable ID (CQ-*, RQ-*, SI-*, TI-*) referenced by the roadmap.


Executive Summary

| Dimension | Maturity | Headline gap |
|---|---|---|
| Content quality | Weak | No measurement of agent-written pages; naive DefaultProcessor heuristics; skills silent on synthesis fidelity, dedup, and contradictions |
| Retrieval quality | Adequate, unmeasured | Solid FTS5/BM25 with identifier normalization, but zero relevance evals before the planned embedding ranker lands |
| Structural integrity | Strong | 18 lint checks + health score + drift detection; gap is no trend tracking and no link-graph metrics |
| Test/CI infrastructure | Strong unit, absent elsewhere | 236 tests, but no coverage gate, no golden/snapshot tests, no evals in CI, no perf benchmarks |

North star: the core pipeline (hermes_wiki/pipeline.py) is deterministic — no LLM calls. Generated-wiki quality is therefore driven almost entirely by agents following the wiki-ingestion/wiki-writing skills. Improving wiki quality means (1) measuring agent output and (2) raising the ceiling of what the skills ask agents to do. Core-code hardening is a distant third — it is already the best-covered part of the system.


Audit Findings

Dimension 1 — Content Quality

Strengths

Gaps

Target state

Dimension 2 — Retrieval Quality

Strengths

Gaps

Target state

Dimension 3 — Structural Integrity

Strengths

Gaps

Target state

Dimension 4 — Test/CI Infrastructure

Strengths

Gaps

Target state


Eval Harness Architecture

This is the centerpiece recommendation. Because the core has no LLM, evals must judge agent output, not core functions. That splits the harness into two families with different execution models:

| Family | Examples | Determinism | Where it runs |
|---|---|---|---|
| Deterministic evals | golden structure, retrieval qrels, graph metrics | Fully reproducible | CI, gated (pytest -m eval) |
| LLM-judge content evals | faithfulness, citation accuracy, dedup, contradiction | Judge-model variance | Scheduled + on-demand, never PR-blocking (pytest -m eval_llm) |

Default execution mode: transcript replay. Agent ingestion/writing runs are recorded once and committed as fixtures; evals score the recorded outputs. This makes content evals deterministic and cheap for regression purposes. A separate --live mode regenerates transcripts on a schedule to catch skill/model drift.
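
A minimal sketch of replay mode, under stated assumptions: the transcript is a JSON-like object holding a list of tool calls and a recorded manifest hash. The field names (`tool_calls`, `tool`, `path`, `content`, `manifest_hash`), the `write_page` tool name, and the hashing scheme are all hypothetical illustrations, not the harness's actual format.

```python
import hashlib


def manifest_hash(pages: dict[str, str]) -> str:
    """Hash a wiki as sorted (path, content) pairs so dict ordering cannot change the digest."""
    digest = hashlib.sha256()
    for path in sorted(pages):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(pages[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


def replay(transcript: dict) -> dict[str, str]:
    """Rebuild the wiki's pages by re-applying recorded write calls in order."""
    pages: dict[str, str] = {}
    for call in transcript["tool_calls"]:
        if call["tool"] == "write_page":  # hypothetical tool name
            pages[call["path"]] = call["content"]
    return pages


def replay_and_verify(transcript: dict) -> dict[str, str]:
    """Replay, then fail loudly if the result differs from what was recorded."""
    pages = replay(transcript)
    if manifest_hash(pages) != transcript["manifest_hash"]:
        raise ValueError("replayed wiki does not match the recorded manifest hash")
    return pages
```

Verifying the hash before scoring means a stale or hand-edited transcript fails fast instead of silently producing misleading eval numbers.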

Directory layout

evals/
  README.md                      # how to run, how to add cases
  conftest.py                    # registers markers: eval, eval_llm
  corpus/                        # golden corpus: input sources + expected outcomes
    agent-memory/
      sources/                   # raw source files (same style as fixtures/sources/)
      expected_structure.yaml    # expected pages: ids, types, citations, cross-links
      relevance.yaml             # qrels: query -> ranked expected page ids
    multi-source-contradiction/  # designed to exercise contradiction handling
      sources/
      expected_structure.yaml
      relevance.yaml
  transcripts/                   # committed recorded agent runs
    agent-memory.transcript.json # tool calls + final wiki manifest hash
  rubrics/
    faithfulness.md              # claim decomposition: supported-claim fraction vs cited sources
    citation_accuracy.md         # every claim cites a real source page; no fabricated cites
    dedup.md                     # no near-duplicates; "2+ sources or central" threshold honored
    contradiction.md             # conflicting sources surfaced with dates, not silently merged
  harness/
    runner.py                    # load case -> run/replay -> score -> store
    structural.py                # compare generated wiki to expected_structure.yaml
    retrieval.py                 # precision@k, recall@k, MRR, nDCG over relevance.yaml
    graph.py                     # link-graph + index-coverage metrics over a wiki
    judge.py                     # LLM-judge wrapper (rubric -> score + rationale)
    replay.py                    # rebuild wiki from a recorded transcript
    scoring.py                   # metric dataclasses, aggregation, thresholds
    store.py                     # append results to evals/results/ (JSONL, git-tracked)
  results/                       # {date, case, metric, value, commit} baselines
  test_structural_evals.py       # @pytest.mark.eval     — deterministic, CI-gated
  test_retrieval_evals.py        # @pytest.mark.eval     — deterministic, CI-gated
  test_graph_evals.py            # @pytest.mark.eval     — deterministic, CI-gated
  test_content_evals.py          # @pytest.mark.eval_llm — scheduled only
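
The marker registration that the layout assigns to conftest.py could look roughly like the sketch below. `pytest_configure` and `config.addinivalue_line` are standard pytest hooks; the marker descriptions are illustrative wording, not taken from the repository.

```python
# evals/conftest.py (sketch)

# Marker name -> description shown by `pytest --markers`. Descriptions are illustrative.
MARKERS = {
    "eval": "deterministic eval; reproducible and safe to gate in CI",
    "eval_llm": "LLM-judge eval; scheduled or on-demand, never PR-blocking",
}


def pytest_configure(config):
    """Register custom markers so `pytest -m eval` selects them without unknown-marker warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With this in place, `pytest -m eval` runs only the deterministic family and `pytest -m eval_llm` only the judge-backed one, which keeps the PR gate free of judge-model variance.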

Content-quality evals

Golden-structure eval (deterministic, CI-safe). For each corpus case, run the current generation path (DefaultProcessor today; agent transcript replay later) over sources/, then assert against expected_structure.yaml:

# expected_structure.yaml
pages:
  - id: concepts/agent-memory
    type: concept
    must_cite: [sources/2026-06-06-agent-memory-article]
    must_link: [entities/hermes]
forbidden:
  duplicate_titles: true
min_health_score: 0.9

This is the cheapest, highest-signal content eval and needs no LLM. It directly pins CQ-1/CQ-2 behavior once DefaultProcessor is enriched.
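
The comparison itself can be sketched as a pure function over already-parsed data. The page shape used here (a dict with `type`, `title`, `sources`, and `links` keys) and the function name are assumptions for illustration; the real structural.py may model pages differently.

```python
from collections import Counter


def check_structure(pages: dict[str, dict], expected: dict, health_score: float) -> list[str]:
    """Return human-readable violations; an empty list means the case passes."""
    violations = []
    for spec in expected.get("pages", []):
        page = pages.get(spec["id"])
        if page is None:
            violations.append(f"missing page: {spec['id']}")
            continue
        if "type" in spec and page.get("type") != spec["type"]:
            violations.append(f"{spec['id']}: type {page.get('type')!r}, expected {spec['type']!r}")
        for source in spec.get("must_cite", []):
            if source not in page.get("sources", []):
                violations.append(f"{spec['id']}: missing citation {source}")
        for target in spec.get("must_link", []):
            if target not in page.get("links", []):
                violations.append(f"{spec['id']}: missing link {target}")

    if expected.get("forbidden", {}).get("duplicate_titles"):
        # Titles are compared case-insensitively; that normalization is an assumption.
        titles = Counter(p.get("title", "").strip().lower() for p in pages.values())
        for title, count in titles.items():
            if title and count > 1:
                violations.append(f"duplicate title: {title!r} ({count} pages)")

    floor = expected.get("min_health_score")
    if floor is not None and health_score < floor:
        violations.append(f"health score {health_score} below floor {floor}")
    return violations
```

Returning a list of violations rather than a boolean lets a pytest assertion print every mismatch in one failure message.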

LLM-judge evals (scheduled). judge.py sends (source text + generated page body + rubric) to a pinned judge model at temperature 0 and parses a structured verdict: {score: 1-5, pass: bool, rationale: str, violations: [...]}. The faithfulness rubric uses claim decomposition (FActScore/RAGAS style — see Prior Art): split the page into atomic claims, verify each against its cited sources, score the supported fraction; this outperforms holistic 1–5 scoring and yields per-claim violations to act on. Rubrics live as markdown in rubrics/ so changes are PR-reviewable. Results (including model id and rationale) append to results/ for diffing across runs. A --dry-run flag validates case/rubric wiring without API calls so the suite’s plumbing is testable in plain CI. Eval-case files in corpus/ adopt Anthropic’s Agent Skills JSON case shape for interoperability.
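
A sketch of the two deterministic pieces around the judge call: validating the structured verdict and computing the supported-claim fraction. The judge request itself is omitted, and the `Verdict` dataclass, the function names, and the choice to reject claim-free pages are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    score: int
    passed: bool
    rationale: str
    violations: list[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, rejecting malformed verdicts instead of guessing."""
    data = json.loads(raw)
    score = data["score"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(data["pass"], bool):
        raise ValueError("pass must be a JSON boolean")
    return Verdict(
        score=score,
        passed=data["pass"],
        rationale=str(data["rationale"]),
        violations=[str(v) for v in data.get("violations", [])],
    )


def supported_fraction(claims: list[tuple[str, bool]]) -> tuple[float, list[str]]:
    """Given (claim, is_supported) pairs, return the supported fraction and the unsupported claims."""
    if not claims:
        # Assumption: a page with no extractable claims is an error, not a perfect score.
        raise ValueError("no claims to score")
    unsupported = [text for text, ok in claims if not ok]
    return (len(claims) - len(unsupported)) / len(claims), unsupported
```

Strict parsing matters here because a judge reply such as `"pass": "false"` would otherwise be truthy and count as a pass.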

Retrieval evals

retrieval.py loads relevance.yaml, runs search_wiki(query, wiki=<fixture>, limit=k), and computes precision@k, recall@k, MRR, and nDCG — per-query and aggregate. Cases run against the existing populated fixture (fixtures/factory.build_populated_home) plus corpus wikis.

# relevance.yaml
queries:
  - q: "agent memory"
    relevant: [concepts/agent-memory, sources/2026-06-06-agent-memory-article]
  - q: "getCwd"          # identifier-normalization regression case
    relevant: [concepts/get-cwd]
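
The four metrics can be sketched as pure functions over a ranked list of page ids, independent of how the search is invoked. The sketch assumes binary relevance (a page is either in the `relevant` list or not) and uses the standard log2 discount for nDCG; graded relevance would need a different gain term.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant pages that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Aggregate MRR over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

Keeping the metrics free of any search dependency means the same functions can score the current BM25 results and a future embedding ranker without modification.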

Rollout: non-blocking CI report first; once the embedding ranker work begins, gate with a tolerance (fail if aggregate MRR drops more than X% versus the last stored baseline in results/). Capture the BM25 baseline now — it is the safety net for RQ-1/RQ-2.
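
One way the tolerance gate might be expressed, assuming results/ holds JSONL records with the `{date, case, metric, value, commit}` fields listed in the directory layout. The function names are hypothetical and the tolerance is left as a parameter because the roadmap does not fix a value for X.

```python
import json


def latest_baseline(jsonl_text: str, case: str, metric: str):
    """Return the most recent stored value for (case, metric), or None if there is none."""
    value = None
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["case"] == case and record["metric"] == metric:
            value = record["value"]  # later lines win because the file is append-only
    return value


def passes_gate(current: float, baseline, tolerance: float) -> bool:
    """True when `current` has not dropped more than `tolerance` (a fraction) below the baseline."""
    if baseline is None:
        return True  # no baseline captured yet, so there is nothing to regress against
    return current >= baseline * (1.0 - tolerance)
```

For example, with a stored MRR of 0.80 and a tolerance of 0.05, a run scoring 0.77 passes and a run scoring 0.75 fails.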

Structural metrics over time

graph.py builds the link graph from page_links (hermes_wiki/db.py:188) and computes: orphan rate, connected-component count, % pages reachable from index.md, mean out-degree, and dangling-link count. store.py appends health score + graph metrics keyed by {commit, wiki, date}; a small renderer turns the JSONL into a trendline table. This reuses the already-persisted health score (_record_lint_result) rather than inventing new storage.
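
The metrics listed above can be sketched over an in-memory edge list. The sketch assumes links arrive as (source, target) page-id pairs and that the index page has the id `index`; the actual page_links schema and the precise orphan definition in the codebase may differ.

```python
from collections import deque


def _reach(start: str, neighbors: dict[str, set[str]]) -> set[str]:
    """Breadth-first search returning every node reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def graph_metrics(pages: set[str], links: list[tuple[str, str]], index: str = "index") -> dict:
    out_edges = {p: set() for p in pages}
    undirected = {p: set() for p in pages}
    has_inbound = set()
    dangling = 0
    for src, dst in links:
        if src not in pages:
            continue  # links from unknown pages are ignored in this sketch
        if dst not in pages:
            dangling += 1  # target page does not exist
            continue
        out_edges[src].add(dst)
        undirected[src].add(dst)
        undirected[dst].add(src)
        if src != dst:
            has_inbound.add(dst)  # self-links do not rescue a page from orphan status

    components = 0
    unvisited = set(pages)
    while unvisited:
        components += 1
        unvisited -= _reach(next(iter(unvisited)), undirected)

    reachable = _reach(index, out_edges) if index in pages else set()
    orphans = pages - has_inbound - {index}
    total = len(pages)
    return {
        "orphan_rate": len(orphans) / total if total else 0.0,
        "components": components,
        "reachable_from_index": len(reachable) / total if total else 0.0,
        "mean_out_degree": sum(len(t) for t in out_edges.values()) / total if total else 0.0,
        "dangling_links": dangling,
    }
```

Reachability follows link direction while component counting ignores it, so a page that links to the index but is never linked from it still counts as unreachable.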

CLI, pytest, and CI integration


Prior Art & Solution Engineering Notes (research-validated)

A deep-research pass (22 sources fetched, 107 claims extracted, 25 adversarially verified — 23 confirmed, 2 refuted) validated and sharpened the roadmap. Findings that change how the workstreams should be engineered:

Upstream already wrote most of F1 — adapt, don’t invent

The Hermes Agent bundles a research-llm-wiki skill (v2.1.0) that encodes the exact protocols F1 calls for, verbatim:

The upstream convention of a contradictions: frontmatter field maps directly onto the existing contradictions projection column and the unresolved_contested lint check, so adopting it costs nothing schema-wise. F1 should port and adapt this prose into wiki-writing/wiki-ingestion SKILL.md rather than authoring new protocols, keeping local additions (e.g., the faithfulness self-check) clearly separated from upstream-derived rules. Notably, upstream v2.1.0 contains zero eval, retrieval, or testing protocols, so the harness and retrieval workstreams remain genuinely new work.

Skill engineering: follow the host’s template and Anthropic’s evals-first loop

Eval methodology grounding

Integration cautions (refuted claims)

Two plausible-sounding claims about Hermes Agent internals were refuted during verification and must not be assumed:

  1. Plugin discovery via ~/.hermes/plugins/ + .hermes/plugins/ + pip entry points with import-time tool self-registration — not how it works.
  2. Memory providers implement a MemoryProvider ABC with get_tool_schemas()/handle_tool_call()/etc. — not the actual interface.

Any work touching the adapter surface (adapters/hermes/) should be verified against the live hermes-agent source, not docs-derived assumptions.

Skill precedence — verified against hermes-agent 0.16.0 source (2026-06-07). There is no name collision risk: plugin-registered skills live in a separate registry with qualified names (wiki:<name>, hermes_cli/plugins.py:957-1000), while bundled skills are seeded to ~/.hermes/skills/ and tracked via .bundled_manifest. The real risk is an attention asymmetry: bundled/local skills appear in the system prompt’s <available_skills> block and as slash commands (implicit activation), whereas plugin skills are explicit-load only (skill_view("wiki:wiki-writing")) and never appear in the system prompt (tools/skills_tool.py:851-897, agent/prompt_builder.py:1254-1330). Consequences:

  1. If the bundled research-llm-wiki skill is present in a user’s ~/.hermes/skills/, its guidance (including the ^[...] provenance-marker syntax this wiki rejects) activates by default, while this plugin’s per-wiki skills require an explicit load. Upstream guidance wins unless mitigated.
  2. Mitigations: users can disable the bundled skill via skills.disabled: [research-llm-wiki] in Hermes config.yaml (tools/skills_tool.py:546-566); and the wiki prompt injection should instruct agents to load the wiki’s assigned skills (wiki:wiki-writing/wiki:wiki-ingestion per SCHEMA.md) before writing — a small prompt.py enhancement, added to the roadmap as F9.

Feature Recommendations

| ID | Feature | Dimension | Leverage | Effort | Notes |
|---|---|---|---|---|---|
| F1 | Skill upgrade (synthesis/dedup/contradiction protocol): port the upstream research-llm-wiki v2.1.0 protocols into wiki-writing/SKILL.md and wiki-ingestion/SKILL.md (dedup threshold “2+ sources or central”, date-aware contradiction handling with contradictions: frontmatter, per-paragraph provenance markers on 3+-source pages), plus a local faithfulness self-check (every claim traceable to a cited source page). Structure per the Hermes template: rules under Procedure, self-check under Verification; record upstream_skill: research-llm-wiki + upstream_skill_version: 2.1.0 in SKILL.md metadata so upstream drift is reviewable on dependency bumps | Content | High | S | Adapt upstream prose, don’t invent (see Prior Art); addresses CQ-3/CQ-4; what the LLM-judge evals score against |
| F2 | Enrich DefaultProcessor: replace the single-regex entity/concept heuristic (pipeline.py:1503) with a scored signal set (title + body keyword density + source type); upgrade _summary_sentence (pipeline.py:1569) to extract a lead paragraph | Content | Medium | M | Addresses CQ-1/CQ-2; land golden snapshots (T2) first so the change reviews as a diff |
| F3 | Dedup-on-create suggestion: on wiki_create_page/create-page, BM25-search the title and warn when a high-similarity page exists (“did you mean to update X?”) | Content/Structural | High | M | Operationalizes the rule F1 can only state; reuses existing search |
| F4 | Citation-verification lint check: every sources: entry resolves to a real source page; stretch: claims adjacent to citations are non-empty | Content/Structural | High | S | Complements existing missing_citation (lint.py:367) |
| F5 | Taxonomy enforcement via Phase 2 hooks: implement the already-designed validate_tags/suggest_tags hooks (hooks architecture) so invalid tags are caught at write time, not only by lint | Structural | Medium | M | Builds on planned work, not a new design |
| F6 | Health trendline + graph metrics surface: persist health-score history, surface trendline + graph.py metrics in the dashboard health card and a CLI report | Structural | Medium | S/M | Addresses SI-1/SI-2/SI-3; storage already exists |
| F7 | Capture BM25 retrieval baseline: land relevance fixtures + eval retrieval and snapshot current numbers before any ranker change | Retrieval | High | S | Addresses RQ-1; cheap insurance for the SPEC’s embedding extension point |
| F8 | Contradiction detection assist: flag when a new page’s claims contradict an existing cited page (heuristics first, LLM-judge later) | Content | Medium | L | Sequence after F1 + judge evals prove the gap with data |
| F9 | Prompt injection loads assigned wiki skills: extend prompt.py so the Available Wikis block instructs agents to skill_view the wiki’s SCHEMA.md-assigned wiki:* skills before writing | Content | High | S | Counters the bundled-skill attention asymmetry (see Integration cautions): plugin skills are explicit-load only and otherwise lose to implicit upstream guidance |
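
As an illustration of the F4 check in the table above, citation resolution reduces to a set-membership test. The page shape (a dict with `type` and `sources` keys) and the function name are assumptions for this sketch; the real check would live alongside the existing lint checks and read from the projection.

```python
def unresolved_citations(pages: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (page_id, cited_id) pairs where a `sources:` entry has no matching source page."""
    # Assumption: source pages are identified by type == "source".
    source_pages = {pid for pid, page in pages.items() if page.get("type") == "source"}
    problems = []
    for pid in sorted(pages):  # sorted for stable, diffable lint output
        for cited in pages[pid].get("sources", []):
            if cited not in source_pages:
                problems.append((pid, cited))
    return problems
```

Note that a citation pointing at an existing non-source page (for example a concept page) is also reported, since the rule requires a real source page rather than any page.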

Ordering rationale: per Anthropic’s evals-before-docs loop (see Prior Art), write the eval corpus cases for the F1 behaviors (dedup, contradiction, faithfulness — 3+ scenarios each) first or together with the F1 prose, so the skill is authored against measurable targets. Then F7 (baselines before behavior changes), then F2–F6 in leverage order. F1 remains the highest-ROI item — and is now mostly a porting job from upstream v2.1.0 rather than original authoring.


Test Suite Recommendations

| ID | Item | Covers | Effort | Strength |
|---|---|---|---|---|
| T1 | Coverage reporting + threshold: add pytest-cov to CI, report on PRs, set the floor at the observed level and ratchet | TI-1 | S | Strong |
| T2 | Golden snapshots of DefaultProcessor output: snapshot full generated pages (frontmatter + body) for each fixtures/sources/* sample, via syrupy or committed expected files | TI-2 | S/M | Strong; prerequisite for F2 |
| T3 | Property-based tests (targeted): hypothesis for exactly two invariants: (a) frontmatter write→read round-trip preserves data; (b) projection rebuild idempotency (rebuilding twice == once; rebuild-from-files == original) | TI-3 | M | Strong but deliberately narrow |
| T4 | e2e CLI tests: a handful of subprocess-level runs (create → ingest → search → lint) against a temp home | TI-4 | M | Moderate; keep small |
| T5 | React component tests: Vitest + Testing Library for the health card and inbox manager only | TI-5 | M | Optional |
| T6 | Performance benchmarks: bulk ingest of N sources and search latency at 100/500 pages (the Phase-1 target); scheduled, tracked in evals/results/, not gated | TI-6 | M | Moderate; most valuable right before the embedding ranker |

Deliberately de-prioritized: broad property testing beyond the two invariants, full-dashboard Playwright e2e, and mutation testing — the codebase’s size and risk profile don’t justify them yet.


Prioritized Roadmap

Now

| Item | Dimension | Effort | Depends on |
|---|---|---|---|
| ✅ F1: Skill synthesis/dedup/contradiction protocol (skills v1.1.0, 2026-06-07) | Content | S | |
| ✅ Verify skill precedence vs upstream bundled research-llm-wiki skill (2026-06-07; see Integration cautions; spawned F9) | Content | S | |
| ✅ F9: Prompt injection loads assigned wiki skills (write-guidance line + per-wiki override annotations, 2026-06-07) | Content | S | |
| ✅ Eval scaffold (evals/ harness, markers, runner, structural eval; CI-gated 2026-06-07) | Test/Content | M | |
| ✅ F7: BM25 retrieval baseline (evals/results/bm25-baseline.jsonl, 2026-06-07) | Retrieval | S | Eval scaffold |
| ✅ T1: Coverage floor 84% in CI (observed 86%, 2026-06-07) | Test | S | |
| ✅ T2: Golden snapshots of DefaultProcessor output (tests/golden/, 2026-06-07) | Test/Content | S | |
| ✅ F4: unresolved_citation lint check (19th check, 2026-06-07) | Content/Structural | S | |

Next

| Item | Dimension | Effort | Depends on |
|---|---|---|---|
| F2: Enrich DefaultProcessor classify/summary | Content | M | T2 |
| F3: Dedup-on-create suggestion | Content/Structural | M | |
| LLM-judge content evals + scheduled workflow | Content/Test | M | Eval scaffold, F1 |
| F6: Health trendline + graph metrics | Structural | S/M | Eval scaffold |
| T3: Property tests (frontmatter round-trip, projection idempotency) | Test | M | |
| Retrieval regression gate with tolerance | Retrieval/Test | S | F7 |

Later

| Item | Dimension | Effort | Depends on |
|---|---|---|---|
| F5: Taxonomy hooks (validate/suggest tags) | Structural | M | Phase 2 hooks |
| F8: Contradiction detection assist | Content | L | LLM-judge evals |
| T4: e2e CLI tests | Test | M | |
| T6: Performance benchmarks | Test/Retrieval | M | |
| Embedding ranker behind the eval gate | Retrieval | L | Retrieval gate, T6 |
| T5: React component tests | Test | M | |

Closing rule: the embedding ranker (a documented SPEC extension point) must not ship until the retrieval eval gate exists — without it, the ranker’s impact is unmeasurable.


Appendix

Finding index

| ID | Finding | Severity |
|---|---|---|
| CQ-1 | Naive derived-page classification (pipeline.py:1503) | Medium |
| CQ-2 | Regex first-sentence summary (pipeline.py:1569) | Medium |
| CQ-3 | Skills silent on synthesis fidelity | High |
| CQ-4 | No proactive contradiction handling | Medium |
| CQ-5 | Unmanaged confidence field | Low |
| RQ-1 | No relevance evals | High |
| RQ-2 | No ranking regression guard | Medium |
| RQ-3 | Unmeasured default recall (search.py:63) | Low |
| SI-1 | Health score not trended (lint.py:917) | Medium |
| SI-2 | No link-graph metrics (db.py:188) | Medium |
| SI-3 | No index-coverage metric | Low |
| SI-4 | Unvalidated health-score weights (lint.py:912) | Low |
| TI-1 | No coverage gate | Medium |
| TI-2 | No golden snapshots of generated pages | Medium |
| TI-3 | No property-based tests | Medium |
| TI-4 | No e2e CLI tests | Low |
| TI-5 | No React component tests | Low |
| TI-6 | No performance benchmarks | Low |
| TI-7 | No content/retrieval evals in CI | High |

Glossary

Relationship to planned work

This audit builds on rather than re-proposes existing designs: