Karpathy-style LLM Wikis for the Hermes Agent — persistent, compounding knowledge bases curated by AI agents

Media Ingestion Design

Status: decisions locked 2026-06-07 (deep-research pass + design grill). This document is the binding decision record for the SPEC’s deferred “Media Processing Skills + Chunking” phase and the artifact PR0 builds against.

Method: a 108-agent deep-research pass (26 sources, 126 claims extracted, 25 adversarially verified — 21 confirmed / 4 refuted), followed by a full design walk resolving each branch in dependency order. Verified claims are cited; judgment calls are marked as such.


Decision Record

D1 — Architecture: split on the extraction/interpretation line

Principle amendment: the core pipeline is “deterministic, or version-stamped extraction; interpretation is always attributed to a model identity.” Recorded in CONTEXT.md.

D2 — Derived artifact tier

derived/<modality>/<source-id>/ holds transcripts, keyframes, extracted markdown, OCR text — plus a manifest.json per artifact set:

{"tool": "whisperx", "version": "3.1.5", "model_id": "large-v2",
 "input_sha256": "…", "created": "2026-06-07T00:00:00Z"}

Git-tracked, written only by processors, never hand-edited. Semantics: cached extraction with stamped provenance — not a projection (no auto-rebuild; a future derived_stale lint check can flag input-sha drift). Keyframes capped (default 24 frames, ~100KB target each, overridable via skill config).
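A minimal sketch of how a processor might stamp the manifest and how the future derived_stale check could compare it against the input. The function names (`sha256_file`, `build_manifest`, `is_stale`) are hypothetical and not taken from the codebase; only the five manifest keys come from the example above.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large media never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tool: str, version: str, model_id: str, input_path: Path) -> dict:
    """Return a manifest dict with the same five keys as the example manifest."""
    return {
        "tool": tool,
        "version": version,
        "model_id": model_id,
        "input_sha256": sha256_file(input_path),
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def is_stale(manifest: dict, input_path: Path) -> bool:
    """Hypothetical derived_stale check: True when the input bytes have drifted."""
    return manifest["input_sha256"] != sha256_file(input_path)
```

Because the tier is a cache rather than a projection, a stale result under this sketch would only be reported; nothing is rebuilt automatically.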

D3 — Dependency delivery: optional extras + preflight-retain

D4 — Large media: two-tier storage

Principle amendment: for large media, provenance consciously degrades from bytes-in-git to fingerprint-in-git. Recorded in CONTEXT.md.
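The bytes-versus-fingerprint split can be illustrated with a small decision function. This is a sketch under stated assumptions: the MAX_MEDIA_BYTES value below is a placeholder (the real default is not given here), the tier labels and `StoragePlan` type are hypothetical, and treating the cap as inclusive is a guess.

```python
import hashlib
from dataclasses import dataclass

# Placeholder value for illustration only; the real MAX_MEDIA_BYTES is not stated here.
MAX_MEDIA_BYTES = 25 * 1024 * 1024


@dataclass(frozen=True)
class StoragePlan:
    tier: str    # "git" = bytes committed, "fingerprint" = only the hash is committed
    sha256: str  # recorded in both tiers so provenance always has a fingerprint
    size: int


def plan_storage(data: bytes, max_bytes: int = MAX_MEDIA_BYTES) -> StoragePlan:
    """Choose a tier by size; an inclusive cap is an assumption of this sketch."""
    tier = "git" if len(data) <= max_bytes else "fingerprint"
    return StoragePlan(tier=tier, sha256=hashlib.sha256(data).hexdigest(), size=len(data))
```

Recording the hash in both tiers keeps the provenance check uniform: a reader can verify a file against its fingerprint whether or not the bytes live in git.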

D5 — Eval lanes and the micro-corpus

| Lane | Marker | Cadence | Contents |
|---|---|---|---|
| Plumbing | eval (existing) | CI, every PR | Processors run with stub extractors returning committed golden derived-sets (“extractor replay”). Asserts manifest schema, storage tiering, needs-deps retention, derived-page structure, citation anchor format. No model downloads; fully deterministic. |
| Extraction | eval_media (new) | Weekly + pre-release + on-demand | Real tools on the micro-corpus: WhisperX (tiny/base) → jiwer WER ≤ threshold vs golden transcript; DER ≤ threshold (pyannote.metrics); PySceneDetect scene-count exact; PDF parser vs golden extraction (edit-distance threshold). Thresholds absorb hardware nondeterminism. |
| Interpretation | eval_llm (existing) | Scheduled | Caption faithfulness: FaithScore-style claim-decomposition (reference-free: decompose caption into atomic facts, verify each against the image) + CLIPScore as a cheap secondary signal (documented weaknesses: negation, long captions; never a sole gate). |

Micro-corpus: committed, < 5MB total, CC0/public-domain with a LICENSES file — ~15s speech WAV, ~10s two-scene MP4, 3 PNGs (chart/screenshot/photo), a 2-page PDF with a table, a synthetic social-post HTML. Golden derived-sets double as the stub-extractor payloads — one corpus, two lanes.
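The extraction lane's WER gate against a golden transcript can be sketched as follows. The lane itself names jiwer; this sketch swaps in a pure-stdlib word-level edit distance so it runs without extra dependencies, and it applies no text normalization. The function names and any threshold passed in are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                cur[j - 1] + 1,      # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def passes_wer_gate(reference: str, hypothesis: str, threshold: float) -> bool:
    """Gate in the 'WER <= threshold' form used by the extraction lane."""
    return word_error_rate(reference, hypothesis) <= threshold
```

A threshold rather than an exact match is what lets the gate absorb the hardware nondeterminism noted in the lane description.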

D6 — Skill surface: one skill, progressive disclosure, one new kind

D7 — Provenance anchors

Anchors live in the derived artifacts as stable headings; citations remain ordinary relative-link provenance markers with a human-readable position:

No lint or schema changes — the broken-link checker already resolves relative links; the convention is prose in the modality reference files and asserted by the CI plumbing lane.

Phase 2: structured evidence: frontmatter spans ({source, t0, t1}) once a dashboard player exists to consume them.

D8 — YouTube: notes by default, captions opt-in, never AV

Verified policy wall: YouTube Developer Policies prohibit downloading/caching/storing AV copies without written approval (§III.E.1.a verbatim) and cap most stored API data at 30-day retention — both incompatible with append-only raw/. The claim that scraping is categorically banned was refuted (0-3) — unsettled, not cleared.

Config wiki.media.youtube:

Justification is quantified: reference rot hits 1 in 5 STM articles, 7 in 10 among those with web references (Klein et al., PLOS ONE 2014 — headline figures verified 3-0).

Phase 2: true WARC page capture (browsertrix-class tooling; the WARC-GPT pattern demonstrates WARC-backed knowledge bases with provenance).

D10 — PDF parser: license pre-filter, then bake-off

D11 — Build order

All seven phases shipped 2026-06-07 (v0.5.0 – v0.11.0); per-phase adaptations are recorded inline above and in the bake-off/adaptation notes.

| PR | Scope |
|---|---|
| 0: Foundations | derived/ tier + manifests (D2) · two-tier storage, MAX_MEDIA_BYTES, keep_originals (D4) · needs-deps retention (D3) · media SKILL_KIND + wiki-media-ingestion scaffold (D6) · micro-corpus + stub-extractor CI lane + eval_media marker/workflow (D5) · media classifier built-ins (extension/magic-byte) · CONTEXT/SPEC principle amendments (D1, D4) |
| 1: PDF | Bake-off → pinned winner, processor, pdf.md, page-anchor goldens |
| 2: Images | Aux-router captioning + OCR, images.md, FaithScore/CLIPScore lane |
| 3: Audio | WhisperX processor, transcript anchors, WER/DER gates, audio.md |
| 4: Video | PySceneDetect + composition of audio + image captioning, video.md |
| 5: Social | Generic unfurl + Bluesky/Mastodon adapters, social.md (parallelizable with 1–4) |
| 6: YouTube | oEmbed notes flow + captions flag, youtube.md |

Each PR lands evals-first with its lane gates.
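The "extension/magic-byte" classifier listed under PR0 might look roughly like the sketch below. The signatures are the standard ones for PNG, PDF, RIFF/WAVE and ISO base media files; the function name, the modality labels, and the extension table are hypothetical and not taken from the codebase.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension fallback table, consulted only when no signature matches.
_EXTENSIONS = {
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".pdf": "pdf",
    ".wav": "audio", ".mp3": "audio",
    ".mp4": "video",
}


def classify_media(head: bytes, filename: str = "") -> Optional[str]:
    """Classify by leading bytes first, then by extension; None if neither matches."""
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image"
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio"
    # ISO base media 'ftyp' box; this also matches audio-only .m4a files,
    # which a real classifier would need to disambiguate by brand.
    if head[4:8] == b"ftyp":
        return "video"
    return _EXTENSIONS.get(Path(filename).suffix.lower())
```

Checking bytes before the extension means a mislabeled upload (for example a PDF saved as `.png`) is routed by what it actually contains.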


Phase-2 Register

  1. Structured evidence: frontmatter spans + dashboard media player (D7)
  2. True WARC page capture for social/web sources (D9)
     2a. Generic OpenGraph unfurl for arbitrary URLs, deferred from PR5 (2026-06-07): enriching every HTML URL ingest with OG metadata changes existing article-ingestion semantics, so PR5 shipped the Bluesky/Mastodon/X adapters + read-and-note fallback only; X landed via its public oEmbed endpoint (no timestamp exposed)
  3. Cloud-ASR config escape hatch (D3)
  4. derived_stale lint check (input-sha drift against manifests) (D2)

Verified Tooling Summary

| Modality | Tool | License | Eval metric | Verification |
|---|---|---|---|---|
| Audio/video ASR | WhisperX (faster-whisper + wav2vec2 alignment + pyannote VAD/diarization) | BSD-2 | jiwer WER/CER (Apache-2.0); pyannote.metrics DER | 3-0 ×4 claims |
| Video scenes | PySceneDetect (Content/Threshold/Adaptive/Histogram detectors) | BSD | scene-count exact | 3-0 ×3 |
| Image captions | via auxiliary vision router | n/a | FaithScore (reference-free claim decomposition) + CLIPScore (secondary) | 3-0 ×4 |
| PDF | bake-off winner (docling prior) | MIT/Apache only | OmniDocBench-slice + golden edit-distance | benchmark 3-0; tools unverified |

PR3 adaptation (2026-06-07): the shipped [audio] extra pins faster-whisper alone (MIT, CTranslate2, no torch; PyAV bundles ffmpeg). Its native segment timestamps satisfy the D7 anchors; WhisperX’s additional value (forced alignment + pyannote diarization) drags the torch stack and HF-gated models, so it becomes the documented [audio-diarize] upgrade path where DER gates land. The eval_media WER threshold gate is deferred until a properly-licensed CC0 speech fixture joins the corpus (the generated tone can only smoke-test the transcription path); this is tracked in evals/corpus/media/LICENSES.md.
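To illustrate how native segment timestamps could back heading-style anchors, the sketch below renders segments as a markdown transcript. faster-whisper segments expose `start`, `end` and `text`; a local stand-in type with the same fields is used here so the example runs without a model. The `## HH:MM:SS` heading format is an assumption for illustration only, not the project's anchor convention.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Stand-in with the same start/end/text fields as a faster-whisper segment."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_markdown(segments: Iterable[Segment]) -> str:
    """Emit one heading per segment (hypothetical format) followed by its text."""
    lines = []
    for seg in segments:
        lines.append(f"## {format_timestamp(seg.start)}")
        lines.append("")
        lines.append(seg.text.strip())
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

One limitation of this particular format: two segments that start within the same second would produce identical headings, so a real scheme would need a tie-breaker to keep anchors stable.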

Refuted claims — do not build on: OmniDocBench’s exact per-element metric mapping (1-2); “scraping prohibition rules out yt-dlp entirely” (0-3); the 13–17% live-web rot rates and <25% Memento-coverage figures (refuted/split — only the headline reference-rot figures are citable).

Known gaps (judgment, not evidence): platform-API mechanics for X/Bluesky/Mastodon/LinkedIn and skill-packaging specifics for heavy native deps had no surviving verified claims; D6/D9 shapes are engineering judgment within verified constraints.