Introducing OpenViking

Context is the hardest part of building a personal AI assistant. Not the AI model, not the infra — context. Knowing what to remember, how much detail to surface, and when to stop loading more.

My home lab setup runs through iujinwee-hermes, a Hermes-backed agent that takes Telegram messages and acts on the vault. The missing piece was memory — some way for the agent to recall relevant context across conversations without loading the entire vault into every prompt. That’s where OpenViking comes in.

What is OpenViking

OpenViking is a context database developed by ByteDance that unifies different types of memory into a single, queryable store.

It’s designed to slot in as a Hermes-compatible memory provider, meaning I can swap in a different memory provider later without changing how the rest of the system is wired.

Value of OpenViking

OpenViking offers semantic retrieval with progressive context loading. You don’t get a flat blob of text dumped into the context window. You get layered results: surface-level summaries first, deeper content only when you ask for it.

Architecture of OpenViking

Dual-Storage Layer

OpenViking’s storage layer is deliberately minimal. It handles:

  • AGFS Content Storage: L0/L1/L2 full content, multimedia files, relations
  • Vector Index Storage: URIs, vectors, and metadata used for lookups

Clear Separation of Concerns

The separation pays off in a few ways. The vector index stays lean — it never touches file content, just pointers — so memory pressure stays low even as the vault grows. AGFS owns the data; the index owns the lookups.
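To make the split concrete, here is a minimal sketch of the two layers. The class names, the `vault://` URI scheme, and the dot-product ranking are all my own placeholders for illustration, not OpenViking’s actual API. The point is only that the index hands back references and the content store is the single place content is read from.

```python
from dataclasses import dataclass, field


@dataclass
class ContentStore:
    """Stand-in for the content layer: owns the actual data, keyed by URI."""
    _blobs: dict[str, str] = field(default_factory=dict)

    def write(self, uri: str, content: str) -> None:
        self._blobs[uri] = content

    def read(self, uri: str) -> str:
        return self._blobs[uri]


@dataclass
class VectorIndex:
    """Stand-in for the index layer: stores (uri, vector) pairs, never content."""
    _entries: list[tuple[str, list[float]]] = field(default_factory=list)

    def add(self, uri: str, vector: list[float]) -> None:
        self._entries.append((uri, vector))

    def search(self, query: list[float], k: int = 3) -> list[str]:
        # Rank entries by dot product with the query and return URIs only.
        scored = sorted(
            self._entries,
            key=lambda e: sum(q * v for q, v in zip(query, e[1])),
            reverse=True,
        )
        return [uri for uri, _ in scored[:k]]


store = ContentStore()
index = VectorIndex()

# Hypothetical URIs and hand-written toy vectors, for illustration only.
store.write("vault://notes/backup", "Backups run nightly at 02:00.")
store.write("vault://notes/dns", "Pi-hole handles DNS for the LAN.")
index.add("vault://notes/backup", [1.0, 0.0])
index.add("vault://notes/dns", [0.0, 1.0])

# The lookup returns references; content always comes from the one store.
refs = index.search([0.9, 0.1], k=1)
results = [store.read(uri) for uri in refs]
```

Because `VectorIndex` never holds text, either class could be replaced without the other noticing, as long as the URIs stay stable.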

Single Data Source

There’s also no “search copy vs real copy” drift problem. Every retrieval path ends at the same AGFS read, so what the agent sees is always the canonical version of the content.

Query
  └─▶ Vector Index (refs only)
         └─▶ Candidate refs
               └─▶ AGFS (content fetch)
                     └─▶ Reranker
                           └─▶ Ranked results → Agent

Independent Scaling with Rust

And because the two layers scale independently, swapping AGFS for a faster implementation doesn’t break retrieval — and vice versa.

Worth noting: ByteDance has already done exactly this, rewriting AGFS in Rust (now RAGFS) for lower latency without touching the vector layer.

Three-Layer Information Model

The most interesting part of OpenViking is the L0–L1–L2 layering system for stored content.

| Layer | What it holds | When it’s loaded |
|-------|---------------|------------------|
| L0 | High-level summary / title | Always (minimal tokens) |
| L1 | Key facts and structure | On relevance hit |
| L2 | Full content | On explicit deep-load request |

This matters because token budget is real. A retrieval step that always pulls L2 content is expensive and noisy. The layered model lets the agent start with L0 results, identify what’s actually relevant, then drill down — only loading L2 for the handful of items that actually matter.

Paying for tokens you’re not using is the slow drain that kills long-running agents. Progressive loading fixes it.
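Here is a rough sketch of what progressive loading buys you. The `Entry` type, the `build_context` helper, and the whitespace-based token count are assumptions of mine for illustration; OpenViking’s real data model and tokenizer will differ.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    uri: str
    l0: str  # high-level summary / title
    l1: str  # key facts and structure
    l2: str  # full content


def count_tokens(text: str) -> int:
    # Crude whitespace count; a real tokenizer would give different numbers.
    return len(text.split())


def build_context(entries: list[Entry], relevant: set[str], deep: set[str]) -> str:
    """Use L0 by default, L1 for relevance hits, L2 only on explicit request."""
    parts = []
    for e in entries:
        if e.uri in deep:
            parts.append(e.l2)
        elif e.uri in relevant:
            parts.append(e.l1)
        else:
            parts.append(e.l0)
    return "\n".join(parts)


entries = [
    Entry(
        "vault://notes/backup",
        "Backup policy",
        "Nightly at 02:00, kept 7 days",
        "Backups run nightly at 02:00 to the NAS and are kept for 7 days "
        "before old snapshots are pruned.",
    ),
    Entry(
        "vault://notes/dns",
        "DNS setup",
        "Pi-hole serves the LAN",
        "Pi-hole runs in a container and serves DNS for every device on the LAN.",
    ),
]

shallow = build_context(entries, relevant=set(), deep=set())
targeted = build_context(entries, relevant={"vault://notes/backup"}, deep=set())
deep = build_context(entries, relevant=set(), deep={"vault://notes/backup"})

for name, ctx in [("shallow", shallow), ("targeted", targeted), ("deep", deep)]:
    print(name, count_tokens(ctx))
```

Even on two toy notes the token count grows at each step, and only the note that was explicitly deep-loaded pays the L2 price.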

Two-Stage Retrieval

Retrieval works in two passes:

  1. Vector search — casts a wide net; pulls candidate results using embedding similarity
  2. Rerank — tightens the shortlist; scores candidates against the actual query for precision

The split matters.

Vector search is fast but imprecise — good at recall, weak at precision. Reranking flips that, at the cost of latency. Running both in sequence gets you the best of both: broad recall from the vector pass, accuracy from rerank.
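The two passes can be sketched in a few lines. This is a toy under stated assumptions: the vectors are hand-written rather than produced by an embedding model, and the reranker is a simple word-overlap score standing in for a real rerank model. None of the function names come from OpenViking.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec: list[float], docs: list[dict], k: int) -> list[dict]:
    """Stage 1: wide net. Rank by embedding similarity and keep the top k."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[dict], top_n: int) -> list[dict]:
    """Stage 2: score each candidate against the query text itself.

    Toy scorer: the fraction of query words present in the document.
    """
    q_words = set(query.lower().split())

    def score(doc: dict) -> float:
        d_words = set(doc["text"].lower().split())
        return len(q_words & d_words) / len(q_words)

    return sorted(candidates, key=score, reverse=True)[:top_n]


docs = [
    {"text": "nightly backup schedule for the nas", "vec": [0.9, 0.1]},
    {"text": "backup restore drill notes", "vec": [0.8, 0.3]},
    {"text": "dns records for the lan", "vec": [0.1, 0.9]},
]
query, query_vec = "backup restore", [1.0, 0.2]

candidates = vector_search(query_vec, docs, k=2)  # drops the DNS note
final = rerank(query, candidates, top_n=1)        # promotes the restore note
```

In this example the vector pass ranks the schedule note first, and the rerank pass moves the restore note ahead of it because it matches both query words. That reordering is the precision gain the second stage is there for.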

Precision

Focus: a measure of quality. In the context of RAG, high precision means that most of the results returned are relevant, with few irrelevant ones mixed in.

Recall

Focus: a measure of sensitivity (coverage). In the context of RAG, high recall means that most of the relevant results are returned, with few missed.
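Both metrics are easy to compute once you have a set of known-relevant items. A small worked example, assuming retrieved IDs are unique (the IDs here are made up):

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = sum(1 for r in retrieved if r in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["a", "b", "c", "d"]  # what the retriever returned
relevant = {"a", "b", "e"}        # what it should have found

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four returned items are relevant, so precision is 2/4. Two of the three relevant items were found, so recall is 2/3.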

Embedding Model

Right now, OpenViking is configured with Google’s gemini-embedding-2-preview for generating embeddings. It’s a solid default: it handles long-context documents well, and the quality of its semantic similarity is noticeably better than older-generation models.

The tradeoff is obvious: every embedding call goes out to Google’s API. For a home lab that’s supposed to own its compute, that’s a temporary concession.

Future Plans

Migrate to Local Self-hosted Models

Once my Mac Mini (M5) arrives, the plan is to migrate embeddings to a local self-hosted model via Ollama. Something like nomic-embed-text or mxbai-embed-large running on-device — no external API calls, no data leaving the home network.

The rest of the stack stays the same. OpenViking abstracts the embedding provider cleanly enough that swapping models means changing a config, not rewriting the retrieval pipeline.
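As a sketch of why that kind of swap stays cheap, here is the general pattern: the pipeline depends on a small interface, and a config value picks the implementation. Everything below is hypothetical, including the `embedding_provider` key and both embedder classes, which return fake vectors instead of calling any real service. It is not OpenViking’s actual config format.

```python
from typing import Callable, Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class HostedEmbedder:
    """Placeholder for an API-backed embedder (no network call is made here)."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text)), 0.0]


class LocalEmbedder:
    """Placeholder for an on-device embedder, e.g. one served by Ollama."""

    def embed(self, text: str) -> list[float]:
        return [0.0, float(len(text))]


# Hypothetical registry mapping a config value to an implementation.
PROVIDERS: dict[str, Callable[[], Embedder]] = {
    "hosted": HostedEmbedder,
    "local": LocalEmbedder,
}


def make_embedder(config: dict) -> Embedder:
    return PROVIDERS[config["embedding_provider"]]()


def index_document(text: str, embedder: Embedder) -> list[float]:
    # The pipeline depends only on the Embedder interface.
    return embedder.embed(text)


before = index_document("hello", make_embedder({"embedding_provider": "hosted"}))
after = index_document("hello", make_embedder({"embedding_provider": "local"}))
```

One caveat: vectors from different models aren’t comparable, so a swap like this generally also means re-embedding whatever is already in the index.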


Thanks for reading! Memory is one of those problems that sounds solved until you’re actually building with it — OpenViking is the first piece of infra here that genuinely feels like it was designed for agents, not retrofitted.