Vector Embeddings

The second content representation, complementing the Knowledge Graph: a numeric space where proximity means similar meaning.

The Knowledge Graph encodes explicit relationships, but not every connection is written down. Vector embeddings capture patterns from language and context, surfacing implicit relations the graph may miss. They let the engine retrieve content through distributional similarity, not just direct links: diaries that speak of deportation in different words, testimonies describing parallel experiences, images that co-occur with similar narratives.

The intuition: items with related meaning land near each other, and unrelated ones land far apart. As a result, "find more like this" works on meaning rather than exact wording, and the system can bridge vocabulary, languages, and modalities. A check of the WO2 thesaurus embedding showed "Auschwitz" sitting closest to other camp names (Treblinka, Gross-Rosen, Dachau, Sachsenhausen), which indicates the space is historically sensible.

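Formally, an embedding is a function \(f_\theta\), with learned parameters \(\theta\), that maps an item from the input space \(\mathcal{X}\) to a \(d\)-dimensional real vector: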

\[ f_\theta : \mathcal{X} \rightarrow \mathbb{R}^d \]

Similarity between two items is the cosine of the angle between their vectors:

\[ \mathrm{sim}(x,y) = \frac{f_\theta(x)\cdot f_\theta(y)}{\lVert f_\theta(x)\rVert\,\lVert f_\theta(y)\rVert} \]
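
As a minimal sketch of the formula above, the following computes cosine similarity with the Python standard library only. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the engine's code; real vectors would come from an encoder.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v, i.e. sim(x, y) for u = f(x), v = f(y)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

The score depends only on direction, not on vector length: parallel vectors score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.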

Encoder families the engine can draw on and adapt:

Family	Examples	Use
Text-only	word2vec, GloVe, BERT, MPNet, Sentence-BERT	word, sentence, and document vectors
Cross-lingual	multilingual BERT, LaBSE, multilingual Sentence-BERT	align languages in one space
Multimodal	CLIP and similar dual encoders	align image and text

At corpus scale, vectors are stored in an Approximate Nearest Neighbour (ANN) index such as HNSW (Hierarchical Navigable Small World) for millisecond lookup. In this system, that index is provided by the Qdrant vector store.
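
For orientation, here is a sketch of the exact, brute-force search that an ANN index approximates: score every stored vector against the query and keep the best k. It is not how Qdrant or HNSW work internally, and the function name `top_k`, the item ids, and the 2-dimensional vectors are all hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _normalise(v: Sequence[float]) -> List[float]:
    """Scale v to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [a / norm for a in v]


def top_k(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k corpus items most similar to the query, best first."""
    q = _normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, _normalise(vec))), item_id)
        for item_id, vec in corpus.items()
    )
    # Exact search: every item is scored, so cost grows linearly with corpus size.
    best = heapq.nlargest(k, scored)
    return [(item_id, score) for score, item_id in best]


# Invented toy corpus; real vectors would have hundreds of dimensions.
corpus = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
}
neighbours = top_k([1.0, 0.1], corpus, k=2)
neighbour_ids = [item_id for item_id, _ in neighbours]
```

Because this loop touches every vector, it stops being practical as the corpus grows. An ANN index trades a small amount of recall for much faster lookup.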

The deployed prototype used spaCy en_core_web_md (300-dim GloVe). The production code uses sentence-transformers/all-MiniLM-L6-v2 via fastembed, behind the EmbeddingModel port (see Adapters).
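
The value of putting the encoder behind a port is that calling code does not change when the model does. The sketch below illustrates the idea only: the real `EmbeddingModel` interface is not shown here, so the `embed` method, the `dimension` attribute, the `HashingEmbeddingModel` stand-in, and the `index_texts` helper are all assumptions made for illustration.

```python
import hashlib
import math
from typing import Dict, List, Protocol, Sequence


class EmbeddingModel(Protocol):
    """Hypothetical shape of the port: texts in, fixed-size vectors out."""

    dimension: int

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        ...


class HashingEmbeddingModel:
    """Stand-in adapter for tests: deterministic, but carries no semantic meaning."""

    def __init__(self, dimension: int = 8) -> None:
        self.dimension = dimension

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Centre byte values around zero; the offset keeps every entry non-zero.
            raw = [digest[i % len(digest)] - 127.5 for i in range(self.dimension)]
            norm = math.sqrt(sum(x * x for x in raw))
            vectors.append([x / norm for x in raw])
        return vectors


def index_texts(model: EmbeddingModel, texts: Sequence[str]) -> Dict[str, List[float]]:
    """Caller code depends only on the port, not on a concrete encoder."""
    return dict(zip(texts, model.embed(texts)))


model = HashingEmbeddingModel(dimension=8)
index = index_texts(model, ["first text", "second text"])
```

Under these assumptions, an adapter wrapping spaCy or fastembed would expose the same two members, and `index_texts` would work unchanged with either.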

Math rendering

Formulas use LaTeX. They read: f maps an item to a d-dimensional vector; similarity is the cosine of the angle between the two vectors.