Vector Embeddings
The second content representation, complementing the Knowledge Graph: a numeric space where proximity means similar meaning.
The Knowledge Graph encodes explicit relationships, but not every connection is written down. Vector embeddings capture patterns from language and context, surfacing implicit relations the graph may miss. They let the engine retrieve content through distributional similarity, not just direct links: diaries that speak of deportation in different words, testimonies describing parallel experiences, images that co-occur with similar narratives.
The intuition: items with related meaning land near each other, unrelated ones land far apart. So "find more like this" works on meaning, and the system can bridge vocabulary, languages, and modalities. A spot check of the WO2 thesaurus embedding showed "Auschwitz" sitting closest to other camp names (Treblinka, Gross-Rosen, Dachau, Sachsenhausen), which suggests the space is historically sensible.
Formally, an embedding is a function $f$ that maps an item from the content space $\mathcal{X}$ to a $d$-dimensional vector:
$$
f : \mathcal{X} \to \mathbb{R}^d
$$
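Similarity between two items $x$ and $y$ is the cosine of the angle between their vectors $f(x)$ and $f(y)$:
$$
\operatorname{sim}(x, y) = \cos\bigl(f(x), f(y)\bigr) = \frac{f(x) \cdot f(y)}{\lVert f(x) \rVert \, \lVert f(y) \rVert}
$$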
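
As a minimal sketch of the cosine measure, the snippet below computes it in plain Python. The three-dimensional vectors and their names (`camp_a`, `camp_b`, `recipe`) are invented for illustration; real embeddings have hundreds of dimensions and come from an encoder, not from hand-written lists.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only (not real embeddings).
camp_a = [0.9, 0.1, 0.0]
camp_b = [0.8, 0.2, 0.1]
recipe = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(camp_a, camp_b)    # close to 1: similar direction
sim_unrelated = cosine_similarity(camp_a, recipe)  # close to 0: nearly orthogonal
```

Because cosine depends only on direction, scaling a vector leaves its similarity to any other vector unchanged. This is one reason embeddings are often normalised to unit length before storage.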
Encoder families the engine can draw on and adapt:
| Family | Examples | Use |
|---|---|---|
| Text-only | word2vec, GloVe, BERT, MPNet, Sentence-BERT | sentence and document vectors |
| Cross-lingual | multilingual BERT, LaBSE, multilingual Sentence-BERT | align languages in one space |
| Multimodal | CLIP and similar dual encoders | align image and text |
At corpus scale, vectors are stored in an Approximate Nearest Neighbour (ANN) index such as HNSW (Hierarchical Navigable Small World), which trades a small amount of recall for millisecond lookups. In this system, the Qdrant vector store fills that role.
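
To make the trade-off concrete, here is a sketch of the exact search that an ANN index approximates: score every stored vector against the query and keep the top `k`. This is not HNSW and not the Qdrant API. The item ids and vectors are hypothetical, and the linear scan shown here is precisely what becomes too slow at corpus scale.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity for equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def exact_top_k(
    query: Sequence[float], store: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours: O(n) comparisons per query."""
    scored = ((item_id, _cosine(query, vec)) for item_id, vec in store.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical item ids and toy vectors, for illustration only.
store = {
    "diary-1": [0.9, 0.1, 0.0],
    "testimony-7": [0.8, 0.3, 0.1],
    "photo-3": [0.1, 0.9, 0.2],
    "letter-4": [0.0, 0.2, 0.9],
}

hits = exact_top_k([1.0, 0.2, 0.0], store, k=2)
hit_ids = [item_id for item_id, _ in hits]
```

An ANN index returns approximately the same ranking while visiting only a small part of the collection, which is what keeps lookups fast as the store grows.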
The deployed prototype used spaCy `en_core_web_md` (300-dim GloVe). The production code uses `sentence-transformers/all-MiniLM-L6-v2` (384-dim) via fastembed, behind the `EmbeddingModel` port (see Adapters).
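
The value of putting the encoder behind a port is that calling code never names a concrete model. The sketch below illustrates that idea only. The `embed` signature is an assumption, not the project's actual `EmbeddingModel` interface, and `HashingEmbedder` and `index_documents` are hypothetical stand-ins, so the example runs without downloading a model.

```python
import hashlib
import math
from typing import List, Protocol, Sequence


class EmbeddingModel(Protocol):
    """Assumed shape of the port; the real interface may differ."""

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        ...


class HashingEmbedder:
    """Toy adapter: hashes tokens into a fixed-size bag-of-words vector.

    It captures token overlap only, not meaning. It exists to show the seam.
    """

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
            norm = math.sqrt(sum(x * x for x in vec))
            # Normalise to unit length; leave an all-zero vector untouched.
            vectors.append([x / norm for x in vec] if norm else vec)
        return vectors


def index_documents(model: EmbeddingModel, docs: Sequence[str]) -> List[List[float]]:
    """Caller code depends only on the port, not on a concrete encoder."""
    return model.embed(docs)


vectors = index_documents(HashingEmbedder(dim=16), ["deported by train", "train to the camp"])
```

Under this assumption, moving from the spaCy prototype to the fastembed-backed model would mean writing a new adapter with the same `embed` method, leaving callers such as `index_documents` unchanged.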
Math rendering
Formulas use LaTeX. They read: f maps an item to a d-dimensional vector; similarity is cosine of the two vectors.