Omeka Collection
The source of truth for content: the Bergen-Belsen archive managed in Omeka, and how items are pulled from it into the pipeline.
Omeka is the content management system curators use to hold the collection: items with titles, descriptions, media, and metadata. It is the authored, human side of the system, the place where experts add and maintain material.
The engine never reads Omeka live at request time. Instead, content is extracted on a schedule into a representation tuned for retrieval (vectors + tags in Qdrant). This keeps two concerns apart:
- Curation: slow, careful, human, in Omeka.
- Retrieval: fast, machine, in the vector store.
Extraction is the bridge between them: read the authored truth, transform it into something the engine can search in milliseconds.
Tooling lives in the omeka-tools repo, a read-only Omeka Classic client plus
pipeline notebooks (extract → analyze → knowledge-graph → Qdrant) for the Bergen-Belsen
collection.
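To make "read-only client" concrete, here is a minimal sketch of paging through the Omeka Classic items endpoint using GET requests only. It is not the omeka-tools implementation: the function names (items_url, http_get_json, iter_items, element_text) are hypothetical, and it assumes the standard Omeka Classic REST API (/api/items with page, per_page, and key query parameters) and that an empty page marks the end of the collection.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def items_url(base_url: str, page: int, per_page: int = 50,
              key: Optional[str] = None) -> str:
    """Build the URL for one page of the Omeka Classic items endpoint."""
    params = {"page": page, "per_page": per_page}
    if key:  # an API key is only needed to see non-public items
        params["key"] = key
    return f"{base_url.rstrip('/')}/api/items?{urlencode(params)}"


def http_get_json(url: str) -> list:
    """Default fetcher: a plain GET, so nothing is ever written back to Omeka."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_items(base_url: str,
               fetch: Callable[[str], list] = http_get_json,
               per_page: int = 50,
               key: Optional[str] = None) -> Iterator[dict]:
    """Yield every item, one page at a time, until a page comes back empty."""
    page = 1
    while True:
        batch = fetch(items_url(base_url, page, per_page, key))
        if not batch:  # assumption: an empty list means we are past the last page
            return
        yield from batch
        page += 1


def element_text(item: dict, name: str) -> str:
    """Join all texts of one metadata element, e.g. "Title" or "Description"."""
    return "\n".join(
        entry["text"]
        for entry in item.get("element_texts", [])
        if entry["element"]["name"] == name
    )
```

Passing the fetcher in as an argument keeps the paging logic testable without a live Omeka instance; in a notebook, `for item in iter_items("https://example.org/omeka")` would walk the public items of a hypothetical installation at that address.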
flowchart LR
api["Omeka REST API"] --> client["read-only client"]
client --> parquet["items (Parquet)"]
parquet --> filter["filter by id / word_count"]
filter --> ingest["ingest_content.py"]
ingest --> qd[("Qdrant")]
classDef store fill:#EFEAE0,stroke:#A8895B,color:#423D34;
class qd store;
- Extract: read items via the Omeka client, persist to Parquet.
- Filter: drop items below a word-count floor; select by id.
- Ingest: ai_engine.search.ingest_content encodes text and upserts PointStructs with datetime + geo payload indices (see Search subsystem).
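The filter step can be sketched in a few lines. This is an illustration only, not the pipeline's code: the filter_items function, the whitespace-based word count, and the floor of 50 words are all assumptions, and rows are shown as plain dicts rather than Parquet records so the example needs no extra libraries.

```python
from typing import Iterable, List, Optional, Set


def count_words(text: str) -> int:
    """Whitespace-separated word count (assumed definition of word_count)."""
    return len(text.split())


def filter_items(rows: Iterable[dict],
                 min_words: int = 50,  # hypothetical floor, not taken from the pipeline
                 ids: Optional[Set[int]] = None) -> List[dict]:
    """Keep rows at or above the word-count floor, optionally restricted to given ids."""
    kept = []
    for row in rows:
        if count_words(row["text"]) < min_words:
            continue  # too short to be worth embedding
        if ids is not None and row["id"] not in ids:
            continue  # not in the requested selection
        kept.append(row)
    return kept
```

Calling `filter_items(rows, min_words=3, ids={3})` keeps only the row with id 3, and only if its text has at least three words; leaving `ids` as None applies the word-count floor alone.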
The item shape in code is ai_engine.common.Item (id, title, text, public_url,
locations, geo_metadata, time_metadata, files_url) with derived word_count /
image_url, rendered in the Code reference.
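For orientation, the item shape might look roughly like the dataclass below. This is a sketch, not the real ai_engine.common.Item: the field names come from the list above, but every type, every default, and the whitespace-based word_count are assumptions. The derived image_url is left out because its derivation is not described here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ItemSketch:
    """Illustrative stand-in for ai_engine.common.Item (types are guesses)."""
    id: int
    title: str
    text: str
    public_url: str
    locations: List[str] = field(default_factory=list)
    geo_metadata: Dict[str, Any] = field(default_factory=dict)
    time_metadata: Dict[str, Any] = field(default_factory=dict)
    files_url: Optional[str] = None

    @property
    def word_count(self) -> int:
        # Derived rather than stored, so it can never drift from the text.
        # Assumption: words are whitespace-separated tokens.
        return len(self.text.split())
```

Keeping word_count as a property means the filter step always sees a value consistent with the current text, for example `ItemSketch(id=1, title="t", text="one two three", public_url="u").word_count` evaluates to 3.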
In the exhibition
Bergen-Belsen material is held in a private Omeka Classic collection. Much of it was
digitised in past projects; for the MEMORISE exhibition (Nov 2024 to May 2025) new
material was added and curated into mini-exhibitions for the 3D Panoramic Display.
omeka-tools is read-only (no write support yet).