Content Engine¶
The ingestion service. It converts documents from any source (Omeka, CSV, JSON) into the shared
ContentDocument contract, embeds them, and writes them to Qdrant. The AI Engine
API only reads what this service writes.
Prerequisites¶
- Qdrant running (or a reachable URL).
- The Omeka instance is already deployed by the content team; you only need its URL when syncing.
Configure¶
QDRANT_API_URL=http://localhost:6333
QDRANT_API_KEY= # empty for local
COLLECTION_NAME=omeka-items # must match the AI Engine API
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
INGEST_API_KEY= # set in production to protect write endpoints
Protect writes in production
With INGEST_API_KEY unset, POST /ingest and POST /sync/omeka are open. Set it and
send the value in the X-API-Key header.
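As a rough illustration of the header requirement, the sketch below builds (but does not send) an authenticated POST /ingest request using only the Python standard library. The helper name build_ingest_request and the key value change-me are hypothetical; only the endpoint path and the X-API-Key header name come from this page.

```python
import json
import urllib.request


def build_ingest_request(base_url, documents, api_key=None):
    """Build (but do not send) a POST /ingest request."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Header name taken from the note above; only added when a key is configured.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ingest",
        data=json.dumps(documents).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# "change-me" is a placeholder, not a real key.
request = build_ingest_request("http://localhost:8002", [], api_key="change-me")
print(request.get_method(), request.full_url)
```

To actually send it, pass the request to urllib.request.urlopen(request, timeout=30) once the service is running.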
Run without Docker¶
cd content-engine
pip install -e ".[api,qdrant,embed,omeka]"
content-engine # uvicorn on 0.0.0.0:8002
The extras: api (FastAPI + uvicorn), qdrant (client), embed (fastembed), omeka
(omeka-tools source adapter).
Run with Docker¶
The repo has no Dockerfile yet; a minimal one:
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e ".[api,qdrant,embed,omeka]"
EXPOSE 8002
CMD ["content-engine"]
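cd content-engine
docker build -t content-engine:local .
# On Linux, host.docker.internal is not defined by default; --add-host maps it to the host gateway (Docker 20.10+).
docker run -d --name content-engine \
  --add-host=host.docker.internal:host-gateway \
  -e QDRANT_API_URL=http://host.docker.internal:6333 \
  -e COLLECTION_NAME=omeka-items \
  -p 8002:8002 content-engine:local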
Load content¶
# Ingest documents directly
curl -X POST http://localhost:8002/ingest \
-H "Content-Type: application/json" \
-d '[{"id":"841","title":"Forced labour","text":"...","content_type":"text_item",
"tags":[{"facet":"theme","label":"forced_labour","weight":1.0}]}]'
# Or sync from the configured Omeka instance (background)
curl -X POST http://localhost:8002/sync/omeka \
-H "Content-Type: application/json" -d '{"max_items":500}'
The first ingest creates the Qdrant collection and its payload indexes.
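The ingest payload above can also be assembled programmatically. The sketch below models only the fields visible in the curl example (id, title, text, content_type, tags); the real ContentDocument contract may define more fields or different defaults, and the class and function names here are illustrative rather than the service's own.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Tag:
    facet: str
    label: str
    weight: float = 1.0  # assumed default, mirroring the example value


@dataclass
class ContentDocument:
    # Only the fields shown in the curl example; the real contract may have more.
    id: str
    title: str
    text: str
    content_type: str
    tags: list[Tag] = field(default_factory=list)


def to_ingest_body(documents):
    """Serialize documents to the JSON array that POST /ingest receives above."""
    return json.dumps([asdict(doc) for doc in documents])


doc = ContentDocument(
    id="841",
    title="Forced labour",
    text="...",
    content_type="text_item",
    tags=[Tag(facet="theme", label="forced_labour")],
)
body = to_ingest_body([doc])
print(body)
```

The resulting string can be passed to curl with -d or sent as the request body from any HTTP client.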
Verify¶
curl http://localhost:8002/health # {"status":"ok"}
curl http://localhost:6333/collections/omeka-items # collection now exists
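In scripts or CI it can help to wait for the health check above to pass before ingesting. The following is a minimal sketch, assuming /health returns the JSON body shown in the comment above; the function names, retry count, and delay are arbitrary choices for illustration.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_healthy(base_url, attempts=10, delay=1.0, fetch=fetch_json, sleep=time.sleep):
    """Poll GET /health until it reports {"status": "ok"} or attempts run out."""
    for attempt in range(attempts):
        try:
            payload = fetch(f"{base_url}/health")
            if isinstance(payload, dict) and payload.get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (requires the service to be running):
#   wait_until_healthy("http://localhost:8002")
```

The fetch and sleep parameters are injectable so the polling logic can be exercised without a running service.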
Interactive reference: the Content Manager API.
Production¶
- Run as a long-lived Deployment if you ingest at runtime, or as a CronJob if you only
  periodically call POST /sync/omeka.
- Always set INGEST_API_KEY and keep /ingest off the public internet.
- Keep COLLECTION_NAME and EMBEDDING_MODEL identical to the AI Engine API.
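The two settings must match because vectors written with one embedding model are not comparable to queries embedded with another, and a different collection name means the reader looks in the wrong place. A pre-deploy check could be sketched as below, assuming both services are configured from KEY=value env files; the helper names and the second model name are hypothetical, chosen only to show a mismatch.

```python
SHARED_KEYS = ("COLLECTION_NAME", "EMBEDDING_MODEL")


def parse_env(text):
    """Parse KEY=value lines, dropping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # strip trailing inline comments
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_mismatches(content_engine_env, ai_engine_env, keys=SHARED_KEYS):
    """Return {key: (content_engine_value, ai_engine_value)} for keys that differ."""
    left, right = parse_env(content_engine_env), parse_env(ai_engine_env)
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


content_engine_env = """
COLLECTION_NAME=omeka-items # must match the AI Engine API
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
"""
# Deliberately mismatched model, for illustration only.
ai_engine_env = """
COLLECTION_NAME=omeka-items
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
"""
mismatches = find_mismatches(content_engine_env, ai_engine_env)
print(mismatches)
```

An empty result means the shared keys agree; anything else is worth failing a deploy over.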
Kubernetes¶
If you ingest at runtime, run it as a Deployment behind a ClusterIP Service; if you only periodically sync from Omeka, a CronJob is cleaner. Inject config from a Secret/ConfigMap.
apiVersion: apps/v1
kind: Deployment
metadata: { name: content-engine }
spec:
  replicas: 1
  selector: { matchLabels: { app: content-engine } }
  template:
    metadata: { labels: { app: content-engine } }
    spec:
      containers:
        - name: content-engine
          image: registry.example.com/memorise/content-engine:1.0
          ports: [{ containerPort: 8002 }]
          envFrom:
            - secretRef: { name: content-engine-secrets } # INGEST_API_KEY, QDRANT_API_KEY
            - configMapRef: { name: content-engine-config } # QDRANT_API_URL, COLLECTION_NAME, EMBEDDING_MODEL
          readinessProbe: { httpGet: { path: /health, port: 8002 } }
---
apiVersion: v1
kind: Service
metadata: { name: content-engine }
spec:
  selector: { app: content-engine }
  ports: [{ port: 8002 }]
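Periodic sync on a schedule. The CronJob triggers the sync over HTTP, so the content-engine Service above must still be running:
apiVersion: batch/v1
kind: CronJob
metadata: { name: content-engine-sync }
spec:
  schedule: "0 3 * * *" # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sync
              # Assumption: an image built from the python:3.12-slim Dockerfile above has no curl,
              # so this sketch uses the public curlimages/curl image; pin a tag you trust.
              image: curlimages/curl:latest
              envFrom:
                - secretRef: { name: content-engine-secrets } # INGEST_API_KEY
              command: ["sh","-c","curl -fsS -X POST http://content-engine:8002/sync/omeka -H 'Content-Type: application/json' -H \"X-API-Key: $INGEST_API_KEY\" -d '{}'"]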
In-cluster, set QDRANT_API_URL=http://qdrant:6333.