Content Engine

The Content Engine is the ingestion service. It normalizes documents from any source (Omeka, CSV, JSON) into the shared ContentDocument contract, embeds them, and writes them to Qdrant. The AI Engine API only reads what this service writes.

Prerequisites

  • Qdrant running (or a reachable URL).
  • The Omeka instance is already deployed by the content team; you only need its URL when syncing.

Configure

QDRANT_API_URL=http://localhost:6333
QDRANT_API_KEY=                 # empty for local
COLLECTION_NAME=omeka-items     # must match the AI Engine API
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2   # must match the AI Engine API
INGEST_API_KEY=                 # set in production to protect write endpoints

Protect writes in production

With INGEST_API_KEY unset, POST /ingest and POST /sync/omeka accept unauthenticated requests. Set it, and send the same value in the X-API-Key header of every write request.

Run without Docker

cd content-engine
pip install -e ".[api,qdrant,embed,omeka]"
content-engine                  # uvicorn on 0.0.0.0:8002

The extras: api (FastAPI + uvicorn), qdrant (client), embed (fastembed), omeka (omeka-tools source adapter).

Run with Docker

The repo has no Dockerfile yet; a minimal one:

FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -e ".[api,qdrant,embed,omeka]"
EXPOSE 8002
CMD ["content-engine"]

Build and run it:

cd content-engine
docker build -t content-engine:local .
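# On Linux, host.docker.internal resolves only with the --add-host mapping (Docker Engine 20.10+)
docker run -d --name content-engine \
  --add-host=host.docker.internal:host-gateway \
  -e QDRANT_API_URL=http://host.docker.internal:6333 \
  -e COLLECTION_NAME=omeka-items \
  -p 8002:8002 content-engine:local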

Load content

# Ingest documents directly
curl -X POST http://localhost:8002/ingest \
  -H "Content-Type: application/json" \
  -d '[{"id":"841","title":"Forced labour","text":"...","content_type":"text_item",
        "tags":[{"facet":"theme","label":"forced_labour","weight":1.0}]}]'

# Or sync from the configured Omeka instance (background)
curl -X POST http://localhost:8002/sync/omeka \
  -H "Content-Type: application/json" -d '{"max_items":500}'

The first ingest creates the Qdrant collection and its payload indexes.
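
To illustrate the CSV path, here is a rough sketch that turns CSV rows into the JSON array used in the /ingest example above. The field names mirror that example payload only; the real ContentDocument contract may define more fields. The CSV column names (`id`, `title`, `text`, `themes`) and both helper names are hypothetical.

```python
import csv
import io
import json


def row_to_document(row: dict) -> dict:
    """Map one CSV row to the document shape shown in the /ingest example."""
    # Hypothetical convention: themes are separated by ';' in one column.
    themes = [t.strip() for t in (row.get("themes") or "").split(";") if t.strip()]
    return {
        "id": row["id"],
        "title": row["title"],
        "text": row["text"],
        "content_type": row.get("content_type") or "text_item",
        "tags": [{"facet": "theme", "label": t, "weight": 1.0} for t in themes],
    }


def csv_to_payload(csv_text: str) -> str:
    """Convert CSV text into a JSON array suitable as a POST /ingest body."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([row_to_document(row) for row in reader])


if __name__ == "__main__":
    sample = "id,title,text,themes\n841,Forced labour,...,forced_labour\n"
    print(csv_to_payload(sample))
```

The resulting string can be passed to curl with `-d @file.json` or sent from any HTTP client.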

Verify

curl http://localhost:8002/health                    # {"status":"ok"}
curl http://localhost:6333/collections/omeka-items   # collection now exists
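
In scripts, the same health check can be polled until the service is up. This is a sketch under assumptions: the function names, retry count, and delay are illustrative, and the only thing taken from this page is that /health returns {"status":"ok"}.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(
    url: str,
    fetch: Callable[[str], dict] = fetch_json,
    attempts: int = 10,
    delay: float = 1.0,
) -> bool:
    """Return True once the endpoint reports status 'ok', False after all attempts."""
    for attempt in range(attempts):
        try:
            body = fetch(url)
            if isinstance(body, dict) and body.get("status") == "ok":
                return True
        except (OSError, ValueError):
            pass  # connection refused, HTTP error, or a non-JSON body: retry
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print(wait_until_healthy("http://localhost:8002/health", attempts=3))
```

The `fetch` parameter exists so the retry logic can be exercised without a running service.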

Interactive reference: the Content Manager API.

Production

  • Run as a long-lived Deployment if you ingest at runtime, or as a CronJob if you only periodically call POST /sync/omeka.
  • Always set INGEST_API_KEY and keep /ingest off the public internet.
  • Keep COLLECTION_NAME and EMBEDDING_MODEL identical to the AI Engine API.
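
The last point matters because vectors produced by different embedding models are not comparable, and a different collection name means the reader queries a collection this service never wrote. A small pre-deploy check could look like the sketch below; the function name and the idea of comparing two environment dictionaries are assumptions, not part of either service.

```python
SHARED_KEYS = ("COLLECTION_NAME", "EMBEDDING_MODEL")


def shared_config_mismatches(
    writer_env: dict, reader_env: dict, keys: tuple = SHARED_KEYS
) -> dict:
    """Return {key: (writer_value, reader_value)} for every shared key that differs."""
    return {
        key: (writer_env.get(key), reader_env.get(key))
        for key in keys
        if writer_env.get(key) != reader_env.get(key)
    }


if __name__ == "__main__":
    content_engine = {
        "COLLECTION_NAME": "omeka-items",
        "EMBEDDING_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
    }
    # Hypothetical reader config with a drifted collection name.
    ai_engine = dict(content_engine, COLLECTION_NAME="omeka-items-v2")
    for key, (writer, reader) in shared_config_mismatches(content_engine, ai_engine).items():
        print(f"{key}: content engine={writer!r}, AI Engine API={reader!r}")
```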

Kubernetes

If you ingest at runtime, run it as a Deployment behind a ClusterIP Service; if you only periodically sync from Omeka, a CronJob is cleaner. Inject config from a Secret/ConfigMap.

apiVersion: apps/v1
kind: Deployment
metadata: { name: content-engine }
spec:
  replicas: 1
  selector: { matchLabels: { app: content-engine } }
  template:
    metadata: { labels: { app: content-engine } }
    spec:
      containers:
        - name: content-engine
          image: registry.example.com/memorise/content-engine:1.0
          ports: [{ containerPort: 8002 }]
          envFrom:
            - secretRef: { name: content-engine-secrets }   # INGEST_API_KEY, QDRANT_API_KEY
            - configMapRef: { name: content-engine-config }  # QDRANT_API_URL, COLLECTION_NAME, EMBEDDING_MODEL
          readinessProbe: { httpGet: { path: /health, port: 8002 } }
---
apiVersion: v1
kind: Service
metadata: { name: content-engine }
spec:
  selector: { app: content-engine }
  ports: [{ port: 8002 }]

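Periodic sync via a CronJob. As written, the job calls the running service's /sync/omeka endpoint over HTTP, so it still needs the Deployment and Service above: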

apiVersion: batch/v1
kind: CronJob
metadata: { name: content-engine-sync }
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
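          containers:
            - name: sync
              # Needs an image with curl; python:3.12-slim does not ship it.
              image: registry.example.com/memorise/content-engine:1.0
              envFrom:
                - secretRef: { name: content-engine-secrets }   # INGEST_API_KEY for the X-API-Key header
              command: ["sh","-c","curl -fsS -X POST http://content-engine:8002/sync/omeka -H 'Content-Type: application/json' -H \"X-API-Key: $INGEST_API_KEY\" -d '{}'"]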

In-cluster, set QDRANT_API_URL=http://qdrant:6333.