Code Reference, Recsys package¶

Auto-generated from source by mkdocstrings. Signatures, type annotations, fields, and docstrings are rendered directly from ai_engine.recsys.

Contracts¶

Models¶

models ¶

Tag ¶

Bases: BaseModel

One expert tag on a piece of content. facet is a taxonomy dimension (e.g. 'theme_what', 'person_who.age_group'); label is the value within that dimension.

Content ¶

Bases: BaseModel

Normalized item. Supersedes the loose Qdrant payload dicts.

InteractionEvent ¶

Bases: BaseModel

Canonical event. EVERY source (RudderStack/PostHog/Postgres) normalizes to this.

UserSignals ¶

Bases: BaseModel

THE USER MODEL. Everything the recommender needs about a user, derived from events.

Enums¶

enums ¶

EndReason ¶

Bases: str, Enum

How a content view ended (RudderStack CONTENT_VIEW_ENDED.details.reason).

Config¶

config ¶

EngagementWeights ¶

Bases: BaseModel

How much each behavioral signal contributes to engagement strength.

FusionWeights ¶

Bases: BaseModel

How much each scorer contributes to the final fused score. Each scorer returns a value in [0, 1].

RecConfig ¶

Bases: BaseModel

All tunables in one typed place so tests pin behavior by passing a config.

Ports¶

ports ¶

EmbeddingModel ¶

Bases: Protocol

Maps text to a vector. The real implementation uses fastembed; the test fake is deterministic.
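
A deterministic test fake for this port could look like the sketch below. The method name `embed`, the class name, and the hash-based scheme are assumptions made for illustration; the protocol's actual method signature is not rendered on this page.

```python
import hashlib
import math


class FakeEmbeddingModel:
    """Hypothetical test fake: the same text always yields the same unit vector."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def embed(self, text: str) -> list[float]:  # assumed method name
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map each byte to roughly [-0.5, 0.5]; a byte can never map to exactly 0.
        raw = [digest[i] / 255.0 - 0.5 for i in range(self.dim)]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]


model = FakeEmbeddingModel(dim=8)
first = model.embed("hello")
second = model.embed("hello")
other = model.embed("world")
```

Because the vector depends only on the input text, tests that use this fake need no model download and give the same result on every run.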

EventSource ¶

Bases: Protocol

Raw user data (RudderStack/PostHog/Postgres) -> canonical events.

The adapter is responsible for normalization, so downstream logic never sees source-specific shapes. In the online (path B) setup this is a Redis-backed hot buffer fed by the ingestion webhook; in batch it is a warehouse query.

DemographicsProvider ¶

Bases: Protocol

Supplies a user's survey demographics (age/gender/nationality) for the cold-start tag bridge. Source is pluggable: Postgres visitor table, survey events, or a static map. Returns {} when unknown.

UserModelStore ¶

Bases: Protocol

Materialized user model (UserSignals) for online serving.

Path B: the ingestion webhook updates this on each event, so a recommendation request is a fast read rather than a rebuild. Because the in-memory fake and the recompute-backed implementation both satisfy this protocol, a Redis-backed store can be swapped in without touching the recommender.
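
The drop-in property can be sketched with a structural protocol and an in-memory fake. The method names `get` and `put`, the use of a plain dict in place of UserSignals, and the `serve` function are hypothetical; the real protocol's methods are not rendered on this page.

```python
from typing import Optional, Protocol


class UserModelStoreSketch(Protocol):
    """Hypothetical shape of the port; method names are assumptions."""

    def get(self, user_id: str) -> Optional[dict]: ...

    def put(self, user_id: str, signals: dict) -> None: ...


class InMemoryUserModelStore:
    """Test fake: satisfies the protocol structurally, with no inheritance."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, user_id: str) -> Optional[dict]:
        return self._data.get(user_id)

    def put(self, user_id: str, signals: dict) -> None:
        self._data[user_id] = signals


def serve(store: UserModelStoreSketch, user_id: str) -> dict:
    """Caller depends only on the protocol, so any conforming store works."""
    return store.get(user_id) or {"user_id": user_id, "cold_start": True}


store = InMemoryUserModelStore()
before = serve(store, "u1")
store.put("u1", {"user_id": "u1", "positives": {"c1": 0.7}})
after = serve(store, "u1")
```

A Redis-backed class exposing the same two methods could be passed to `serve` unchanged, which is the sense in which the store is swappable.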

ContentStore ¶

Bases: Protocol

Content structure + vectors (Qdrant). Test fake = in-memory.

Signals¶

engagement¶

engagement ¶

Pure engagement scoring. No IO. Input = plain numbers, output = float/enum.

These functions are the easiest thing to validate: feed known numbers, assert the behavior the design promises (longer dwell -> higher, abandon -> negative, ...).

estimate_reading_time ¶

estimate_reading_time(
    word_count: int, has_image: bool, cfg: RecConfig
) -> float

Seconds a typical visitor needs to consume this content.

Source code in ai-engine\src\ai_engine\recsys\signals\engagement.py

def estimate_reading_time(word_count: int, has_image: bool, cfg: RecConfig) -> float:
    """Seconds a typical visitor needs to consume this content."""
    base = word_count / cfg.reading_speed_wps if cfg.reading_speed_wps > 0 else 0.0
    if has_image:
        base += cfg.img_extra_time
    return base
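
A worked example of the function above, using a stand-in config. `FakeConfig` and its values (4 words per second, 5 extra seconds per image) are assumptions for illustration; only the field names `reading_speed_wps` and `img_extra_time` come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeConfig:
    """Stand-in for RecConfig holding only the two fields the function reads."""

    reading_speed_wps: float = 4.0  # assumed value, words per second
    img_extra_time: float = 5.0     # assumed value, seconds


def estimate_reading_time(word_count: int, has_image: bool, cfg: FakeConfig) -> float:
    # Same logic as the rendered source: words / speed, plus a flat image bonus.
    base = word_count / cfg.reading_speed_wps if cfg.reading_speed_wps > 0 else 0.0
    if has_image:
        base += cfg.img_extra_time
    return base


cfg = FakeConfig()
text_only = estimate_reading_time(200, False, cfg)   # 200 / 4.0
with_image = estimate_reading_time(200, True, cfg)   # adds img_extra_time
zero_speed = estimate_reading_time(200, True, FakeConfig(reading_speed_wps=0.0))
```

With these assumed values the results are 50.0, 55.0, and 5.0 seconds. The last case shows that a non-positive reading speed zeroes the text component while the image time is still added.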

engagement_strength ¶

engagement_strength(
    *,
    dwell_seconds: Optional[float],
    est_reading_time: float,
    end_reason: Optional[EndReason],
    visits: int,
    survey_rating: Optional[float],
    cfg: RecConfig,
) -> float

Continuous engagement in roughly [-1, 1]. Weighted blend of behavioral signals.

Source code in ai-engine\src\ai_engine\recsys\signals\engagement.py

def engagement_strength(
    *,
    dwell_seconds: Optional[float],
    est_reading_time: float,
    end_reason: Optional[EndReason],
    visits: int,
    survey_rating: Optional[float],
    cfg: RecConfig,
) -> float:
    """Continuous engagement in roughly [-1, 1]. Weighted blend of behavioral signals."""
    w = cfg.engagement
    completion = _COMPLETION.get(end_reason, 0.0)
    strength = (
        w.dwell * _dwell_ratio(dwell_seconds, est_reading_time, cfg)
        + w.completion * completion
        + w.revisit * _revisit(visits)
        + w.survey * _survey(survey_rating)
    )
    return strength
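
The blend can be exercised with stand-ins for the private helpers, whose bodies are not rendered here. Everything in this sketch is an assumption: the weights, the end-reason labels, and the shapes of `dwell_ratio`, `revisit`, and `survey` are illustrative choices, not the package's definitions.

```python
from typing import Optional

# Assumed weights and end-reason labels, for illustration only.
WEIGHTS = {"dwell": 0.4, "completion": 0.3, "revisit": 0.1, "survey": 0.2}
COMPLETION = {"completed": 1.0, "abandoned": -1.0}


def dwell_ratio(dwell: Optional[float], est: float) -> float:
    """Share of the estimated reading time actually spent, capped at 1."""
    if dwell is None or est <= 0:
        return 0.0
    return min(dwell / est, 1.0)


def revisit(visits: int) -> float:
    """0 for a single visit, approaching 1 as visits grow."""
    return 1.0 - 1.0 / max(visits, 1)


def survey(rating: Optional[float]) -> float:
    """Map an assumed 1..5 rating onto [-1, 1]; a missing rating adds nothing."""
    if rating is None:
        return 0.0
    return (rating - 3.0) / 2.0


def strength(dwell, est, end_reason, visits, rating) -> float:
    # Weighted sum of the four signals, mirroring the structure of the source.
    return (
        WEIGHTS["dwell"] * dwell_ratio(dwell, est)
        + WEIGHTS["completion"] * COMPLETION.get(end_reason, 0.0)
        + WEIGHTS["revisit"] * revisit(visits)
        + WEIGHTS["survey"] * survey(rating)
    )


short_read = strength(10.0, 50.0, "completed", 1, None)
long_read = strength(40.0, 50.0, "completed", 1, None)
abandoned = strength(2.0, 50.0, "abandoned", 1, None)
```

Under these assumptions a longer dwell scores higher than a shorter one and an abandoned view comes out negative, which are the properties the module docstring says tests should assert.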

signal_builder¶

signal_builder ¶

Pure construction of the USER MODEL (UserSignals) from events + content structure.

events (+ content tags/vectors) -> UserSignals. No IO: the caller fetches content and vectors and passes them in. now is passed in too, so the function is fully deterministic and testable.
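
Passing now explicitly is what makes time-dependent weighting testable. The builder calls a `_decay(ts, now, half_life_days)` helper whose body is not rendered on this page; the sketch below assumes plain exponential half-life decay, which may differ from the actual implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def decay(ts: Optional[datetime], now: datetime, half_life_days: float) -> float:
    """Assumed form: the weight halves every `half_life_days` of age."""
    if ts is None or half_life_days <= 0:
        return 1.0  # assumption: no timestamp or no half-life means no decay
    age_days = max((now - ts).total_seconds(), 0.0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


now = datetime(2024, 1, 31, 12, 0, 0)
fresh = decay(now, now, 30.0)
one_half_life = decay(now - timedelta(days=30), now, 30.0)
two_half_lives = decay(now - timedelta(days=60), now, 30.0)
```

Since no clock is read inside the function, the three values are the same on every run: 1.0, 0.5, and 0.25.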

ViewAggregate `dataclass` ¶

ViewAggregate(
    content_id: str,
    dwell_seconds: Optional[float] = None,
    visits: int = 0,
    end_reason: Optional[EndReason] = None,
    last_ts: Optional[datetime] = None,
    survey_rating: Optional[float] = None,
)

All views of one content folded together.

aggregate_views ¶

aggregate_views(
    events: Sequence[InteractionEvent],
) -> dict[str, ViewAggregate]

Group events by content_id and pair start/end into dwell.

Robust to path B (start and end arrive as separate webhook events) and to sources that already carry dwell_seconds on the end event.

Source code in ai-engine\src\ai_engine\recsys\signals\signal_builder.py

def aggregate_views(events: Sequence[InteractionEvent]) -> dict[str, ViewAggregate]:
    """Group events by content_id and pair start/end into dwell.

    Robust to path B (start and end arrive as separate webhook events) and to
    sources that already carry dwell_seconds on the end event.
    """
    by_content: dict[str, list[InteractionEvent]] = {}
    for e in events:
        if e.content_id is None:
            continue
        by_content.setdefault(e.content_id, []).append(e)

    out: dict[str, ViewAggregate] = {}
    for cid, evs in by_content.items():
        agg = ViewAggregate(content_id=cid)
        starts = [e for e in evs if e.event == _VIEW_START]
        ends = [e for e in evs if e.event == _VIEW_END]
        agg.visits = max(len(starts), 1)

        explicit = [e.dwell_seconds for e in evs if e.dwell_seconds is not None]
        if explicit:
            agg.dwell_seconds = max(explicit)
        elif starts and ends:
            span = max(e.ts for e in ends) - min(e.ts for e in starts)
            agg.dwell_seconds = max(span.total_seconds(), 0.0)

        if ends:
            last_end = max(ends, key=lambda e: e.ts)
            agg.end_reason = last_end.end_reason

        agg.last_ts = max(e.ts for e in evs)

        ratings = [
            float(e.survey_answers["rating"])
            for e in evs
            if isinstance(e.survey_answers, dict) and "rating" in e.survey_answers
        ]
        if ratings:
            agg.survey_rating = ratings[-1]

        out[cid] = agg
    return out
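
The dwell rule in the listing above can be isolated and tried on small inputs. The `Event` dataclass and the event-name strings are hypothetical stand-ins for InteractionEvent and the private `_VIEW_START` / `_VIEW_END` constants; the branching follows the rendered source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

VIEW_START, VIEW_END = "view_start", "view_end"  # hypothetical event names


@dataclass
class Event:
    """Minimal stand-in for InteractionEvent."""

    event: str
    ts: datetime
    dwell_seconds: Optional[float] = None


def dwell_for(evs: list[Event]) -> Optional[float]:
    """Explicit dwell wins; otherwise span from earliest start to latest end."""
    explicit = [e.dwell_seconds for e in evs if e.dwell_seconds is not None]
    if explicit:
        return max(explicit)
    starts = [e.ts for e in evs if e.event == VIEW_START]
    ends = [e.ts for e in evs if e.event == VIEW_END]
    if starts and ends:
        return max((max(ends) - min(starts)).total_seconds(), 0.0)
    return None


t0 = datetime(2024, 1, 1, 12, 0, 0)
paired = dwell_for([Event(VIEW_START, t0), Event(VIEW_END, t0 + timedelta(seconds=42))])
carried = dwell_for([Event(VIEW_END, t0, dwell_seconds=17.5)])
start_only = dwell_for([Event(VIEW_START, t0)])
two_visits = dwell_for([
    Event(VIEW_START, t0),
    Event(VIEW_END, t0 + timedelta(seconds=10)),
    Event(VIEW_START, t0 + timedelta(seconds=100)),
    Event(VIEW_END, t0 + timedelta(seconds=110)),
])
```

The last case is worth noting: with two separate visits and no explicit dwell, the span runs from the first start to the last end, so the result (110 seconds here) includes the gap between visits.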

build_user_signals ¶

build_user_signals(
    *,
    user_id: str,
    events: Sequence[InteractionEvent],
    contents: dict[str, Content],
    vectors: dict[str, Vector],
    now: datetime,
    cfg: RecConfig,
    demographics: Optional[dict] = None,
) -> UserSignals

Fold events + content structure into the user model.

Source code in ai-engine\src\ai_engine\recsys\signals\signal_builder.py

def build_user_signals(
    *,
    user_id: str,
    events: Sequence[InteractionEvent],
    contents: dict[str, Content],
    vectors: dict[str, Vector],
    now: datetime,
    cfg: RecConfig,
    demographics: Optional[dict] = None,
) -> UserSignals:
    """Fold events + content structure into the user model."""
    aggs = aggregate_views(events)

    positives: dict[str, float] = {}
    negatives: dict[str, float] = {}
    tag_affinity: dict[str, float] = {}
    tag_aversion: dict[str, float] = {}

    engaged_ids = set(aggs.keys())

    for cid, agg in aggs.items():
        content = contents.get(cid)
        est = estimate_reading_time(
            content.word_count if content else 0,
            content.has_image if content else False,
            cfg,
        )
        strength = engagement_strength(
            dwell_seconds=agg.dwell_seconds,
            est_reading_time=est,
            end_reason=agg.end_reason,
            visits=agg.visits,
            survey_rating=agg.survey_rating,
            cfg=cfg,
        )
        outcome = classify_outcome(strength, cfg)
        decay = _decay(agg.last_ts, now, cfg.half_life_days)

        if outcome == Outcome.positive:
            positives[cid] = max(strength, 0.0) * decay
            if content:
                for tag in content.tags:
                    tag_affinity[tag.key] = tag_affinity.get(tag.key, 0.0) + positives[cid] * tag.weight
        elif outcome == Outcome.negative:
            negatives[cid] = abs(strength) * decay
            if content:                       # the THEMES of disliked content -> aversion
                for tag in content.tags:
                    tag_aversion[tag.key] = tag_aversion.get(tag.key, 0.0) + negatives[cid] * tag.weight

    # soft negatives: shown in an impression set but never engaged
    for e in events:
        for imp in e.impressions:
            if imp not in engaged_ids:  # positives is a subset of engaged_ids
                pen = cfg.soft_negative_weight * _decay(e.ts, now, cfg.half_life_days)
                negatives[imp] = max(negatives.get(imp, 0.0), pen)

    # survey + identify events -> demographics + person_who/persona affinity
    from ..survey import DEMOGRAPHIC_EVENTS, survey_affinity, extract_demographics
    survey_demo: dict = {}
    for e in events:
        if e.event in DEMOGRAPHIC_EVENTS and e.survey_answers:
            survey_demo.update(extract_demographics(e.survey_answers))
            for key, w in survey_affinity(e.survey_answers).items():
                tag_affinity[key] = tag_affinity.get(key, 0.0) + w

    # explicit demographic affinity (cold-start seed; person_who facets)
    demographics = {**survey_demo, **(demographics or {})}
    if demographics:
        for key, w in _demographic_affinity(demographics).items():
            tag_affinity[key] = tag_affinity.get(key, 0.0) + w

    # taste vector = weighted centroid of positively-engaged content vectors
    taste_vector: Optional[list[float]] = None
    acc: Optional[list[float]] = None
    for cid, w in positives.items():
        v = vectors.get(cid)
        if not v:
            continue
        if acc is None:
            acc = [0.0] * len(v)
        for i, x in enumerate(v):
            acc[i] += w * x
    if acc is not None and any(acc):
        taste_vector = _normalize_unit(acc)

    # canonicalize keys to lowercase (merge case variants) so content-derived and
    # demographic-derived affinities line up regardless of taxonomy casing.
    folded: dict[str, float] = {}
    for k, v in tag_affinity.items():
        folded[k.lower()] = folded.get(k.lower(), 0.0) + v
    tag_affinity = folded

    # normalize tag affinity to [0, 1] by max
    if tag_affinity:
        mx = max(tag_affinity.values())
        if mx > 0:
            tag_affinity = {k: v / mx for k, v in tag_affinity.items()}

    # same fold + normalize for aversion (negatively-engaged themes)
    folded_av: dict[str, float] = {}
    for k, v in tag_aversion.items():
        folded_av[k.lower()] = folded_av.get(k.lower(), 0.0) + v
    tag_aversion = folded_av
    if tag_aversion:
        mxa = max(tag_aversion.values())
        if mxa > 0:
            tag_aversion = {k: v / mxa for k, v in tag_aversion.items()}

    # sequence: order viewed content by most-recent interaction first
    ordered = sorted(aggs.items(), key=lambda kv: (kv[1].last_ts or now), reverse=True)
    recent_views = [cid for cid, _ in ordered]
    recency_vector = vectors.get(recent_views[0]) if recent_views else None

    return UserSignals(
        user_id=user_id,
        positives=positives,
        negatives=negatives,
        viewed=sorted(aggs.keys()),          # full view history (any outcome) for dedup
        recent_views=recent_views,           # sequence awareness
        tag_affinity=tag_affinity,
        tag_aversion=tag_aversion,
        taste_vector=taste_vector,
        recency_vector=recency_vector,
        demographics=demographics or {},
    )
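
Two steps of the builder, the taste vector and the tag-affinity fold, can be checked in isolation. This is a sketch: `normalize_unit` assumes that the private `_normalize_unit` helper performs L2 normalization, and the tag keys are made-up examples.

```python
import math
from typing import Optional


def normalize_unit(v: list[float]) -> list[float]:
    """Assumed behavior of _normalize_unit: scale to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def taste_vector(
    positives: dict[str, float], vectors: dict[str, list[float]]
) -> Optional[list[float]]:
    """Weighted centroid of positively engaged content vectors."""
    acc: Optional[list[float]] = None
    for cid, w in positives.items():
        v = vectors.get(cid)
        if not v:
            continue  # content without a vector contributes nothing
        if acc is None:
            acc = [0.0] * len(v)
        for i, x in enumerate(v):
            acc[i] += w * x
    if acc is None or not any(acc):
        return None
    return normalize_unit(acc)


def fold_and_normalize(affinity: dict[str, float]) -> dict[str, float]:
    """Merge case variants of a key, then divide by the maximum value."""
    folded: dict[str, float] = {}
    for k, v in affinity.items():
        folded[k.lower()] = folded.get(k.lower(), 0.0) + v
    mx = max(folded.values(), default=0.0)
    return {k: v / mx for k, v in folded.items()} if mx > 0 else folded


taste = taste_vector(
    {"a": 3.0, "b": 4.0, "c": 9.0},           # "c" has no vector and is skipped
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
)
no_taste = taste_vector({"c": 1.0}, {})
affinity = fold_and_normalize(
    {"Theme_What.Art": 1.0, "theme_what.art": 1.0, "person_who.adult": 0.5}
)
```

Here the weighted sum is [3, 4], so the unit vector is [0.6, 0.8]. The two case variants of the art tag merge to 2.0 and become the maximum, leaving the other key at 0.25.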

Ranking¶

scorers¶

scorers ¶

Pure scorers. CONTRACT: every scorer returns a value in [0, 1].

That contract is what makes the weighted sum in fusion valid without rescaling.
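
A small sketch of why the contract matters for fusion. The scorer names, the weights, and the `fuse` function are hypothetical; the sketch additionally assumes non-negative weights that sum to 1, which this page does not state.

```python
def fuse(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of scorer outputs; rejects any score outside [0, 1]."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"scorer {name!r} broke the [0, 1] contract: {value}")
    return sum(weights[name] * value for name, value in scores.items())


weights = {"semantic": 0.5, "tag": 0.3, "recency": 0.2}  # assumed, sums to 1
fused = fuse({"semantic": 0.8, "tag": 0.5, "recency": 0.2}, weights)
upper = fuse({"semantic": 1.0, "tag": 1.0, "recency": 1.0}, weights)
```

With every input in [0, 1] and weights summing to 1, the fused score also stays in [0, 1], so each weight can be read directly as that scorer's share of the final score. A scorer that returned, say, raw cosine similarity in [-1, 1] would break that reading.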