Signals
recsys/signals/ turns raw events into the user model. Both modules are pure (no
IO) and fully covered by unit and property tests.
engagement.py: continuous strength
Three public functions, no class:
| Function | Signature (abridged) | Does |
|---|---|---|
| estimate_reading_time | (word_count, has_image, cfg) -> float | Seconds to consume content. |
| engagement_strength | (*, dwell_seconds, est_reading_time, end_reason, visits, survey_rating, cfg) -> float | Continuous blend in ~[-1,1]. |
| classify_outcome | (strength, cfg) -> Outcome | Threshold → positive / negative / neutral. |
dwell_ratio = min(dwell / est_reading_time, dwell_cap) / dwell_cap # [0,1]
completion = {next_button:1.0, link:0.6, close_button:0.0, abandon:-0.5}
revisit = 1 - exp(-visits / 2)
survey = (rating - 3) / 2 # 1..5 → [-1,1]
strength = wd·dwell_ratio + wc·completion + wr·revisit + ws·survey
Weights wd, wc, wr, ws come from RecConfig.engagement. This replaces the legacy binary
dwell >= estimate check with a graded signal: partial reads, abandons, and revisits all move
the needle.
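
The blend above can be sketched in a few lines of Python. This is an illustration rather than the module's actual code: the EngagementWeights class and its default values are hypothetical stand-ins for RecConfig.engagement, and treating a missing survey rating as a zero contribution is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngagementWeights:
    """Hypothetical stand-in for RecConfig.engagement; values are illustrative."""
    wd: float = 0.4          # dwell weight
    wc: float = 0.3          # completion weight
    wr: float = 0.1          # revisit weight
    ws: float = 0.2          # survey weight
    dwell_cap: float = 2.0   # cap on dwell / est_reading_time


# Completion scores per end reason, as listed in the formulas above.
COMPLETION = {"next_button": 1.0, "link": 0.6, "close_button": 0.0, "abandon": -0.5}


def engagement_strength(*, dwell_seconds: float, est_reading_time: float,
                        end_reason: str, visits: int,
                        survey_rating: Optional[int],
                        cfg: EngagementWeights) -> float:
    # Assumes est_reading_time > 0; the ratio is capped, then scaled to [0, 1].
    dwell_ratio = min(dwell_seconds / est_reading_time, cfg.dwell_cap) / cfg.dwell_cap
    completion = COMPLETION[end_reason]
    revisit = 1.0 - math.exp(-visits / 2)
    # Assumption: no survey answer contributes nothing to the blend.
    survey = 0.0 if survey_rating is None else (survey_rating - 3) / 2
    return (cfg.wd * dwell_ratio + cfg.wc * completion
            + cfg.wr * revisit + cfg.ws * survey)
```

Because every term is a plain function of its inputs, properties such as dwell monotonicity can be checked by comparing two calls that differ only in dwell_seconds.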
Property tested
Dwell monotonicity (more dwell ⇒ not-lower strength), abandon ⇒ negative contribution,
survey extremes map to ±1.
signal_builder.py: fold into UserSignals
flowchart TD
ev["Sequence[InteractionEvent]"] --> agg["aggregate_views()"]
agg --> va["dict[content_id → ViewAggregate]<br/>(dwell paired, visits, end_reason,<br/>last_ts, survey_rating)"]
va --> loop["per content"]
loop --> est["estimate_reading_time"]
loop --> str["engagement_strength"]
str --> cls["classify_outcome"]
str --> dec["recency decay<br/>w = strength · 0.5^(age/half_life)"]
cls -->|positive| pos["positives[cid] = w"]
cls -->|negative| neg["negatives[cid] = w"]
imp["impressions never viewed"] --> sn["soft negatives<br/>w = soft_neg · decay"]
str --> taff["tag_affinity += tag.weight · w"]
pos --> tv["taste_vector =<br/>L2-norm centroid of positive vecs"]
demo["demographics"] --> daff["person_who:* affinity"]
pos & neg & sn & taff & tv & daff --> us["UserSignals"]
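
The two weight formulas in the flowchart are small enough to write out. The sketch below assumes age is measured in days (to match half_life_days) and that a timestamp slightly in the future is clamped to age zero; the function names are illustrative, not the module's API.

```python
def recency_decay(age_days: float, half_life_days: float) -> float:
    """Exponential decay: the weight halves every half_life_days."""
    # Assumption: negative ages (clock skew) are treated as "just now".
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def decayed_weight(strength: float, age_days: float, half_life_days: float) -> float:
    """w = strength * 0.5^(age / half_life), used for viewed content."""
    return strength * recency_decay(age_days, half_life_days)


def soft_negative(soft_neg: float, age_days: float, half_life_days: float) -> float:
    """w = soft_neg * decay, used for impressions that were never viewed."""
    return soft_neg * recency_decay(age_days, half_life_days)
```

For example, with a 14-day half-life, an engagement of strength 0.8 from 28 days ago has been halved twice and contributes 0.2.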
ViewAggregate
All views of one content item folded together: content_id, dwell_seconds, visits,
end_reason, last_ts, survey_rating. aggregate_views handles both path-B async
(separate START/END events) and sources that already carry explicit dwell_seconds.
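
One way the folding could work is sketched below. The InteractionEvent shape here (a kind field with "start", "end", and "view" values, and float timestamps in seconds) is hypothetical, since the real event schema is not shown on this page; only the ViewAggregate field names come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class InteractionEvent:
    """Hypothetical event shape, for illustration only."""
    content_id: str
    kind: str                        # "start", "end", or "view" (assumed values)
    ts: float                        # seconds since epoch
    dwell_seconds: Optional[float] = None
    end_reason: Optional[str] = None
    survey_rating: Optional[int] = None


@dataclass
class ViewAggregate:
    content_id: str
    dwell_seconds: float = 0.0
    visits: int = 0
    end_reason: Optional[str] = None
    last_ts: float = 0.0
    survey_rating: Optional[int] = None


def aggregate_views(events: Sequence[InteractionEvent]) -> Dict[str, ViewAggregate]:
    aggs: Dict[str, ViewAggregate] = {}
    open_start: Dict[str, float] = {}   # content_id -> ts of an unpaired START
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.kind not in ("start", "end", "view"):
            continue                     # other event kinds are not views
        agg = aggs.setdefault(ev.content_id, ViewAggregate(ev.content_id))
        agg.last_ts = max(agg.last_ts, ev.ts)
        if ev.survey_rating is not None:
            agg.survey_rating = ev.survey_rating
        if ev.kind == "start":
            open_start[ev.content_id] = ev.ts
            agg.visits += 1
        elif ev.kind == "end":
            start_ts = open_start.pop(ev.content_id, None)
            if start_ts is not None:     # pair END with the latest open START
                agg.dwell_seconds += ev.ts - start_ts
            agg.end_reason = ev.end_reason
        else:                            # "view": dwell is already explicit
            agg.visits += 1
            agg.dwell_seconds += ev.dwell_seconds or 0.0
            agg.end_reason = ev.end_reason
    return aggs
```

In this sketch an END without a matching START updates end_reason but adds no dwell, which is one possible policy for dropped events.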
build_user_signals
build_user_signals(*, user_id, events, contents, vectors, now, cfg, demographics=None)
-> UserSignals
Key behaviors:
- Recency decay with half_life_days: recent engagement weighs more.
- Soft negatives: content impressed (shown) but never viewed becomes a weak negative, scaled by soft_negative_weight × decay. Teaches the model what the user skipped.
- Tag affinity: accumulates facet:label → weight from engaged content's tags, scaled by engagement weight.
- Taste vector: L2-normalized centroid of positively-engaged content vectors; the query vector for semantic recall.
- Demographic affinity: survey/demographics map to person_who:* facets for cold start (e.g. age bucket → person_who:child).
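
The tag-affinity and taste-vector behaviors can be illustrated with plain Python lists. This is a sketch under assumptions: tags are represented as hypothetical (facet, label, weight) tuples, the centroid is an unweighted mean, and an empty positive set yields an empty vector.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Mapping, Sequence, Tuple

Tag = Tuple[str, str, float]   # (facet, label, weight): a hypothetical shape


def accumulate_tag_affinity(affinity: DefaultDict[str, float],
                            tags: Iterable[Tag], w: float) -> None:
    """tag_affinity["facet:label"] += tag.weight * w, for each tag of one content."""
    for facet, label, weight in tags:
        affinity[f"{facet}:{label}"] += weight * w


def taste_vector(positive_ids: Iterable[str],
                 vectors: Mapping[str, Sequence[float]]) -> List[float]:
    """L2-normalized centroid of the vectors of positively engaged content."""
    vecs = [vectors[cid] for cid in positive_ids if cid in vectors]
    if not vecs:
        return []                                   # assumption: no positives, no vector
    dim = len(vecs[0])
    centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in centroid))
    # A zero centroid cannot be normalized; return it unchanged.
    return centroid if norm == 0.0 else [x / norm for x in centroid]
```

Normalizing the centroid means a dot product against unit-length content vectors is a cosine similarity, which is a common reason to normalize a query vector for recall.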
This single function is the brain of the user model. The online updater rebuilds from the event buffer on each refresh, so there is exactly one definition of "the user model"; see Orchestration and the serving model.