Test algorithms with simulated users before deployment
The Problem
Recommendation models and ranking changes go live with zero behavioral coverage. A/B tests take weeks. By then, the damage is done.
Existing checks validate output metrics, not behavior. The metrics may look good while real user journeys quietly degrade.
Search, recommendations, and agents evolve over sessions through queries, clicks, and feedback loops. Static evaluation misses how systems actually behave over time.
The Solution
Distill real user sessions into representative behavioral patterns that capture how people search, click, and adapt.
Test how experiences evolve across sessions. Surface degradation in ranking quality, navigation paths, and feedback loops before release.
See how changes influence engagement and outcomes without exposing real users.
Enter experiments with stronger confidence and fewer unknowns.
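As a rough illustration of distilling logged sessions into a behavioral pattern and replaying it against a ranking change, the sketch below estimates per-position click-through from a handful of sessions and uses it to score two orderings of the same items. It is a toy under stated assumptions, not Simulare's method: the log format, the names `fit_position_ctr` and `expected_clicks`, the appeal scores, and the sample data are all hypothetical. Raw per-position click-through also mixes position bias with item appeal, which a fitted behavior model would need to separate.

```python
from collections import defaultdict

# Hypothetical log format: each session is a list of (position, clicked) pairs.
logged_sessions = [
    [(0, True), (1, False), (2, False)],
    [(0, True), (1, True), (2, False)],
    [(0, False), (1, False), (2, False)],
    [(0, True), (1, False), (2, True)],
]


def fit_position_ctr(sessions):
    """Estimate click-through rate per list position from logged sessions."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for session in sessions:
        for position, clicked in session:
            shown[position] += 1
            clicks[position] += int(clicked)
    return {position: clicks[position] / shown[position] for position in shown}


def expected_clicks(item_appeal, position_ctr):
    """Expected clicks for one ranked list.

    item_appeal holds an assumed appeal score in [0, 1] for each item,
    in ranked order. Positions never seen in the logs contribute zero.
    """
    return sum(
        position_ctr.get(position, 0.0) * appeal
        for position, appeal in enumerate(item_appeal)
    )


position_ctr = fit_position_ctr(logged_sessions)

# Same three items, two orderings (appeal scores are made up).
baseline_ranking = [0.9, 0.5, 0.1]   # most appealing item first
candidate_ranking = [0.1, 0.5, 0.9]  # most appealing item last

baseline_clicks = expected_clicks(baseline_ranking, position_ctr)
candidate_clicks = expected_clicks(candidate_ranking, position_ctr)
```

Here the candidate ordering pushes the most appealing item to the least-clicked position, so its expected clicks fall below the baseline before any real user sees it.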
Built For
Product Leaders
ML Engineers
Offline metrics like NDCG do capture ranking quality degradation. What they don't capture are the multi-step behavioral failures that only surface when a user simulator replays full sessions (Jannach et al. 2019; Castells 2022 survey of RS evaluation).
| Failure mode | NDCG detects? | Simulare detects? |
|---|---|---|
| Item ranking quality degradation | YES | YES |
| Filter bubble / diversity collapse | NO | YES |
| Price shock (user abandonment) | NO | YES |
| Session-cascade click fatigue | NO | YES |
| Novelty / serendipity degradation | NO | Partial |
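The diversity row in the table above can be made concrete with a small sketch. Two slates carry identical relevance grades, so NDCG cannot tell them apart, yet one has collapsed to a single category. The slates, the category labels, and the choice of Shannon entropy as the diversity measure are assumptions for illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def category_entropy(categories):
    """Shannon entropy (in bits) of the category mix in a slate."""
    counts = {}
    for category in categories:
        counts[category] = counts.get(category, 0) + 1
    total = len(categories)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical slates: (relevance grade, category) per ranked item.
baseline = [(3, "shoes"), (2, "jackets"), (2, "bags"), (1, "hats")]
candidate = [(3, "shoes"), (2, "shoes"), (2, "shoes"), (1, "shoes")]

baseline_ndcg = ndcg([rel for rel, _ in baseline])
candidate_ndcg = ndcg([rel for rel, _ in candidate])

baseline_diversity = category_entropy([cat for _, cat in baseline])
candidate_diversity = category_entropy([cat for _, cat in candidate])
```

Both slates score the same NDCG, while category entropy drops from 2 bits to 0. A per-slate metric like this only flags the collapse once; the session-level concern is that repeated exposure to such slates compounds over time.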
Session-level behavioral replay for pre-deployment ML evaluation has academic roots in Li et al. 2011 and Gilotte et al. 2018. The gap is drop-in production integration: no RL environment setup and no internal research team required. The differentiation is product and integration, not algorithmic novelty.
| Tool | Approach | Gap vs Simulare |
|---|---|---|
| RecSim (Google 2019) | RL user environment | Requires RL setup; no CI/CD integration; no production API |
| RecList (Tagliabue 2022) | Behavioral black-box tests | No user simulator - heuristic tests only; no session-level replay |
| Virtual-Taobao (Alibaba 2019) | GAN user simulator on Taobao logs | Alibaba-internal only; not productized; no open API |
| SimUSER (NTT 2025) | LLM-based session simulator | High latency (LLM calls per decision); no session-fidelity back-test |
| Arize / WhyLabs / Evidently | Post-deployment monitoring | Post-hoc only - reacts after users are exposed; no pre-deployment gate |
| Simulare | Session-replay + fitted behavior model + CI gate | Drop-in integration; validated against held-out conversion outcomes; no RL required |
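To show what a pre-deployment CI gate of the kind named in the last row might look like, here is a minimal sketch that compares simulated session metrics for a baseline and a candidate model and reports regressions beyond a tolerance. It does not use or describe Simulare's API: `SessionMetrics`, `gate_failures`, the metric names, the 2% tolerance, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionMetrics:
    """Aggregates from a simulated session run (hypothetical fields)."""
    conversion_rate: float    # higher is better
    abandonment_rate: float   # lower is better
    category_entropy: float   # higher is better


# (metric name, True if higher values are better)
CHECKS = [
    ("conversion_rate", True),
    ("abandonment_rate", False),
    ("category_entropy", True),
]


def gate_failures(baseline, candidate, tolerance=0.02):
    """List metrics where the candidate regresses by more than `tolerance`
    (relative to the baseline value)."""
    failures = []
    for name, higher_is_better in CHECKS:
        before = getattr(baseline, name)
        after = getattr(candidate, name)
        if higher_is_better:
            regressed = after < before * (1 - tolerance)
        else:
            regressed = after > before * (1 + tolerance)
        if regressed:
            failures.append(f"{name}: {before:.3f} -> {after:.3f}")
    return failures


# Made-up numbers: conversion holds steady, but abandonment and diversity regress.
baseline = SessionMetrics(conversion_rate=0.050, abandonment_rate=0.300, category_entropy=1.80)
candidate = SessionMetrics(conversion_rate=0.051, abandonment_rate=0.360, category_entropy=1.20)

failures = gate_failures(baseline, candidate)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit()
```

In this example the headline conversion metric improves slightly, yet the gate still fails on abandonment and diversity, which is the kind of behavioral regression a metrics-only check would let through.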