Simulare_

Test algorithms with simulated users before deployment

The Problem

ML changes ship blind

Recommendation models and ranking changes go live with zero behavioral coverage. A/B tests take weeks. By then, the damage is done.

Tests don't reflect real usage

They validate output metrics, not behavior. The metrics may look good while real user journeys quietly degrade.

User behavior is sequential

Search, recommendations, and agents evolve over sessions through queries, clicks, and feedback loops. Static evaluation misses how systems actually behave over time.

The Solution

Digital twins that test your ML product

01

Summarize key trajectories

Distill real user sessions into representative behavioral patterns that capture how people search, click, and adapt.
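As a rough illustration of what distilling sessions into patterns could look like, the sketch below counts the most frequent action sequences in a small session log. The log format, the action names, and `top_trajectories` are hypothetical and are not part of any Simulare API.

```python
from collections import Counter

# Hypothetical session log: each session is an ordered list of user actions.
sessions = [
    ["search", "click", "purchase"],
    ["search", "click", "back", "click", "purchase"],
    ["search", "click", "purchase"],
    ["search", "refine", "click", "abandon"],
    ["search", "refine", "click", "abandon"],
    ["search", "click", "purchase"],
]

def top_trajectories(sessions, k=2):
    """Return the k most common action sequences with their share of sessions."""
    counts = Counter(tuple(s) for s in sessions)
    total = len(sessions)
    return [(list(seq), n / total) for seq, n in counts.most_common(k)]

patterns = top_trajectories(sessions)
for sequence, share in patterns:
    print(" > ".join(sequence), f"{share:.0%}")
```

Exact-sequence counting is the simplest possible summary; a real system would more likely cluster similar sessions or fit a behavior model rather than match sequences verbatim.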

02

Evaluate full user journeys

Test how experiences evolve across sessions. Surface degradation in ranking quality, navigation paths, and feedback loops before release.
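To make the idea of journey-level evaluation concrete, here is a minimal sketch of replaying a multi-query session against two rankers with a toy simulated user. The user model (a fixed patience budget), both rankers, and all names are assumptions made for illustration only.

```python
def replay_session(ranker, queries, patience=3):
    """Replay a multi-query session with a simple simulated user.

    The user scans each result list top-down, clicks the first relevant
    item, and loses one unit of patience per irrelevant item skipped.
    The session ends in abandonment once patience runs out.
    """
    clicks = 0
    for query in queries:
        for item in ranker(query):
            if item["relevant"]:
                clicks += 1
                break
            patience -= 1
            if patience == 0:
                return {"clicks": clicks, "abandoned": True}
    return {"clicks": clicks, "abandoned": False}

# Hypothetical rankers: the degraded one places the relevant item one slot lower.
def good_ranker(query):
    return [{"id": f"{query}-a", "relevant": True},
            {"id": f"{query}-b", "relevant": False}]

def degraded_ranker(query):
    return [{"id": f"{query}-b", "relevant": False},
            {"id": f"{query}-a", "relevant": True}]

queries = ["boots", "rain boots", "waterproof boots", "boots size 9"]
good = replay_session(good_ranker, queries)
degraded = replay_session(degraded_ranker, queries)
print(good, degraded)
```

Each individual result list from the degraded ranker is only one position worse, yet the simulated user abandons partway through the session. That kind of accumulated effect is what per-query evaluation does not show.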

03

Understand impact before release

See how changes influence engagement and outcomes without exposing real users.

04

Minimize risky exposure in A/B tests

Enter experiments with stronger confidence and fewer unknowns.

Built For

Teams shipping AI/ML products

Product Leaders

  • See user impact before release
  • Run safer, more informed experiments
  • Make decisions with clearer signals

ML Engineers

  • Validate changes before release
  • Integrate with existing systems and workflows
  • Run evaluations as part of the development cycle

What Offline Eval Misses

Where standard metrics fall short

Offline metrics like NDCG do capture degradation in ranking quality. What they miss are the multi-step behavioral failures that only surface when a user simulator replays full sessions (Jannach et al. 2019; Castells 2022, survey of RS evaluation).

Failure mode                        | NDCG detects? | Simulare detects?
------------------------------------|---------------|------------------
Item ranking quality degradation    | YES           | YES
Filter bubble / diversity collapse  | NO            | YES
Price shock (user abandonment)      | NO            | YES
Session-cascade click fatigue       | NO            | YES
Novelty / serendipity degradation   | NO            | Partial
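
The diversity-collapse row can be illustrated with a toy example: two result lists with identical relevance grades score the same NDCG, even though one of them shows a single category. The items, grades, and the `distinct_categories` helper are made up for this sketch; the NDCG here uses the common linear-gain, log2-discount form.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def distinct_categories(items):
    """Number of different categories shown in a result list."""
    return len({category for category, _ in items})

# Each item is (category, relevance grade). Illustrative data only.
baseline = [("shoes", 3), ("jackets", 2), ("bags", 1)]
candidate = [("shoes", 3), ("shoes", 2), ("shoes", 1)]

ndcg_baseline = ndcg([rel for _, rel in baseline])
ndcg_candidate = ndcg([rel for _, rel in candidate])
print(ndcg_baseline, ndcg_candidate)
print(distinct_categories(baseline), distinct_categories(candidate))
```

Because NDCG only looks at relevance grades and positions, it cannot tell these two lists apart; a diversity measure, or a simulated user who tires of seeing the same category, can.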

Landscape

How Simulare fits the evaluation landscape

Session-level behavioral replay for pre-deployment ML evaluation has academic roots in Li et al. 2011 and Gilotte et al. 2018. The gap is drop-in production integration: no RL environment setup and no internal research team required. Simulare's differentiation is product and integration, not algorithmic novelty.

Tool                          | Approach                                         | Gap vs Simulare
------------------------------|--------------------------------------------------|----------------
RecSim (Google 2019)          | RL user environment                              | Requires RL setup; no CI/CD integration; no production API
RecList (Tagliabue 2022)      | Behavioral black-box tests                       | No user simulator (heuristic tests only); no session-level replay
Virtual-Taobao (Alibaba 2019) | GAN user simulator on Taobao logs                | Alibaba-internal only; not productized; no open API
SimUSER (NTT 2025)            | LLM-based session simulator                      | High latency (LLM calls per decision); no session-fidelity back-test
Arize / WhyLabs / Evidently   | Post-deployment monitoring                       | Post-hoc only (reacts after users are exposed); no pre-deployment gate
Simulare                      | Session-replay + fitted behavior model + CI gate | Drop-in integration; validated against held-out conversion outcomes; no RL required
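
A CI gate of the kind named in the last row could, in its simplest form, compare a simulated outcome metric between the baseline and the candidate model and block the release on a large drop. The function name, the 5% threshold, and the conversion figures below are assumptions for illustration, not Simulare's actual interface or results.

```python
def passes_gate(baseline_conversion, candidate_conversion, max_relative_drop=0.05):
    """Return True if the candidate's simulated conversion rate has not
    dropped by more than max_relative_drop relative to the baseline."""
    if baseline_conversion <= 0:
        raise ValueError("baseline conversion must be positive")
    relative_drop = (baseline_conversion - candidate_conversion) / baseline_conversion
    return relative_drop <= max_relative_drop

# Illustrative numbers only: simulated conversion rates from two replay runs.
small_regression = passes_gate(0.040, 0.039)  # 2.5% relative drop
large_regression = passes_gate(0.040, 0.030)  # 25% relative drop
print(small_regression, large_regression)
```

In a pipeline, the boolean would typically be turned into the job's exit status so that a failing gate stops the deployment step.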