Simulation
Simulation lets you replay existing tasks with different configurations. You can test new models, adjust prompts, or modify settings without affecting production traffic.
Why Simulation Matters
Changing an agent in production is risky. A new model might perform better on some tasks but worse on others. A prompt tweak might fix one issue while introducing another. Without a way to test changes safely, you are forced to guess.
Simulation solves this by letting you re-run real tasks from your trace history. You see exactly how the new configuration would have handled actual user requests, complete with reward scores for comparison. This lets you make data-driven decisions before deploying changes.
What You Can Simulate
Simulation supports several types of configuration changes:
Model Changes
- Switch from GPT-4 to Claude or Gemini
- Test a newer version of the same model
- Compare performance across model families
Prompt Changes
- Modify the system prompt
- Add or remove instructions
- Test different prompt structures
Parameter Changes
- Adjust temperature or top-p settings
- Change max token limits
- Modify other model parameters
Tool Changes
- Add or remove available tools
- Modify tool descriptions
- Test tool parameter changes
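One way to picture these change types is as overrides applied to a copy of the production configuration. The sketch below is illustrative only: `AgentConfig`, its fields, and `apply_overrides` are hypothetical names, not part of Marlo's API, and the model names and default values are placeholders.

```python
from dataclasses import dataclass, replace


# Hypothetical configuration shape; the field names are assumptions for illustration.
@dataclass(frozen=True)
class AgentConfig:
    model: str
    system_prompt: str
    temperature: float = 0.7
    max_tokens: int = 1024
    tools: tuple[str, ...] = ()


def apply_overrides(base: AgentConfig, **overrides) -> AgentConfig:
    """Return a copy of `base` with only the named fields changed."""
    # dataclasses.replace raises TypeError for unknown field names,
    # so a misspelled override fails loudly instead of being ignored.
    return replace(base, **overrides)


production = AgentConfig(
    model="gpt-4",
    system_prompt="You are a support agent.",
    tools=("search_orders",),
)

# Model change plus parameter change; prompt and tools are left as in production.
candidate = apply_overrides(production, model="claude", temperature=0.2)
```

Because the configuration is frozen and copied, the production settings are never mutated, which mirrors the idea that a simulation does not affect production.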
How Simulation Works
Simulation replays a task using the original inputs but with your new configuration:
- Select a Task: Choose a task from your trace history that you want to replay.
- Configure Changes: Specify what you want to change: model, prompt, parameters, or tools.
- Run Simulation: Marlo replays the task using the original user input but with your new configuration.
- Compare Results: View the simulated output alongside the original, with reward scores for both.
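The steps above can be sketched as a small replay function. This is a conceptual model, not Marlo's implementation: `Task`, `SimulationResult`, `simulate`, and the stand-in `fake_agent` and `fake_score` functions are all hypothetical, and a real agent and reward function would take their place.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Task:
    """A recorded task from trace history (hypothetical shape)."""
    task_id: str
    user_input: str
    original_output: str
    original_reward: float


@dataclass(frozen=True)
class SimulationResult:
    task_id: str
    original_output: str
    original_reward: float
    simulated_output: str
    simulated_reward: float


def simulate(
    task: Task,
    config: Mapping[str, object],
    run_agent: Callable[[str, Mapping[str, object]], str],
    score: Callable[[str, str], float],
) -> SimulationResult:
    """Replay the original user input under a new configuration and score it."""
    simulated_output = run_agent(task.user_input, config)
    simulated_reward = score(task.user_input, simulated_output)
    return SimulationResult(
        task_id=task.task_id,
        original_output=task.original_output,
        original_reward=task.original_reward,
        simulated_output=simulated_output,
        simulated_reward=simulated_reward,
    )


# Stand-ins for a real agent and reward function, used only to make the sketch runnable.
def fake_agent(user_input: str, config: Mapping[str, object]) -> str:
    return f"[{config['model']}] reply to: {user_input}"


def fake_score(user_input: str, output: str) -> float:
    return 1.0 if user_input in output else 0.0


task = Task("task-1", "Where is my order?", "Please check your email.", 0.4)
result = simulate(task, {"model": "candidate-model"}, fake_agent, fake_score)
```

The recorded output and reward are carried through unchanged, so the result holds everything needed for a side-by-side comparison.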
Using Simulation
Access Simulation from the Marlo dashboard:
- Navigate to a task in your trace history.
- Click the Simulate button.
- Configure your changes in the simulation panel.
- Run the simulation and review the results.
You can run multiple simulations on the same task to compare several configurations at once.
What You Get
Each simulation produces:
- Simulated Output: The response your agent would have given with the new configuration.
- Simulated Reward: A reward score for the simulated output, computed using the same criteria as production.
- Comparison View: Side-by-side display of original and simulated outputs with their respective scores.
- Cost Estimate: Token usage and estimated cost for the simulated run.
Batch Simulation
For larger evaluations, you can run simulations across multiple tasks at once:
- Select a set of tasks from your trace history (e.g., all tasks from the past week, or all tasks with low reward scores).
- Configure your changes.
- Run the batch simulation.
- Review aggregate results showing how many tasks improved, degraded, or stayed the same.
Batch simulation helps you understand the overall impact of a change before rolling it out to production.
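As a rough illustration of the aggregate view, the sketch below classifies each task by comparing its original and simulated reward. The `tolerance` threshold, the function names, and the sample scores are assumptions made for this example; how Marlo actually decides that a task "stayed the same" is not specified here.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(original: float, simulated: float, tolerance: float = 0.01) -> str:
    """Label one task by the change in reward, ignoring differences within `tolerance`."""
    delta = simulated - original
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "degraded"
    return "unchanged"


def summarize(pairs: Iterable[Tuple[float, float]], tolerance: float = 0.01) -> Counter:
    """Count improved, degraded, and unchanged tasks across a batch."""
    return Counter(classify(orig, sim, tolerance) for orig, sim in pairs)


# Made-up (original_reward, simulated_reward) pairs for four tasks.
pairs = [(0.60, 0.85), (0.90, 0.70), (0.75, 0.75), (0.40, 0.55)]
summary = summarize(pairs)
# summary counts: improved=2, degraded=1, unchanged=1
```

A tolerance is used because small reward differences can come from run-to-run variation rather than from the configuration change itself.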