
Totally agree. We've found that a lot of "agent failures" trace back to assumptions, bad agent decisions, or bloat buried in the context: stuff that makes perfect sense to the dev who built it while following the happy path, but falls apart easily in real-world scenarios.

We've been working on a way to test this more systematically by simulating full conversations with agents and surfacing the exact point where things go off the rails. Kind of like unit tests, but for context, behavior, and other AI jank.
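
To make that concrete, here's a minimal, self-contained sketch of the idea (toy agent and made-up names, not the library's actual API): replay a scripted conversation and report the first turn where the agent's reply diverges from what you expected.

    # Toy stand-in for a real agent; a real one would call an LLM.
    def toy_agent(history: list[str], user_msg: str) -> str:
        if "refund" in user_msg:
            return "issuing refund"
        return "how can I help?"

    def run_scenario(agent, turns: list[tuple[str, str]]) -> None:
        """Replay (user_msg, expected_substring) turns; fail at the first bad one."""
        history: list[str] = []
        for i, (user_msg, expected) in enumerate(turns, start=1):
            reply = agent(history, user_msg)
            assert expected in reply, (
                f"went off the rails at turn {i}: "
                f"user={user_msg!r} reply={reply!r}"
            )
            history += [user_msg, reply]

    run_scenario(toy_agent, [
        ("hi", "help"),
        ("I want a refund", "refund"),
    ])

The actual library does this with a simulated user driving multi-turn conversations, but the core loop is the same: script the scenario, run the agent, and pinpoint where it diverges.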

Full disclosure: I work at the company building this, but the core library is open source and free to use: https://github.com/langwatch/scenario


