Behavioral evaluations
How should we evaluate agent behavior beyond pure task success? We are exploring comprehensive behavioral test suites alongside causal or mechanistic explanations to understand exactly why agents behave as they do.
Advancing the scientific study of how AI agents do and should behave.
The dominant paradigm in AI agent research is capability-centric. Agents are evaluated primarily on what they achieve, and progress is measured by closing the gap between observed and desired performance. Though this has driven remarkable advances, it is an incomplete account of agency that sidesteps a critical question: how do agents achieve what they achieve? A system that achieves a goal through opaque, brittle, or socially harmful processes raises concerns that traditional test metrics do not capture.
We propose a workshop centered on this complementary question, which we refer to as the study of agent behavior. This encompasses the full range of both observable and latent processes governing an agent's actions, including its decision-making strategies, its interaction patterns, its internal representations, and its responses to interventions.
Covering rigorous studies of behavior generation, social simulation, and complex multi-agent dynamics where agentic behavior is inherently relational and interactive.
Focusing on actionable methods, such as post-training, environment design, and structural constraints, for reliably steering agents toward desired behaviors.
How do behavioral properties intersect with accountability? We investigate how to make this entire behavioral perspective deeply actionable for both policymakers and deployment practitioners.
Developing the fundamental architectures and underlying models intrinsically designed to capture, generate, or simulate diverse AI behaviors.
However, we strongly welcome any submission that advances the understanding of AI agents through a behavioral lens, including bold work that bridges or directly challenges these established categories.
We are excited to welcome leaders in this space from across disciplines like behavioral economics, human-AI interaction, computing, and policy.
We invite contributions that advance the scientific study of agent behavior across a wide range of topics and methodologies.
All submissions should be made via OpenReview.
Submission deadline
June 23, 2026 (AoE)
We solicit non-archival papers (4–9 pages long) formatted in the standard COLM template.
Submissions undergo double-blind peer review. Preprints and concurrent submissions are explicitly encouraged.
We seek proposals for new benchmarks to evaluate frontier AI agent behavior. Submissions should use this LaTeX template and be 1–2 pages long. The template also contains more information about the expected format and content.
NOTE: this track is for benchmark proposals only (no implementation or results needed at this stage).
For selected proposals, we will provide credits for running benchmarks and a harness for building them, and we will support the creators in implementing an open-source version.
We invite creators of accepted benchmarks to collaborate towards a large-scale agent behavior evaluation suite and paper.
🏆 We plan to give one Best Paper Award along with several oral presentation slots.

MIT

Stanford University

Georgia Institute of Technology

UC Berkeley

MIT

ETH Zürich

Amazon

Amazon

MIT

Dartmouth College