Measures performance in agentic workflows, focusing on behaviors like tool use, planning, autonomy, and complex problem solving.
Evaluates models' ability to solve programming problems, including those that require scientific or research domain knowledge.