May 20, 2026
Cohere launches open-weights model Command A+, which achieves 37 on the Artificial Analysis Intelligence Index
The release of Command A+ places Cohere in line with Claude 4.5 Haiku on the Intelligence Index, and just above NVIDIA Nemotron 3 Super and Gemini 3.1 Flash-Lite.
Key Takeaways:
➤ Command A+ ranks first on AA-Omniscience Non-Hallucination at 86%, ~3 percentage points ahead of the next-best model. Its AA-Omniscience Accuracy is 9%, so the headline AA-Omniscience score lands at -4, demonstrating a similar archetype to Claude 4.5 Haiku, where the model knows its limits
➤ On Cohere’s API, Command A+ (~281 output tokens per second) is faster than several comparable open-weights and small to mid-sized proprietary models (e.g., GPT-5.4 nano, Claude 4.5 Haiku, and Grok 4.3), but still slower than Gemini 3.1 Flash-Lite Preview, which outputs 304 tokens per second
➤ Command A+ trails its peer set on scientific reasoning (HLE ~11%, GPQA Diamond ~76%) and on coding (Terminal-Bench Hard ~25%, SciCode ~38%), indicating that its gaps are concentrated in the hardest science and agentic coding benchmarks
➤ It supports visual reasoning and scores 63% on MMMU-Pro (between Claude 4.5 Haiku at 59% and GPT-5.4 nano (xhigh) at 65%)

In our pre-release testing, Command A+ performed strongly on speed for its intelligence, reaching 281 output tokens per second. This puts it ahead of models such as gpt-oss-120b on both intelligence and speed, but behind the new Pareto frontier established by Gemini 3.5 Flash

Amongst comparable models, Command A+ is competitive on intelligence relative to the number of output tokens used to run the Artificial Analysis Intelligence Index

On AA-Omniscience, our knowledge and hallucination evaluation, Command A+ fits a positive archetype where it knows its limitations: although its AA-Omniscience Accuracy is relatively low, it also hallucinates the least on AA-Omniscience, resulting in a fairly strong relative headline AA-Omniscience score of -4
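As a rough check on how the three reported numbers relate, the sketch below recomputes the headline score from Accuracy (9%) and Non-Hallucination (86%). It rests on two assumptions that this post does not state: that the hallucination rate is the share of incorrect answers among all responses that were not correct, and that the headline score is the percentage of correct answers minus the percentage of incorrect answers, with no penalty for abstaining. The function name and parameters are illustrative only.

```python
def omniscience_index(accuracy_pct: float, non_hallucination_pct: float) -> float:
    """Approximate the headline score from accuracy and non-hallucination.

    Assumed definitions (not given in this post):
      hallucination rate = incorrect / (all responses that were not correct)
      headline score     = correct % - incorrect %
    """
    # Share of non-correct responses that were confident wrong answers.
    hallucination_rate = 1.0 - non_hallucination_pct / 100.0
    # Everything that was not answered correctly (wrong answers plus abstentions).
    not_correct_pct = 100.0 - accuracy_pct
    # Percentage of all questions answered incorrectly.
    incorrect_pct = hallucination_rate * not_correct_pct
    return accuracy_pct - incorrect_pct


score = omniscience_index(accuracy_pct=9.0, non_hallucination_pct=86.0)
print(round(score))  # -4
```

Under these assumptions, about 12.7% of questions receive a wrong answer (14% of the 91% not answered correctly), so the score is roughly 9 - 12.7 = -3.7, which rounds to the reported -4. A model with the same accuracy but more wrong guesses in place of abstentions would score lower.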

Breakdown of individual evaluations:

See Artificial Analysis for further details and benchmarks: https://artificialanalysis.ai/models/command-a-plus