March 5, 2026
Qwen3.5 small models: Everything you need to know
Alibaba has released 4 new Qwen3.5 models from 0.8B to 9B. The 9B (Reasoning, 32 on the Intelligence Index) is the most intelligent model under 10B parameters, and the 4B (Reasoning, 27) is the most intelligent under 5B, but both use 200M+ output tokens to run the Intelligence Index.
Alibaba has expanded the Qwen3.5 family with four smaller dense models: the 9B (Reasoning, 32 on the Intelligence Index), 4B (Reasoning, 27), 2B (Reasoning, 16), and 0.8B (Reasoning, 9). These complement the larger 397B, 122B A10B, 35B A3B, and 27B models released earlier this month. All models are Apache 2.0 licensed, support 262K context, include native vision support, and use the same unified thinking/non-thinking hybrid approach as the rest of the Qwen3.5 family.
Key benchmarking results for the reasoning variants:
➤ The 9B and 4B are the most intelligent models in their respective size classes, ahead of all other models under 10B parameters. Qwen3.5 9B (32) scores roughly double the next closest models under 10B: Falcon-H1R-7B (16) and NVIDIA Nemotron Nano 9B V2 (Reasoning, 15). Qwen3.5 4B (27) outscores both despite having roughly half the parameters. Qwen3.5 2B (16) matches the 7B Falcon-H1R and exceeds the 9B Nemotron Nano V2 at a fraction of their size. All four of the small Qwen3.5 models push the Pareto frontier of the Intelligence vs. Total Parameters chart.
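The Pareto frontier claim above can be checked directly from the numbers quoted in this article. A minimal sketch (parameter counts and Intelligence Index scores are the figures cited here; the dominance rule assumed is "no other model is at least as small and at least as intelligent, with one strict inequality"):

```python
# Intelligence Index scores and total parameters (billions) as quoted above.
models = {
    "Qwen3.5 9B":          (9.0, 32),
    "Qwen3.5 4B":          (4.0, 27),
    "Qwen3.5 2B":          (2.0, 16),
    "Qwen3.5 0.8B":        (0.8, 9),
    "Falcon-H1R-7B":       (7.0, 16),
    "Nemotron Nano 9B V2": (9.0, 15),
}

def pareto_frontier(models):
    """Return models not dominated on (fewer params, higher score)."""
    frontier = []
    for name, (p, s) in models.items():
        dominated = any(
            p2 <= p and s2 >= s and (p2 < p or s2 > s)
            for other, (p2, s2) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
# → ['Qwen3.5 9B', 'Qwen3.5 4B', 'Qwen3.5 2B', 'Qwen3.5 0.8B']
```

Falcon-H1R-7B and Nemotron Nano 9B V2 are both dominated by the 2B, which matches or beats their scores at a fraction of the parameters, so all four small Qwen3.5 models land on the frontier.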
➤ The Qwen3.5 generation represents a material intelligence uplift over Qwen3 across all sub-10B model sizes, with larger gains at higher parameter counts. Comparing reasoning variants: Qwen3.5 9B (32) is 15 points ahead of Qwen3 VL 8B (17), the 4B (27) gains 9 points over Qwen3 4B 2507 (18), the 2B (16) is 3 points ahead of Qwen3 1.7B (estimated 13), and the 0.8B (9) gains 2.5 points over Qwen3 0.6B (6.5).
➤ All four models use 230-390M output tokens to run the Intelligence Index, significantly more than both their larger Qwen3.5 siblings and their Qwen3 predecessors. Qwen3.5 2B used ~390M output tokens, the 9B ~260M, the 4B ~240M, and the 0.8B ~230M. For context, the much larger Qwen3.5 27B used 98M and the 397B flagship used 86M. These token counts also exceed most frontier models: Gemini 3.1 Pro Preview (57M), GPT-5.1 high (69M), and GLM-5 Reasoning (109M). Only GPT-5.2 Codex xhigh (202M) is comparable, and the Qwen3.5 2B nearly doubles it.
➤ AA-Omniscience is a relative weakness, with hallucination rates of 80-82% for the 4B and 9B. Qwen3.5 4B scores -57 on AA-Omniscience with a hallucination rate of 80% and accuracy of 12.8%. Qwen3.5 9B scores -56 with 82% hallucination and 14.7% accuracy. These are marginally better than their Qwen3 predecessors (Qwen3 4B 2507: -61, 84% hallucination, 12.7% accuracy), with the improvement driven primarily by lower hallucination rates rather than higher accuracy.
➤ The Qwen3.5 sub-10B models combine high intelligence with native vision at a scale previously unavailable. On MMMU-Pro (multimodal reasoning), Qwen3.5 9B scores 69.2% and 4B scores 65.4%, ahead of Qwen3 VL 8B (56.6%), Qwen3 VL 4B (52.0%), and Ministral 3 8B (46.0%). The Qwen3.5 0.8B scores 25.8%, which is notable for a sub-1B model.
Other information:
➤ Context window: 262K tokens.
➤ License: Apache 2.0.
➤ Quantization: Native weights are BF16. Alibaba has not released first-party GPTQ-Int4 quantizations for these small models, though it has for the larger models in the Qwen3.5 family released earlier (27B, 35B-A3B, 122B-A10B, 397B-A17B). At 4-bit quantization, the 9B requires roughly 6GB, the 4B roughly 3GB, and the 2B and 0.8B under 2GB each, making all four models accessible on consumer hardware, including laptops and smartphones.
➤ Availability: At time of publishing, there are no first-party or third-party serverless APIs hosting these models.

The Qwen3.5 generation is a step change in small model intelligence over Qwen3. The 9B gains 15 points over Qwen3 VL 8B (17 to 32), the 4B gains 9 points over Qwen3 4B 2507 (18 to 27), the 2B gains 3 points over Qwen3 1.7B (13 to 16), and the 0.8B gains 2.5 points over Qwen3 0.6B (6.5 to 9).

The intelligence gains come at the cost of high token usage compared to peers. All four sub-10B Qwen3.5 models use 230M+ output tokens to run the Intelligence Index, significantly more than both frontier models and their Qwen3 predecessors.

The Qwen3.5 9B and 4B models are the most intelligent multimodal models under 15B parameters. On MMMU-Pro, Qwen3.5 9B (69%) and 4B (65%) lead all sub-15B models.

Breakdown of individual results for all four models

Compare the full Qwen3.5 family with other leading models at: https://artificialanalysis.ai/models/qwen3-5-9b