
Multilingual Reasoning: Comparison of Leading AI Models by Language

Comparison and analysis of AI models' reasoning performance and price across different languages. Click on any model to see detailed metrics. For more details, including our methodology, see our FAQs.

Ranking of top 5 models by language

English 🇬🇧:
1. Claude 3.5 Sonnet (Oct '24) (94)
2. GPT-4o (Aug '24) (93)
3. DeepSeek V3 (Dec '24) (92)
4. Qwen2.5 Instruct 72B (91)
5. o1-mini (91)

French 🇫🇷:
1. Claude 3.5 Sonnet (Oct '24) (89)
2. Llama 3.3 Instruct 70B (86)
3. DeepSeek V3 (Dec '24) (86)
4. o1-mini (85)
5. Qwen2.5 Instruct 72B (85)

Spanish 🇪🇸:
1. Claude 3.5 Sonnet (Oct '24) (91)
2. DeepSeek V3 (Dec '24) (90)
3. GPT-4o (Aug '24) (88)
4. Qwen2.5 Instruct 72B (88)
5. o1-mini (87)

Chinese 🇨🇳:
1. Claude 3.5 Sonnet (Oct '24) (88)
2. GPT-4o (Aug '24) (87)
3. o1-mini (87)
4. DeepSeek V3 (Dec '24) (87)
5. Gemini 1.5 Pro (Sep '24) (86)

Japanese 🇯🇵:
1. Claude 3.5 Sonnet (Oct '24) (87)
2. o1-mini (86)
3. DeepSeek V3 (Dec '24) (86)
4. GPT-4o (Aug '24) (85)
5. Gemini 1.5 Pro (Sep '24) (84)

Multilingual Index by Language

Artificial Analysis Multilingual Index; Higher is better
Languages shown: English, Spanish, French, German, Swahili, Bengali, Chinese, Japanese.
Artificial Analysis Multilingual Intelligence Index: An average result across evaluations assessing multilingual performance in various dimensions of model intelligence. Includes MMLU (general reasoning) and MGSM (mathematical reasoning), with results computed across multiple languages. See Multilingual Intelligence Index methodology for further details.
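The averaging described above can be sketched in a few lines. This is illustrative only: the scores below are made up, and the index's exact weighting across evaluations and languages is not specified here.

```python
from statistics import mean

# Hypothetical per-language accuracies for two evaluations (not real results).
mmlu = {"en": 0.88, "es": 0.85, "fr": 0.84}  # general reasoning
mgsm = {"en": 0.92, "es": 0.89, "fr": 0.87}  # mathematical reasoning

# Average each evaluation across languages, then average the evaluations.
index = mean([mean(mmlu.values()), mean(mgsm.values())])
print(round(100 * index, 1))  # index on a 0-100 scale
```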

Intelligence & Context Window

Multilingual Index (Average Across Languages)

Artificial Analysis Multilingual Index; Higher is better

Multilingual Index vs. Output Speed

Artificial Analysis Multilingual Index; Output Speed: Output Tokens per Second; 1,000 Input Tokens
Models shown: o1, o1-mini, GPT-4o (Aug '24), GPT-4o mini, Llama 3.3 70B, Llama 3.1 8B, Gemini 1.5 Pro (Sep), Gemini 1.5 Flash (Sep), Claude 3.5 Sonnet (Oct), Claude 3.5 Haiku, Mistral Large 2 (Nov '24), Nova Pro, Aya Expanse 32B, Qwen2.5 72B, DeepSeek V3 (Dec '24).
There is a trade-off between model quality and output speed, with higher intelligence models typically having lower output speed.
Output Speed: Tokens per second received while the model is generating tokens (i.e. after the first chunk has been received from the API, for models that support streaming).
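This definition can be made concrete with a small timing sketch over a streaming response. `fake_stream` is a hypothetical stand-in for a real API's streaming iterator, and chunks are counted in place of tokens for simplicity.

```python
import time

def measure_stream(chunks):
    """Return (time_to_first_chunk, chunks_per_second) for a streamed response.
    Throughput excludes time before the first chunk, per the definition above."""
    start = time.monotonic()
    first = None
    n = 0
    for _ in chunks:
        if first is None:
            first = time.monotonic()  # latency: first chunk received
        n += 1
    if first is None:
        return float("inf"), 0.0  # empty stream
    generating = time.monotonic() - first
    tps = (n - 1) / generating if generating > 0 else float("inf")
    return first - start, tps

def fake_stream(delay=0.01, n=5):
    # Hypothetical stand-in for a streaming API iterator (no real network call).
    for i in range(n):
        time.sleep(delay)
        yield f"chunk{i}"

ttft, tps = measure_stream(fake_stream())
print(f"ttft={ttft:.3f}s speed={tps:.0f} chunks/s")
```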

Multilingual Index vs. Price

Artificial Analysis Multilingual Index; Price: USD per 1M Tokens
Models shown: o1, o1-mini, GPT-4o (Aug '24), GPT-4o mini, Llama 3.3 70B, Llama 3.1 8B, Gemini 1.5 Pro (Sep), Gemini 1.5 Flash (Sep), Claude 3.5 Sonnet (Oct), Claude 3.5 Haiku, Mistral Large 2 (Nov '24), Nova Pro, Aya Expanse 32B, DeepSeek V3 (Dec '24).
While higher intelligence models are typically more expensive, they do not all follow the same price-quality curve.
Price: Price per token, represented as USD per million tokens. Price is a blend of input and output token prices (3:1 input-to-output ratio).
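The 3:1 blend works out to a weighted average that assumes three input tokens for every output token. The prices below are illustrative, not quoted figures.

```python
def blended_price(input_price, output_price, ratio=3):
    """Blend per-million-token prices, weighting input `ratio` times output."""
    return (ratio * input_price + output_price) / (ratio + 1)

# e.g. $3/M input and $15/M output blend to $6/M.
print(blended_price(3.0, 15.0))  # -> 6.0
```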

Multilingual MMLU, Multilingual GSM

Multilingual MMLU, Higher is better; Multilingual GSM, Higher is better
Multilingual MMLU: Massive Multitask Language Understanding, evaluated across multiple languages. Tests general knowledge and reasoning ability in areas like science, humanities, mathematics and more.
Multilingual GSM: Multilingual Grade School Math, evaluates mathematical reasoning ability across different languages using grade school level word problems.

Pricing: Input and Output Prices

Price: USD per 1M Tokens; Lower is better
The relative importance of input vs. output token prices varies by use case. E.g. generation tasks are typically more output token weighted, while document-focused tasks (e.g. RAG) are more input token weighted.
Input Price: Price per token included in the request/message sent to the API, represented as USD per million Tokens.
Output Price: Price per token generated by the model (received from the API), represented as USD per million Tokens.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).

Output Speed

Output Tokens per Second; Higher is better; 1,000 Input Tokens

Latency

Seconds to First Token Received; Lower is better; 1,000 Input Tokens
Latency (Time to First Token): Time to first token received, in seconds, after the API request is sent. For reasoning models that share reasoning tokens, this is the time to the first reasoning token. For models that do not support streaming, this represents the time to receive the full completion.

End-to-End Response Time

Seconds to Output 500 Tokens, including reasoning model 'thinking' time; Lower is better; 1,000 Input Tokens
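End-to-end response time relates directly to the two metrics above: roughly, time to first token plus the output token count divided by output speed. The numbers here are illustrative, not measured figures.

```python
def e2e_seconds(ttft_s, output_tokens, tokens_per_second):
    """Approximate end-to-end response time: latency plus generation time."""
    return ttft_s + output_tokens / tokens_per_second

# e.g. 0.5 s to first token, then 500 tokens at 100 tok/s.
print(e2e_seconds(0.5, 500, 100))  # -> 5.5
```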

Models compared: OpenAI: GPT 4o Audio, GPT 4o Realtime, GPT 4o Speech Pipeline, GPT-3.5 Turbo, GPT-3.5 Turbo (0125), GPT-3.5 Turbo (0314), GPT-3.5 Turbo (1106), GPT-3.5 Turbo Instruct, GPT-4, GPT-4 Turbo, GPT-4 Turbo (0125), GPT-4 Turbo (1106), GPT-4 Vision, GPT-4.5 (Preview), GPT-4o (Aug '24), GPT-4o (ChatGPT), GPT-4o (March 2025), GPT-4o (May '24), GPT-4o (Nov '24), GPT-4o Realtime (Dec '24), GPT-4o mini, GPT-4o mini Realtime (Dec '24), o1, o1-mini, o1-preview, o1-pro, o3, o3-mini, and o3-mini (high), Meta: Code Llama 70B, Llama 2 Chat 13B, Llama 2 Chat 70B, Llama 2 Chat 7B, Llama 3 70B, Llama 3 8B, Llama 3.1 405B, Llama 3.1 70B, Llama 3.1 8B, Llama 3.2 11B (Vision), Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 90B (Vision), and Llama 3.3 70B, Google: Gemini 1.0 Pro, Gemini 1.5 Flash (May), Gemini 1.5 Flash (Sep), Gemini 1.5 Flash-8B, Gemini 1.5 Pro (May), Gemini 1.5 Pro (Sep), Gemini 2.0 Flash, Gemini 2.0 Flash (exp), Gemini 2.0 Flash Thinking exp. (Dec '24), Gemini 2.0 Flash Thinking exp. (Jan '25), Gemini 2.0 Flash-Lite (Feb '25), Gemini 2.0 Flash-Lite (Preview), Gemini 2.0 Pro Experimental, Gemini 2.5 Pro Experimental, Gemini Experimental (Nov), Gemma 2 27B, Gemma 2 9B, Gemma 3 12B, Gemma 3 1B, Gemma 3 27B, Gemma 3 4B, and Gemma 7B, Anthropic: Claude 2.0, Claude 2.1, Claude 3 Haiku, Claude 3 Opus, Claude 3 Sonnet, Claude 3.5 Haiku, Claude 3.5 Sonnet (June), Claude 3.5 Sonnet (Oct), Claude 3.7 Sonnet Thinking, Claude 3.7 Sonnet, and Claude Instant, Mistral: Codestral (Jan '25), Codestral (May '24), Codestral-Mamba, Ministral 3B, Ministral 8B, Mistral 7B, Mistral Large (Feb '24), Mistral Large 2 (Jul '24), Mistral Large 2 (Nov '24), Mistral Medium, Mistral NeMo, Mistral Saba, Mistral Small (Feb '24), Mistral Small (Sep '24), Mistral Small 3, Mistral Small 3.1, Mixtral 8x22B, Mixtral 8x7B, Pixtral 12B, and Pixtral Large, DeepSeek: DeepSeek Coder V2 Lite, DeepSeek LLM 67B (V1), DeepSeek R1, DeepSeek R1 (FP4), DeepSeek R1 Distill Llama 70B, DeepSeek R1 Distill Llama 8B, 
DeepSeek R1 Distill Qwen 1.5B, DeepSeek R1 Distill Qwen 14B, DeepSeek R1 Distill Qwen 32B, DeepSeek V3 (Dec '24), DeepSeek V3 (Mar '25), DeepSeek-Coder-V2, DeepSeek-V2, DeepSeek-V2.5, DeepSeek-V2.5 (Dec '24), DeepSeek-VL2, and Janus Pro 7B, Perplexity: PPLX-70B Online, PPLX-7B-Online, R1 1776, Sonar, Sonar 3.1 Huge, Sonar 3.1 Large, Sonar 3.1 Small, Sonar Large, Sonar Pro, Sonar Reasoning, Sonar Reasoning Pro, and Sonar Small, xAI: Grok 2, Grok 3, Grok 3 Reasoning Beta, Grok 3 mini, Grok 3 mini Reasoning, Grok Beta, and Grok-1, OpenChat: OpenChat 3.5, Amazon: Nova Lite, Nova Micro, and Nova Pro, Microsoft Azure: Phi-3 Medium 14B, Phi-3 Mini, Phi-4, Phi-4 Mini, and Phi-4 Multimodal, Liquid AI: LFM 1.3B, LFM 3B, and LFM 40B, Upstage: Solar Mini, Solar Pro, and Solar Pro (Nov '24), Databricks: DBRX, MiniMax: MiniMax-Text-01, NVIDIA: Cosmos Nemotron 34B, Llama 3.1 Nemotron 70B, Llama 3.1 Nemotron Nano 8B, Llama 3.3 Nemotron Nano 8B v1 (Reasoning), Llama 3.3 Nemotron Super 49B v1, and Llama 3.3 Nemotron Super 49B v1 (Reasoning), IBM: Granite 3.0 2B and Granite 3.0 8B, Inceptionlabs: Mercury Coder Mini, Mercury Coder Small, and Mercury Instruct, Reka AI: Reka Core, Reka Edge, Reka Flash (Feb '24), Reka Flash, and Reka Flash 3, Other: LLaVA-v1.5-7B, Cohere: Aya Expanse 32B, Aya Expanse 8B, Command, Command A, Command Light, Command R7B, Command-R, Command-R (Mar '24), Command-R+ (Apr '24), and Command-R+, AI21 Labs: Jamba 1.5 Large, Jamba 1.5 Large (Feb '25), Jamba 1.5 Mini, Jamba 1.5 Mini (Feb '25), Jamba 1.6 Large, Jamba 1.6 Mini, and Jamba Instruct, Snowflake: Arctic, Alibaba: QwQ-32B, QwQ 32B-Preview, Qwen Chat 72B, Qwen Plus, Qwen Turbo, Qwen1.5 Chat 110B, Qwen1.5 Chat 14B, Qwen1.5 Chat 32B, Qwen1.5 Chat 72B, Qwen1.5 Chat 7B, Qwen2 72B, Qwen2 Instruct 7B, Qwen2 Instruct A14B 57B, Qwen2-VL 72B, Qwen2.5 Coder 32B, Qwen2.5 Coder 7B, Qwen2.5 Instruct 14B, Qwen2.5 Instruct 32B, Qwen2.5 72B, Qwen2.5 Instruct 7B, Qwen2.5 Max, Qwen2.5 Max 01-29, Qwen2.5 VL 72B,
and Qwen2.5 VL 7B, and 01.AI: Yi-Large.
