Multilingual AI Model Benchmark: Compare Leading LLMs by Language
Explore how leading large language models (LLMs) perform across multiple languages on Artificial Analysis' Multilingual Index, including the Global-MMLU-Lite benchmark. Filter by language and model, view trade-offs between accuracy, speed, and cost, and find the best LLM for your multilingual use case.
For details on datasets and methodology, see the FAQ page.
Artificial Analysis Multilingual Index
An index assessing multilingual performance in general reasoning across multiple languages. Results are computed across English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Portuguese, Indonesian, Japanese, Swahili, German, Korean, Italian, Yoruba, and Burmese. See the Multilingual Intelligence Index methodology for further details.
Multilingual Index Across Languages (Normalized)
Multilingual Index: Average Across All Languages
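The "average across all languages" is the arithmetic mean of a model's per-language scores. A minimal sketch, using made-up scores for a subset of languages (the real index spans all 16 evaluated languages, and these numbers are illustrative, not actual results):

```python
# Hypothetical per-language scores for one model (0-100 scale).
scores = {"English": 85, "Chinese": 80, "Hindi": 72, "Spanish": 83}

# The multilingual index is the unweighted mean across languages.
multilingual_index = sum(scores.values()) / len(scores)
print(multilingual_index)  # -> 80.0
```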
Multilingual Index: Average vs. Output Speed
There is a trade-off between model quality and output speed: higher-intelligence models typically have lower output speed.
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models which support streaming).
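As a rough illustration of how such a measurement works (not Artificial Analysis' actual harness), output speed can be estimated by starting the clock when the first streamed chunk arrives and counting only the tokens received after it:

```python
import time

def measure_output_speed(chunks):
    """Estimate generation speed in tokens/second, excluding time to the
    first chunk: the clock starts when the first chunk arrives, and only
    tokens received after that first chunk are counted."""
    start = None
    tokens_after_first = 0
    for chunk in chunks:  # each chunk is a piece of streamed text
        if start is None:
            start = time.perf_counter()  # first chunk: start the clock
        else:
            # Whitespace split is a crude stand-in for real tokenization.
            tokens_after_first += len(chunk.split())
    elapsed = time.perf_counter() - start
    return tokens_after_first / elapsed if elapsed > 0 else float("inf")
```

A real harness would use the provider's streaming API and the model's own tokenizer; the sketch only shows where the clock starts and stops.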
Multilingual Index: Average vs. Price
While higher intelligence models are typically more expensive, they do not all follow the same price-quality curve.
Price per token, represented as USD per million tokens. Price is a blend of input and output token prices (3:1 input-to-output ratio).
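The 3:1 blend weights the input price three times as heavily as the output price. A minimal sketch, with hypothetical prices for illustration:

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Blend input and output prices (USD per million tokens)
    at a 3:1 input-to-output ratio."""
    return (3 * input_usd_per_m + output_usd_per_m) / 4

# Hypothetical pricing: $3/M input tokens, $15/M output tokens.
print(blended_price(3.0, 15.0))  # -> 6.0
```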
Multilingual Global-MMLU-Lite: Average
A multilingual version of Massive Multitask Language Understanding, evaluated across multiple languages. Tests general knowledge and reasoning ability in areas like science, humanities, mathematics and more. See methodology for further details.
Pricing: Input and Output Prices
Input price: USD per million tokens included in the request/message sent to the API. Output price: USD per million tokens generated by the model in its response.
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
Output Speed
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models which support streaming).
Latency: Time To First Answer Token
Time to first answer token received, in seconds, after the API request is sent. For reasoning models, this includes the 'thinking' time of the model before providing an answer. For models which do not support streaming, this represents the time to receive the completion.
End-to-End Response Time
Seconds to receive a 500-token response. Key components:
- Input time: Time to receive the first response token
- Thinking time (reasoning models only): Time spent outputting reasoning tokens before providing an answer. The token count is based on the average number of reasoning tokens across a diverse set of 60 prompts (see methodology for details).
- Answer time: Time to generate 500 output tokens, derived from output speed
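The components above can be combined in a small sketch. The numbers are illustrative, not measured figures; reasoning token counts and output speeds vary widely by model:

```python
def end_to_end_seconds(input_time_s: float, reasoning_tokens: float,
                       output_tokens_per_s: float,
                       answer_tokens: int = 500) -> float:
    """End-to-end response time: input time (time to first response token)
    + thinking time (reasoning tokens / output speed; zero for
    non-reasoning models) + answer time (answer tokens / output speed)."""
    thinking_s = reasoning_tokens / output_tokens_per_s
    answer_s = answer_tokens / output_tokens_per_s
    return input_time_s + thinking_s + answer_s

# Hypothetical model: 0.5 s input time, 2,000 reasoning tokens, 100 tokens/s.
print(end_to_end_seconds(0.5, 2000, 100))  # -> 25.5
```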