A collection of physics animations, mostly using p5.js
Inspired by Simon Willison's pelican on a bicycle test
Easy questions that LLMs still trip up on
80s-style arcade racing games
Model responses to life questions
A series of challenging ASCII artworks in different styles
This benchmark is designed to show how difficult this task is for all LLMs.
Renal physiology quiz.
Generate a complete, ready-to-play browser game with a single prompt
Clone of Hacker News website
Endocrine system quiz. Disclaimer: translated from Spanish to English
A twist on the Snake game where the snake is having an existential crisis
An ASCII artwork of the Eiffel Tower
Generate the complete HTML, CSS, and JavaScript code for a web-based simulation of Conway's Game of Life.
A variety of difficult programming tests with complex input and output parameters.
Who will win the 7-seater SUV fight to the death?
First test
Testing a few simple experiments and visualizations
Pico
This microeval asks the LLM to produce a single HTML code block, with optional CSS, JavaScript, and GLSL, that renders a full-screen Julia set shader animation. The animation should transition smoothly in color from golden yellow through orange, magenta, and purple to deep indigo; morph continuously via a rotating complex constant; support click-and-drag panning, mouse-wheel or pinch zoom, and a space-bar toggle for play and pause; and rely solely on Three.js loaded from a CDN.
Words spelt out using indexes of the key on a QWERTY keyboard
This evaluation contains example dialogues simulating common pharmacy consultations. The prompts include requests for advice on vitamins, minor symptoms, or situations that require an initial assessment before referral to a healthcare professional. They are designed to test a model's ability to give useful, safe responses that comply with good pharmacy practice.
Write a tweet-length sci-fi story
A visual IQ test generator
Use svg
Can the models correctly write Svelte 5 components? Do they avoid using patterns from earlier versions?
Cardiac physiology quiz.
Testing knowledge of Czech culture and language - designed to test smaller models based on https://semanticmachines.notion.site/evals
Glasses of wine are traditionally only half-full.
Kenta
Write an SVG animation that draws a cute kitten using HTML and CSS.
The 10 public Simple Bench questions (https://simple-bench.com/)
JPEval poses questions in Japanese, a language LLMs struggle with!
API Key for "[email protected]"
Reasoning should include the ability to generalize to unfamiliar words instead of memorizing answers. Let's see if models can detect the number of 'r's in the word "strawrbrerrry."
A basic Minecraft 3D eval. It should create a basic chunk with a greedy mesher, block placing and destroying, and an FPS camera
The purpose of this is to evaluate how good various AI models are at a variety of Minecraft skills like planning, designing, puzzles, and providing accurate information. It'll also test how good each AI is at coding a web app clone of the game.
More 't's and a 'T' are added to confuse the AI
Simple and detailed prompts, plus a Persian version
A set of difficult emoji-themed tasks, including tier list creation, music creation, and themeable website creation.
5 challenging math problems
Detailed prompt and simple prompt
Persian prompts but translated in output
Visual perception of the 5 most important unsolved concepts in mathematics!
Assorted SVG generation prompts
How well can AI create fun, little games as web apps that can be played on mobile devices? This benchmark tests AI's abilities to generate good mobile UI and controls as well as basic gameplay experiences.
from https://x.com/goodside/status/1934833254726521169/photo/1
The AI can make an infographic about the composition of the government.
Micro gold
Tetris that runs in a web browser
Attempted proof of it
Test
-
Ahmed
syuaib
nbhvvg
This prompt tests knowledge and design sense of coding models. It compares smaller and larger models of the same family.
Shows how well the best models can write.
Race 3d
An assistant that helps manage daily diet, balance records, and everyday leisure and entertainment
This asks about an Australian case that is widely cited, but not widely mentioned on the internet. The prompt is deliberately misleading in that the decision was unanimous.
Pallav Agarwal
Simple roleplay prompt example (by olety)
convert a timestamp to epoch
ai
A simple eval to check LLM capability and creativity when expanding an idea seed into more refined ideas.
Which LLM is best at generating Roblox-related code?
ุดุจุณ
Trading Analysis Using Elliott Wave
test
Revenue analysis of Apple
ุงู
้พ
MCP server to help others integrate with us
Create voice agent pricing calculator by Nikhil. R
no
A high-level physics benchmark
222
e
Creates dynamic IQ Tests
meh meh
Adapted from 2024 IMO Problem 5
Build out industry supply and demand
Generation of an HTML application that renders a mathematically accurate black hole. (I already feel the fascination the fans of Interstellar have right now.)
The only GSM8K problem that most frontier models get wrong.
Create a website that simulates the Tower of Hanoi puzzle
AI plays the markets
ุจูุจ
Creates a 3D Minecraft-like game
One-shot GTA clone game
syuaib
A lab where you can create your own AI architectures and neural networks
make a cat