{"id":1144186,"name":"Share of FrontierMath problems solved correctly by AI models","unit":"%","createdAt":"2026-02-02T09:53:35.000Z","updatedAt":"2026-03-08T06:32:00.000Z","coverage":"","timespan":"","datasetId":7342,"shortUnit":"%","columnOrder":0,"shortName":"mean_score","catalogPath":"grapher/artificial_intelligence/2026-01-30/frontiermath/epoch_benchmark_data#mean_score","descriptionShort":"FrontierMath benchmark evaluates models on 300 difficult, research-level problems in advanced mathematics (Tiers 1–3), which can take expert mathematicians hours or days to work through.","descriptionFromProducer":"[FrontierMath](https://epoch.ai/frontiermath) is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics – from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days.\n\nThe full FrontierMath dataset contains 350 problems. This is split into a base set of 300 problems, which we call Tiers 1-3, and an expansion set of 50 exceptionally difficult problems, which we call Tier 4. We have made 10 problems from Tiers 1-3 public, calling this frontiermath-2025-02-28-public. The remaining 290 problems make up frontiermath-2025-02-28-private. Similarly, we have made 2 problems from Tier 4 public, calling this frontiermath-tier-4-2025-07-01-public, while the remaining 48 problems make up frontiermath-tier-4-2025-07-01-private. Unless explicitly mentioned otherwise, all the numbers on this hub correspond to evaluations on the private sets. You can find more information about the public problems [here](https://epoch.ai/frontiermath/tiers-1-4/benchmark-problems).\n\nFrontierMath was developed with funding from OpenAI, who has exclusive access to a subset of the benchmark.","type":"float","dataChecksum":"6893403416683070993","metadataChecksum":"3897370649862209386","datasetName":"Epoch AI Benchmark Data","updatePeriodDays":31,"datasetVersion":"2026-01-30","nonRedistributable":false,"display":{"unit":"%","zeroDay":"2024-06-20","shortUnit":"%","yearIsDay":true,"numDecimalPlaces":1},"schemaVersion":2,"processingLevel":"minor","presentation":{"topicTagsLinks":["Artificial Intelligence"]},"descriptionKey":["This indicator shows the share of FrontierMath problems that AI models solve correctly, based on Epoch AI's evaluation.","FrontierMath is a set of 350 original math problems written by experts, covering many areas of advanced mathematics. Many problems are difficult enough that human specialists might need hours or days to solve them.","The benchmark has four difficulty tiers. This indicator shows accuracy on Tiers 1–3 (300 problems). Tier 4 contains 50 exceptionally difficult problems and is not included here.","Scoring is all-or-nothing: models get 1 point for a correct final answer and 0 for anything else, with no partial credit. Models submit their answers as Python code and can use Python while working on problems. This means scores reflect math ability with access to computational tools, not just pen-and-paper reasoning.","Only 12 problems are publicly available, mainly so researchers can inspect how evaluations work, not to report scores.","FrontierMath was developed by Epoch AI with funding from OpenAI, whose GPT models are among those evaluated on this benchmark. OpenAI has exclusive access to a subset of the problems."],"dimensions":{"years":{"values":[{"id":0},{"id":47},{"id":84},{"id":96},{"id":124},{"id":151},{"id":153},{"id":175},{"id":180},{"id":189},{"id":219},{"id":225},{"id":230},{"id":249},{"id":289},{"id":293},{"id":298},{"id":300},{"id":312},{"id":321},{"id":336},{"id":350},{"id":362},{"id":384},{"id":400},{"id":407},{"id":411},{"id":413},{"id":466},{"id":467},{"id":482},{"id":504},{"id":511},{"id":516},{"id":522},{"id":529},{"id":539},{"id":545},{"id":550},{"id":586},{"id":595},{"id":601},{"id":607},{"id":609},{"id":623}]},"entities":{"values":[{"id":372564,"name":"Claude 3.5 Sonnet (Jun 2024)","code":null},{"id":372520,"name":"GPT 4 (Aug 2024)","code":null},{"id":372530,"name":"o1 mini (Sep 2024), high","code":null},{"id":372566,"name":"o1 mini (Sep 2024), medium","code":null},{"id":372525,"name":"Gemini 1.5 Flash 002","code":null},{"id":372576,"name":"Claude 3.5 Haiku (Oct 2024)","code":null},{"id":372533,"name":"Claude 3.5 Sonnet (Oct 2024)","code":null},{"id":372579,"name":"Mistral Large (Nov 2024)","code":null},{"id":372563,"name":"GPT 4 (Nov 2024)","code":null},{"id":372555,"name":"Grok 2","code":null},{"id":372567,"name":"o1 (Dec 2024), high","code":null},{"id":372514,"name":"DeepSeek V3","code":null},{"id":372517,"name":"Qwenmax (Jan 2025)","code":null},{"id":372569,"name":"o3 mini (Jan 2025), high","code":null},{"id":372549,"name":"o3 mini (Jan 2025), medium","code":null},{"id":372524,"name":"Gemini 2.0 Flash 001","code":null},{"id":372535,"name":"Claude 3.7 Sonnet (Feb 2025)","code":null},{"id":372550,"name":"Claude 3.7 Sonnet (Feb 2025), 16K","code":null},{"id":372545,"name":"Claude 3.7 Sonnet (Feb 2025), 32K","code":null},{"id":372521,"name":"Claude 3.7 Sonnet (Feb 2025), 64K","code":null},{"id":372554,"name":"Llama4 Maverick 17B 128E Instruct Fp8","code":null},{"id":372536,"name":"Llama4 Scout 17B 16E Instruct","code":null},{"id":372537,"name":"Grok 3 beta","code":null},{"id":372529,"name":"Grok 3 mini beta, high","code":null},{"id":372572,"name":"Grok 3 mini beta, low","code":null},{"id":372561,"name":"GPT 4.1 (Apr 2025)","code":null},{"id":372526,"name":"GPT 4.1 mini (Apr 2025)","code":null},{"id":372531,"name":"GPT 4.1 nano (Apr 2025)","code":null},{"id":372578,"name":"o3 (Apr 2025), high","code":null},{"id":372548,"name":"o3 (Apr 2025), low","code":null},{"id":372558,"name":"o3 (Apr 2025), medium","code":null},{"id":372528,"name":"o4 mini (Apr 2025), high","code":null},{"id":372510,"name":"o4 mini (Apr 2025), low","code":null},{"id":372518,"name":"o4 mini (Apr 2025), medium","code":null},{"id":372565,"name":"Qwenplus (Apr 2025)","code":null},{"id":372556,"name":"Mistral Medium (May 2025)","code":null},{"id":372516,"name":"Claude Opus 4 (May 2025)","code":null},{"id":372515,"name":"Claude Opus 4 (May 2025), 27K","code":null},{"id":372538,"name":"Claude Sonnet 4 (May 2025)","code":null},{"id":372562,"name":"Gemini 2.5 Pro preview","code":null},{"id":372527,"name":"Gemini 2.5 Flash","code":null},{"id":371990,"name":"Gemini 2.5 Pro","code":null},{"id":371820,"name":"Grok 4","code":null},{"id":372552,"name":"Qwen3 235B A22B Thinking","code":null},{"id":372559,"name":"Gemini 2.5 Deep think (Aug 2025)","code":null},{"id":372511,"name":"Claude Opus 4 (Aug 2025)","code":null},{"id":372547,"name":"Claude Opus 4 (Aug 2025), 27K","code":null},{"id":372570,"name":"GPT 5 (Aug 2025), high","code":null},{"id":372522,"name":"GPT 5 (Aug 2025), medium","code":null},{"id":372551,"name":"GPT 5 mini (Aug 2025), high","code":null},{"id":372577,"name":"GPT 5 mini (Aug 2025), medium","code":null},{"id":372568,"name":"GPT 5 nano (Aug 2025), high","code":null},{"id":372539,"name":"GPT 5 nano (Aug 2025), medium","code":null},{"id":372512,"name":"Claude Sonnet 4 (Sep 2025)","code":null},{"id":372534,"name":"Claude Sonnet 4 (Sep 2025), 32K","code":null},{"id":372560,"name":"Claude Sonnet 4 (Sep 2025), 59K","code":null},{"id":372321,"name":"GLM 4.6","code":null},{"id":372546,"name":"Claude Haiku 4 (Oct 2025)","code":null},{"id":372513,"name":"Claude Haiku 4 (Oct 2025), 32K","code":null},{"id":372341,"name":"Kimi K2 Thinking","code":null},{"id":372519,"name":"GPT 5.1 (Nov 2025), high","code":null},{"id":372540,"name":"GPT 5.1 (Nov 2025), low","code":null},{"id":372574,"name":"GPT 5.1 (Nov 2025), medium","code":null},{"id":372573,"name":"GPT 5.1 _none (Nov 2025)","code":null},{"id":372571,"name":"Gemini 3 Pro preview","code":null},{"id":372553,"name":"Claude Opus 4 (Nov 2025)","code":null},{"id":372543,"name":"Claude Opus 4 (Nov 2025), 16K","code":null},{"id":372541,"name":"Claude Opus 4 (Nov 2025), 32K","code":null},{"id":372575,"name":"DeepSeek V3P2","code":null},{"id":372544,"name":"GPT 5.2 (Dec 2025), high","code":null},{"id":372523,"name":"GPT 5.2 (Dec 2025), low","code":null},{"id":372542,"name":"GPT 5.2 (Dec 2025), medium","code":null},{"id":372532,"name":"GPT 5.2 (Dec 2025), xhigh","code":null},{"id":372557,"name":"Gemini 3 Flash preview","code":null},{"id":372625,"name":"GLM 4.7","code":null},{"id":372626,"name":"Kimi K2P5","code":null},{"id":371813,"name":"Claude Opus 4","code":null},{"id":372628,"name":"Claude Opus 4, 32K","code":null},{"id":372622,"name":"Claude Opus 4, 64K","code":null},{"id":372621,"name":"Claude Opus 4, max","code":null},{"id":372623,"name":"GLM 5","code":null},{"id":372627,"name":"Claude Sonnet 4, 16K","code":null},{"id":372624,"name":"Gemini 3.1 Pro preview","code":null},{"id":372650,"name":"GPT 5.4 (Mar 2026), xhigh","code":null},{"id":372651,"name":"GPT 5.4 pro (Mar 2026), xhigh","code":null}]}},"origins":[{"id":14137,"title":"Epoch AI Benchmark Data","description":"Comprehensive collection of AI benchmark datasets from Epoch AI, including FrontierMath and other performance benchmarks.","producer":"Epoch AI","citationFull":"Epoch AI, ‘AI Benchmarking Hub’. Published online at epoch.ai. Retrieved from ‘https://epoch.ai/benchmarks’ [online resource]. Accessed 30 Jan 2026.","urlMain":"https://epoch.ai/benchmarks","urlDownload":"https://epoch.ai/data/benchmark_data.zip","dateAccessed":"2026-03-07","datePublished":"2026-01-26","license":{"url":"https://epoch.ai/about","name":"CC BY 4.0"}}]}