How the Frontier Models Index is built.

A transparent, independently reproducible rating system for frontier AI models. Six capability dimensions. 30+ public benchmark sources. Quarterly refresh. Labs cannot pay to improve their grade — methodology, benchmark sources, and red-team disclosures are published in full.

/ 01

Guiding Principles.

The Frontier Models Index exists to answer a single question for engineers, procurement, and AI strategy leaders: which model should actually ship in production?

Public benchmarks only. Every input is sourced from public benchmarks (SWE-Bench Verified, GPQA-Diamond, AIME, MMLU-Pro, TAU-bench, MMMU, HarmBench), published red-team disclosures, model cards, or FASCIA's proprietary evaluation suite, which is itself published.
Labs cannot pay. No AI lab has paid, can pay, or has been offered the opportunity to pay for inclusion, exclusion, or modification of their model's grade.
Quarterly refresh. Grades update every 90 days; new model releases trigger interim updates. Material changes are timestamped.
Subscore transparency. Every grade decomposes into six public capability subscores.
Right of correction. Labs may submit documented corrections via the published Appeals process, particularly when published benchmark results lag actual model state.
Benchmark contamination tracking. Models with documented benchmark-data contamination receive published penalty annotations.
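As a rough illustration of the principles above, a published grade can be pictured as a record that carries its six subscores, a publication date, and any contamination annotations. The sketch below is hypothetical: the `GradeRecord` type, its field names, and the `refresh_due` helper are assumptions made for illustration, not the Index's actual schema. Only the six dimension names and the 90-day cadence come from the methodology text.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# The six capability dimensions named in the methodology.
DIMENSIONS = (
    "General Reasoning",
    "Code Generation",
    "Math & STEM",
    "Tool Use & Agency",
    "Multimodal",
    "Safety & Alignment",
)

# Refresh cadence stated in the methodology: every 90 days.
REFRESH_INTERVAL = timedelta(days=90)


@dataclass
class GradeRecord:
    """Hypothetical shape of one published grade (illustrative only)."""
    model: str
    published: date
    subscores: dict  # dimension name -> subscore
    contamination_notes: list = field(default_factory=list)

    def __post_init__(self):
        # Every grade must decompose into exactly the six public subscores.
        if set(self.subscores) != set(DIMENSIONS):
            raise ValueError("a grade needs exactly the six dimension subscores")

    def refresh_due(self, today: date) -> bool:
        """True once 90 or more days have passed since publication."""
        return today - self.published >= REFRESH_INTERVAL
```

A record built with a missing dimension is rejected at construction, which mirrors the subscore-transparency principle; interim updates triggered by new releases are not modeled here.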

/ 02

The Six Capability Dimensions.

/ Dimension 01

General Reasoning

Multi-domain reasoning capability. Sources: MMLU-Pro, GPQA-Diamond (graduate-level reasoning), Big-Bench Hard, ARC-AGI.

/ Dimension 02

Code Generation

Production-grade code generation. Sources: SWE-Bench Verified (real-world software engineering tasks), HumanEval+, MBPP+, LiveCodeBench.

/ Dimension 03

Math & STEM

Mathematical and scientific reasoning. Sources: MATH, AIME (American Invitational Mathematics Examination), FrontierMath, SciCode.

/ Dimension 04

Tool Use & Agency

Multi-step agentic workflows. Sources: TAU-bench (customer-service agency), AgentBench, BFCL (Berkeley Function Calling Leaderboard), SWE-Bench Verified Agentic.

/ Dimension 05

Multimodal

Vision, audio, and video understanding. Sources: MMMU, VQA, AudioBench, video-QA benchmarks.

/ Dimension 06

Safety & Alignment

Refusal-of-misuse behavior and adversarial robustness. Sources: HarmBench, published red-team disclosures (Anthropic, OpenAI, UK AISI, US AISI), refusal evaluation suites.

/ 03

Weighting & Scoring.

Capability Dimension    Weight
General Reasoning       22%
Code Generation         22%
Tool Use & Agency       18%
Math & STEM             14%
Safety & Alignment      14%
Multimodal              10%
Composite Score         100%

Why these weights? General Reasoning and Code Generation tie for highest weight because they are the highest-frequency production workloads in enterprise AI deployment. Tool Use & Agency is weighted heavily because agentic workflows are the dominant frontier in 2026. Safety & Alignment is weighted at 14% — material but not dominant, reflecting that enterprise risk-sensitivity varies by application. Multimodal weight is lower because most enterprise workloads are text-primary; multimodal-heavy applications should reweight using the published methodology.
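The weighting and the reweighting advice can be sketched in a few lines. This is a minimal illustration, assuming each subscore is on a 0 to 100 scale and that the composite is a weight-normalized sum of the six subscores; the text here does not spell out the aggregation formula, so `composite_score` and `reweight` are hypothetical helpers, not the Index's published implementation.

```python
# Published dimension weights in percent, copied from the weighting table.
WEIGHTS = {
    "General Reasoning": 22,
    "Code Generation": 22,
    "Tool Use & Agency": 18,
    "Math & STEM": 14,
    "Safety & Alignment": 14,
    "Multimodal": 10,
}


def composite_score(subscores, weights=WEIGHTS):
    """Weighted average of the subscores (assumed to be on a 0-100 scale)."""
    missing = set(weights) - set(subscores)
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(subscores[d] * w for d, w in weights.items()) / total_weight


def reweight(weights, overrides):
    """Pin the overridden weights, then rescale the rest so the total is 100."""
    unknown = set(overrides) - set(weights)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    pinned = sum(overrides.values())
    rest = {d: w for d, w in weights.items() if d not in overrides}
    rest_total = sum(rest.values())
    if pinned > 100 or (rest_total == 0 and pinned != 100):
        raise ValueError("overrides cannot be reconciled with a 100% total")
    # Shrink or grow the untouched weights proportionally.
    scale = (100 - pinned) / rest_total if rest_total else 0.0
    result = {d: w * scale for d, w in rest.items()}
    result.update(overrides)
    return result
```

For example, a multimodal-heavy team might call `reweight(WEIGHTS, {"Multimodal": 30})`, which keeps the other five dimensions in their published proportions while they share the remaining 70%, and then pass the result to `composite_score` as the `weights` argument.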

Independent benchmarks. Independent standards.

The Frontier Models Index is published under a methodology that is reproducible from public benchmark sources. AI labs cannot pay to improve grades. The data is licensed to institutional users via the API.
