How the Models Index is built.
A transparent, independently reproducible rating system for frontier AI models. Six capability dimensions. 30+ public benchmark sources. Quarterly refresh. Labs cannot pay to improve their grade — methodology, benchmark sources, and red-team disclosures are published in full.
Guiding Principles.
The Frontier Models Index exists to answer a single question for engineers, procurement teams, and AI strategy leaders: which model should actually ship in production?
- Public benchmarks only. Every input is sourced from public benchmarks (SWE-Bench Verified, GPQA-Diamond, AIME, MMLU-Pro, TAU-bench, MMMU, HarmBench), published red-team disclosures, model cards, and FASCIA's own evaluation suite, which is itself published.
- Labs cannot pay. No AI lab has paid, can pay, or has been offered the opportunity to pay for inclusion, exclusion, or modification of their model's grade.
- Quarterly refresh. Grades update every 90 days; new model releases trigger interim updates. Material changes are timestamped.
- Subscore transparency. Every grade decomposes into six public capability subscores.
- Right of correction. Labs may submit documented corrections via the published Appeals process, particularly when published benchmark results lag the actual model state.
- Benchmark contamination tracking. Models with documented benchmark-data contamination receive published penalty annotations.
The Six Capability Dimensions.
General Reasoning
Multi-domain reasoning capability. Sources: MMLU-Pro, GPQA-Diamond (graduate-level reasoning), Big-Bench Hard, ARC-AGI.
Code Generation
Production-grade code generation. Sources: SWE-Bench Verified (real-world software engineering tasks), HumanEval+, MBPP+, LiveCodeBench.
Math & STEM
Mathematical and scientific reasoning. Sources: MATH, AIME (American Invitational Mathematics Examination), FrontierMath, SciCode.
Tool Use & Agency
Multi-step agentic workflows. Sources: TAU-bench (customer-service agency), AgentBench, BFCL (Berkeley Function Calling Leaderboard), SWE-Bench Verified Agentic.
Multimodal
Vision, audio, and video understanding. Sources: MMMU, VQA, AudioBench, video-QA benchmarks.
Safety & Alignment
Refusal-of-misuse behavior and adversarial robustness. Sources: HarmBench, published red-team disclosures (Anthropic, OpenAI, UK AISI, US AISI), refusal evaluation suites.
Weighting & Scoring.
Why these weights?
General Reasoning and Code Generation tie for highest weight because they are the highest-frequency production workloads in enterprise AI deployment. Tool Use & Agency is weighted heavily because agentic workflows are the dominant frontier in 2026. Safety & Alignment is weighted at 14%: material but not dominant, reflecting that enterprise risk-sensitivity varies by application. Multimodal weight is lower because most enterprise workloads are text-primary; multimodal-heavy applications should reweight using the published methodology.
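As a rough illustration of how six subscores could roll up into one composite, and how a multimodal-heavy team might reweight, here is a minimal Python sketch. It assumes a simple weighted average on a 0-100 scale, which the text above does not confirm. Only the 14% Safety & Alignment weight comes from the text; every other weight is a placeholder chosen to respect the stated ordering (General Reasoning and Code Generation tied highest, Tool Use & Agency heavy, Multimodal lower). The names `EXAMPLE_WEIGHTS`, `composite_score`, and `reweight` are hypothetical.

```python
from typing import Dict

# ASSUMPTION: placeholder weights for illustration only. Only
# safety_alignment = 0.14 is stated in the text; the rest are invented
# to match the described ordering and to sum to 1.0.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "general_reasoning": 0.22,
    "code_generation": 0.22,
    "tool_use_agency": 0.18,
    "math_stem": 0.14,
    "safety_alignment": 0.14,
    "multimodal": 0.10,
}


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: value / total for name, value in weights.items()}


def composite_score(subscores: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted average of subscores (assumed to be on a 0-100 scale)."""
    missing = set(weights) - set(subscores)
    if missing:
        raise KeyError(f"missing subscores: {sorted(missing)}")
    normalized = normalize(weights)
    return sum(normalized[name] * subscores[name] for name in normalized)


def reweight(weights: Dict[str, float],
             overrides: Dict[str, float]) -> Dict[str, float]:
    """Apply application-specific overrides, then renormalize to sum to 1.0."""
    return normalize({**weights, **overrides})


if __name__ == "__main__":
    # Hypothetical subscores for a single model.
    subscores = {
        "general_reasoning": 90.0,
        "code_generation": 85.0,
        "tool_use_agency": 80.0,
        "math_stem": 75.0,
        "safety_alignment": 70.0,
        "multimodal": 60.0,
    }
    print(round(composite_score(subscores, EXAMPLE_WEIGHTS), 2))

    # A multimodal-heavy application raises the multimodal weight.
    vision_weights = reweight(EXAMPLE_WEIGHTS, {"multimodal": 0.30})
    print(round(composite_score(subscores, vision_weights), 2))
```

Renormalizing after an override keeps the weights summing to 1, so composites computed under different weightings stay on the same scale as the subscores.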