17 benchmarked models. 19 workflow scenarios defined across 3 weighted categories, all 19 scored in the current release. Two leaderboards: agentic tool use and code generation.
This benchmark tracks how models perform across payment APIs. Public provider comparisons are rolling out; use the month selector to compare releases over time.
All 19 workflow scenarios defined in the benchmark are scored in this published release. Each model gets a single tool (flint_api_call) and must complete each task autonomously in a loop. Scoring is on pass/fail and step efficiency: fewer API calls to complete the same task means the API is doing more of the work.
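In outline, the harness behaves like the sketch below. The single-tool, call-until-done loop and the pass/fail check come from the description above; the helper names, message format, and step cap are illustrative assumptions, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_STEPS = 20  # assumed cap; the real harness may differ

@dataclass
class ToolCall:
    name: str   # always "flint_api_call" in this benchmark
    args: dict

@dataclass
class Reply:
    text: str
    tool_call: Optional[ToolCall] = None

def run_scenario(
    model: Callable[[list[dict]], Reply],     # model under test
    execute_call: Callable[[ToolCall], str],  # sandboxed payment API
    prompt: str,
    check_final_state: Callable[[], bool],    # scenario's pass/fail oracle
) -> tuple[bool, int]:
    """Run one scenario: loop until the model stops calling the tool."""
    messages = [{"role": "user", "content": prompt}]
    steps = 0
    while steps < MAX_STEPS:
        reply = model(messages)
        if reply.tool_call is None:
            break                             # model declares the task complete
        result = execute_call(reply.tool_call)
        messages.append({"role": "tool", "content": result})
        steps += 1                            # each flint_api_call counts as a step
    return check_final_state(), steps
```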
| # | Model | Overall | Passed | Avg Steps | vs Last Mo. |
|---|---|---|---|---|---|
| 1 | OpenAI GPT-5 Mini | 98 | 19/19 | 1.1× | |
| 2 | Google Gemini 3.1 Flash | 94 | 18/19 | 1.1× | |
| 3 | Anthropic Claude Sonnet 4.6 | 94 | 18/19 | 1.1× | |
| 4 | OpenAI GPT-5.4 | 93 | 18/19 | 1.1× | |
| 5 | OpenAI o4-mini | 91 | 18/19 | 1.5× | |
| 6 | Anthropic Claude Haiku 4.5 | 91 | 18/19 | 1.2× | |
| 7 | xAI Grok 4.1 Fast | 89 | 17/19 | 1.1× | |
| 8 | Google Gemini 3.1 Pro | 89 | 17/19 | 1.1× | |
| 9 | Alibaba Qwen 3.5 Plus | 89 | 17/19 | 1.1× | |
| 10 | Anthropic Claude Opus 4.6 | 88 | 17/19 | 1.1× | |
| 11 | DeepSeek V3.2 | 87 | 18/19 | 1.4× | |
| 12 | OpenAI o3 | 84 | 17/19 | 1.3× | |
| 13 | OpenAI GPT-4.1 | 80 | 15/19 | 1.2× | |
| 14 | Google Gemini 3.1 Flash Lite | 79 | 15/19 | 1.2× | |
| 15 | xAI Grok 4 | 77 | 15/19 | 1.4× | |
| 16 | Meta Llama 4 Maverick | 38 | 4/19 | 1.3× | |
| 17 | Mistral Large 3 | 0 | 0/19 | 0.0× | |
Click any row to see the per-scenario breakdown. Avg Steps = ratio of actual tool calls to the minimum required (1.0× = optimal). · Tested 2026-03-20 · Raw data & scripts →
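For reference, the step-efficiency ratio is simple arithmetic over per-scenario logs. The sketch below assumes hypothetical field names, and assumes the average covers only passed scenarios, which is consistent with the 0.0× shown for the one model with no passes; neither detail is confirmed by the benchmark.

```python
# Step efficiency as defined above: actual tool calls / minimum required,
# averaged here over passed scenarios (an assumption). Field names are
# hypothetical.
def avg_step_ratio(results: list[dict]) -> float:
    passed = [r for r in results if r["passed"]]
    if not passed:
        return 0.0  # matches the 0.0x shown for a model with no passes
    return sum(r["actual_calls"] / r["min_calls"] for r in passed) / len(passed)

# e.g. two scenarios solved in 2 and 3 calls with minima of 2 and 2:
# avg_step_ratio -> (1.0 + 1.5) / 2 = 1.25x
```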
The code-generation leaderboard measures how well models write integration code from docs: API calls, field encoding, pagination, error handling.
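As a concrete picture of what this category exercises, here is a minimal sketch of cursor pagination with basic rate-limit handling against a payments endpoint. The URL, field names, and cursor scheme are invented for illustration and are not the benchmark's actual API.

```python
import time
import requests  # assumed HTTP client; any would do

# Hypothetical payments endpoint, purely to illustrate the skills this
# category scores: request encoding, pagination, and error handling.
BASE_URL = "https://api.example-payments.test/v1/charges"

def list_all_charges(api_key: str) -> list[dict]:
    charges: list[dict] = []
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["starting_after"] = cursor    # cursor-style pagination
        resp = requests.get(
            BASE_URL,
            params=params,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        if resp.status_code == 429:              # basic rate-limit handling
            time.sleep(float(resp.headers.get("Retry-After", "1")))
            continue
        resp.raise_for_status()                  # surface other HTTP errors
        page = resp.json()
        charges.extend(page["data"])
        if not page.get("has_more"):
            return charges
        cursor = page["data"][-1]["id"]          # resume after the last item
```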