17 benchmarked models. 19 workflow scenarios defined across 3 weighted categories, all 19 scored in the current release. Two leaderboards: agentic tool use and code generation.
This benchmark tracks how models perform across payment APIs. Public provider comparisons are rolling out; use the month selector to compare releases over time.
All 19 workflow scenarios defined in the benchmark are scored in this published release. Each model gets a single tool (flint_api_call) and must complete each task autonomously in a loop. Scoring is on pass/fail and step efficiency: fewer API calls to complete the same task means the API is doing more of the work.
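In outline, the harness behaves like the sketch below. The single-tool, call-until-done loop and the pass/fail check come from the description above; the helper names, message format, and step cap are illustrative assumptions, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_STEPS = 20  # assumed cap; the real harness may differ

@dataclass
class ToolCall:
    name: str   # always "flint_api_call" in this benchmark
    args: dict

@dataclass
class Reply:
    text: str
    tool_call: Optional[ToolCall] = None

def run_scenario(
    model: Callable[[list[dict]], Reply],     # model under test
    execute_call: Callable[[ToolCall], str],  # sandboxed payment API
    prompt: str,
    check_final_state: Callable[[], bool],    # scenario's pass/fail oracle
) -> tuple[bool, int]:
    """Run one scenario: loop until the model stops calling the tool."""
    messages = [{"role": "user", "content": prompt}]
    steps = 0
    while steps < MAX_STEPS:
        reply = model(messages)
        if reply.tool_call is None:
            break                             # model declares the task complete
        result = execute_call(reply.tool_call)
        messages.append({"role": "tool", "content": result})
        steps += 1                            # each flint_api_call counts as a step
    return check_final_state(), steps
```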
| # | Model | Overall | Passed | Avg Steps | vs Last Mo. |
|---|---|---|---|---|---|
| 1 | OpenAI GPT-5 Mini | 98 | 19/19 | 1.1× | |
| 2 | Google Gemini 3.1 Flash | 94 | 18/19 | 1.1× | |
| 3 | Anthropic Claude Sonnet 4.6 | 94 | 18/19 | 1.1× | |
| 4 | OpenAI GPT-5.4 | 93 | 18/19 | 1.1× | |
| 5 | OpenAI o4-mini | 91 | 18/19 | 1.5× | |
| 6 | Anthropic Claude Haiku 4.5 | 91 | 18/19 | 1.2× | |
| 7 | xAI Grok 4.1 Fast | 89 | 17/19 | 1.1× | |
| 8 | Google Gemini 3.1 Pro | 89 | 17/19 | 1.1× | |
| 9 | Alibaba Qwen 3.5 Plus | 89 | 17/19 | 1.1× | |
| 10 | Anthropic Claude Opus 4.6 | 88 | 17/19 | 1.1× | |
| 11 | DeepSeek V3.2 | 87 | 18/19 | 1.4× | |
| 12 | OpenAI o3 | 84 | 17/19 | 1.3× | |
| 13 | OpenAI GPT-4.1 | 80 | 15/19 | 1.2× | |
| 14 | Google Gemini 3.1 Flash Lite | 79 | 15/19 | 1.2× | |
| 15 | xAI Grok 4 | 77 | 15/19 | 1.4× | |
| 16 | Meta Llama 4 Maverick | 38 | 4/19 | 1.3× | |
| 17 | Mistral Large 3 | 0 | 0/19 | 0.0× | |
Click any row to see the per-scenario breakdown. Avg Steps = ratio of actual tool calls to the minimum required (1.0× = optimal). · Tested 2026-03-20 · Raw data & scripts →
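For reference, the step-efficiency ratio is simple arithmetic over per-scenario logs. The sketch below assumes hypothetical field names, and assumes the average covers only passed scenarios, which is consistent with the 0.0× shown for the one model with no passes; neither detail is confirmed by the benchmark.

```python
# Step efficiency as defined above: actual tool calls / minimum required,
# averaged here over passed scenarios (an assumption). Field names are
# hypothetical.
def avg_step_ratio(results: list[dict]) -> float:
    passed = [r for r in results if r["passed"]]
    if not passed:
        return 0.0  # matches the 0.0x shown for a model with no passes
    return sum(r["actual_calls"] / r["min_calls"] for r in passed) / len(passed)

# e.g. two scenarios solved in 2 and 3 calls with minima of 2 and 2:
# avg_step_ratio -> (1.0 + 1.5) / 2 = 1.25x
```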
The code-generation leaderboard measures how well models write integration code from docs: API calls, field encoding, pagination, error handling.
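As a concrete picture of what this category exercises, here is a minimal sketch of cursor pagination with basic rate-limit handling against a payments endpoint. The URL, field names, and cursor scheme are invented for illustration and are not the benchmark's actual API.

```python
import time
import requests  # assumed HTTP client; any would do

# Hypothetical payments endpoint, purely to illustrate the skills this
# category scores: request encoding, pagination, and error handling.
BASE_URL = "https://api.example-payments.test/v1/charges"

def list_all_charges(api_key: str) -> list[dict]:
    charges: list[dict] = []
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["starting_after"] = cursor    # cursor-style pagination
        resp = requests.get(
            BASE_URL,
            params=params,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        if resp.status_code == 429:              # basic rate-limit handling
            time.sleep(float(resp.headers.get("Retry-After", "1")))
            continue
        resp.raise_for_status()                  # surface other HTTP errors
        page = resp.json()
        charges.extend(page["data"])
        if not page.get("has_more"):
            return charges
        cursor = page["data"][-1]["id"]          # resume after the last item
```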