AI × Payments · Updated Monthly · Vendor Benchmark

Which payment API works best for AI agents?

17 benchmarked models. 19 workflow scenarios, defined across 3 weighted categories, all scored in the current release. Two leaderboards: agentic tool use and code generation.

Provider: Stripe (soon) · Adyen (soon) · Braintree (soon)

Month: April 2026 (soon)

This benchmark tracks how models perform across payment APIs. Public provider comparisons are rolling out; use the month selector to compare releases over time.

Agentic Workflows

Tool-use function calling

All 19 workflow scenarios defined in the benchmark are scored in this published release. Each model gets a single tool (flint_api_call) and must complete each task autonomously in a loop. Scoring combines pass/fail with step efficiency: fewer API calls to complete the same task means the API is doing more of the work.
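The single-tool loop described above can be sketched as follows. `flint_api_call` is the tool name from the benchmark, but its signature, the stub responses, and the scenario shape here are assumptions for illustration, not the real harness:

```python
from typing import Optional

def flint_api_call(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Stub for the one tool each model is given (real harness hits a live API)."""
    return {"status": 200, "method": method, "path": path}

def run_scenario(next_action, check_passed, max_steps: int = 20):
    """Drive the tool in a loop until the agent declares it is done.

    next_action: callable deciding the next move from the transcript so far,
                 returning {"type": "done"} or a tool-call description.
    check_passed: scenario-specific oracle deciding pass/fail after the run.
    Returns (passed, steps) so efficiency can be scored against the minimum.
    """
    transcript, steps = [], 0
    while steps < max_steps:
        action = next_action(transcript)
        if action["type"] == "done":
            break
        result = flint_api_call(action["method"], action["path"], action.get("body"))
        transcript.append(result)
        steps += 1
    return check_passed(transcript), steps

# Example: a scripted "agent" that completes a one-call task optimally.
actions = iter([{"type": "call", "method": "POST", "path": "/charges"},
                {"type": "done"}])
passed, steps = run_scenario(lambda t: next(actions),
                             lambda t: len(t) == 1 and t[0]["status"] == 200)
# passed == True, steps == 1 — a 1.0× run against a 1-call minimum
```

Capping the loop at `max_steps` is what lets a stalled model register as a fail rather than hanging the harness.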

| # | Provider | Model | Overall | Passed | Avg Steps |
|---|-----------|------------------------|---------|--------|-----------|
| 1 | OpenAI | GPT-5 Mini | 98 | 19/19 | 1.1× |
| 2 | Google | Gemini 3.1 Flash | 94 | 18/19 | 1.1× |
| 3 | Anthropic | Claude Sonnet 4.6 | 94 | 18/19 | 1.1× |
| 4 | OpenAI | GPT-5.4 | 93 | 18/19 | 1.1× |
| 5 | OpenAI | o4-mini | 91 | 18/19 | 1.5× |
| 6 | Anthropic | Claude Haiku 4.5 | 91 | 18/19 | 1.2× |
| 7 | xAI | Grok 4.1 Fast | 89 | 17/19 | 1.1× |
| 8 | Google | Gemini 3.1 Pro | 89 | 17/19 | 1.1× |
| 9 | Alibaba | Qwen 3.5 Plus | 89 | 17/19 | 1.1× |
| 10 | Anthropic | Claude Opus 4.6 | 88 | 17/19 | 1.1× |
| 11 | DeepSeek | DeepSeek V3.2 | 87 | 18/19 | 1.4× |
| 12 | OpenAI | o3 | 84 | 17/19 | 1.3× |
| 13 | OpenAI | GPT-4.1 | 80 | 15/19 | 1.2× |
| 14 | Google | Gemini 3.1 Flash Lite | 79 | 15/19 | 1.2× |
| 15 | xAI | Grok 4 | 77 | 15/19 | 1.4× |
| 16 | Meta | Llama 4 Maverick | 38 | 4/19 | 1.3× |
| 17 | Mistral | Mistral Large 3 | 0 | 0/19 | 0.0× |

Avg Steps = ratio of actual tool calls vs minimum (1.0× = optimal). · Tested 2026-03-20
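One plausible reading of the Avg Steps column, assumed here rather than taken from the harness, is a simple ratio of calls made to the known minimum for the scenario:

```python
def step_efficiency(actual_calls: int, minimum_calls: int) -> float:
    """Avg Steps ratio: actual tool calls over the scenario's known minimum.

    1.0x means the model took the optimal path; higher means wasted calls.
    A non-positive minimum (e.g. nothing completed) reports 0.0x, matching
    the 0.0x shown for a model that passed no scenarios.
    """
    if minimum_calls <= 0:
        return 0.0
    return actual_calls / minimum_calls
```

For example, 11 calls against a 10-call minimum scores 1.1×; whether the published figure averages per-scenario ratios or divides summed totals is not stated on the page.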

Code Generation Leaderboard (Coming Soon)

How well models write integration code from docs — API calls, field encoding, pagination, error handling.
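As a concrete illustration of one skill on that list, here is a minimal cursor-pagination loop of the kind such a leaderboard would score. The endpoint, response shape, and `next_cursor` field are hypothetical, not any provider's real API:

```python
def fetch_page(cursor=None):
    """Stub for a paginated list endpoint (a real client would make an
    HTTP GET and raise on non-2xx responses)."""
    pages = {None: {"data": [1, 2], "next_cursor": "c2"},
             "c2": {"data": [3], "next_cursor": None}}
    return pages[cursor]

def list_all():
    """Walk the cursor until the API reports no further pages."""
    items, cursor = [], None
    while True:
        page = fetch_page(cursor)
        items.extend(page["data"])
        cursor = page.get("next_cursor")
        if cursor is None:          # terminal page: no cursor means stop
            return items

# list_all() → [1, 2, 3]
```

Getting the termination condition right (stop on a missing cursor, not on an empty page) is exactly the kind of detail that separates passing from failing integration code.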

How It Works