Qwen 3.7 Max and Plus: New Best Flagship Models From China?

TLDR

Qwen 3.7 is Alibaba’s leading proprietary reasoning AI model.
It was launched on 20 May 2026 at the Alibaba Cloud Summit.
With its release, Alibaba increased the context window from 256K tokens on Qwen 3.6 Max to 1M tokens on Qwen 3.7 Max, bringing it in line with Claude Opus 4.8 and other contemporary flagship models.
Alibaba reports that the model can sustain over 35 hours of continuous execution, working on a single prompt task for more than a full day.
The API is priced at $2.50 per million input tokens and $7.50 per million output tokens.
An important limitation to be aware of is that the Max model is text-only; however, Plus Qwen 3.7 is multimodal and can understand images.
Users will be pleased to learn that Qwen 3.7 Max is compatible with the Anthropic API, enabling it to work with Claude Code straight away.

‍

What is Qwen 3.7 Max?

Qwen 3.7 Max is Alibaba's flagship model. It is a proprietary, closed-weight reasoning model that builds on the Qwen 3.6 series.

‍

Qwen 3.7 Max is a reasoning model, meaning it generates an internal chain of thought before producing a final answer.

‍

This extended-thinking approach is standard across frontier models, but Alibaba has optimised it for continuous operation. As a result, Qwen 3.6 can work independently for over 35 hours without interruption. This refers to agentic tasks, not answering a simple question, of course.

‍

In practice, Qwen 3.7 excels at coding and boasts highly accurate long-context retrieval. Most notably, it can solve complex tasks with ease. For instance, Alibaba has confirmed that it can work with over 1,000 tools in a single prompt execution.

‍

Alongside Qwen 3.7 Max, Alibaba also previewed Qwen 3.7 Plus — a multimodal sibling we'll cover below.

‍

Qwen 3.7 Features

Now that we have a general understanding of the model, let’s take a closer look at some of its most important features.

‍

1M token context window

The context window of Qwen 3.7 Max increased by around 4x to 1 million tokens, up from 256K for Qwen 3.6 Max. This is the same size as the context window of Claude Opus 4.8, and in practice, it means that the model rarely runs into issues with limited context, even across very long tasks spanning multiple hours. It is also worth noting that the model supports a maximum output of 65,536 tokens per response.

‍

Long-horizon agentic work

Alibaba reports that the model can work independently for over 35 hours at a time, managing more than 1,000 tool calls while retaining high performance. This is impressive, given that the performance of most models tends to degrade the longer they work on a task. In practice, this translates to much faster execution and a reduced token burn rate. In one internal test, Alibaba measured a speed improvement of around 10x over the previous version.

‍

Native Anthropic API compatibility

If you’re a Claude Code user, you’ll be pleased to learn that Qwen 3.7 Max natively supports the Anthropic API, meaning you can use it with Claude Code straight away. Alibaba also reports that it works very well with OpenClaw.

‍

Prompt caching

GPT 3.7 Max supports prompt caching, which is a mechanism where the model can remember — or cache — prompts that are reused multiple times, and this drastically reduces the cost of use. Cached input tokens cost $0.25 per million, which translates to a 90% discount versus the standard $2.50 input rate.

‍

Qwen 3.7 Plus

Alongside Max, Alibaba also released Qwen 3.7 Plus, which brings the ability to understand images to the table. Plus ranked #16 globally for Vision on LM Arena, so it’s not the best in class, but decent.

‍

Qwen 3.7 Benchmarks

In this section, we will examine the benchmarks for the Qwen 3.7 Max version specifically, since this is the version available on Overchat AI. We will also explain what each benchmark means and how to interpret it.

‍

To start, Qwen 3.7 Max posts a score of 56.6 on the Artificial Analysis Intelligence Index v4.0 benchmark (measuring the overall performance), and this is the highest placement for a Chinese model at the time of release.

‍

You can see the comprehensive benchmark table below, and later we’ll compare these numbers against other models in the same weight class.

‍

Category	Benchmark	Score
Coding	SWE-Bench Verified	80.4%
Coding	SWE-Bench Pro	60.6%
Coding	Terminal-Bench 2.0	69.7%
Coding	LiveCodeBench	91.6%
Reasoning	GPQA Diamond	92.4%
Reasoning	HMMT 2026 Feb	97.1%
Reasoning	Humanity's Last Exam (no tools)	41.4%
Reasoning	Apex	44.5
Agent & tool use	MCP-Atlas	76.4%
Agent & tool use	MCP-Mark	60.8%
Agent & tool use	BFCL-V4	75.0%
Agent & tool use	Kernel Bench L3 (median speedup)	1.98x
Long context	MRCR-v2 @ 128k	90.4%

‍

Qwen 3.7 Max vs Claude Opus 4.8

Claude Opus 4.8 is Anthropic's flagship, and widely considered one of the best coding models — does the new Chinese AI model beat it?

‍

Benchmark	Qwen 3.7 Max	Claude Opus 4.8
SWE-Bench Verified	80.4%	88.6%
SWE-Bench Pro	60.6%	69.2%
GPQA Diamond	92.4%	93.6%
AA Intelligence Index	56.6	61.4
Input / Output per 1M	$2.50 / $7.50	$5.00 / $25.00

‍

As you can see, Opus 4.8 performs better than Qwen 3.7 Max on every coding and reasoning benchmark in this set. However, it's important to keep in mind that Qwen 3.7 Max is much cheaper in terms of both input and output, and the difference in performance is not nearly as dramatic as the difference in cost.

‍

Qwen 3.7 Max vs GPT-5.5

GPT-5.5 is currently one of the best coding models. But there is one test in which Qwen 3.7 Max outperforms GPT 5.5.

‍

Benchmark	Qwen 3.7 Max	GPT-5.5
SWE-Bench Verified	80.4%	88.7%
SWE-Bench Pro	60.6%	58.6%
GPQA Diamond	92.4%	93.6%
Terminal-Bench 2.0	69.7%	82.7%
AA Intelligence Index	56.6	60.2
Input / Output per 1M	$2.50 / $7.50	$5.00 / $30.00

‍

That’s SWE-Bench Pro, and this is interesting because it is a coding benchmark comprising a handpicked set of realistic coding tasks. This indicates excellent real-world performance.

‍

Qwen 3.7 Max vs DeepSeek V4 Pro

The DeepSeek V4 Pro is another Chinese AI model. At first glance, it has some similarities with other models, for example, both are 1M-context reasoning models. But here’s how they compare:

‍

Benchmark	Qwen 3.7 Max	DeepSeek V4 Pro
SWE-Bench Verified	80.4%	80.6%
SWE-Bench Pro	60.6%	55.4%
GPQA Diamond	92.4%	88.8%
LiveCodeBench	91.6%	93.5%
Input / Output per 1M	$2.50 / $7.50	$0.435 / $0.87

‍

To put the above numbers into perspective, bear in mind that the DeepSeek V4 Pro is around six times cheaper.

‍

The two models essentially tie on the SWE-Bench verified coding benchmark, but the Qwen 3.7 Max takes a clear lead on the SWE-Bench Pro (meaning it performs better on harder coding questions) and the GPQA Diamond. The latter involves the model solving problems that have been handpicked by PhD-level researchers in biology, physics and chemistry. These problems are specifically designed so that the model cannot simply Google the answer but has to solve them.

‍

Qwen 3.7 Max vs Kimi K2.6

Kimi K2.6 is another AI model from China, specifically from Moonshot AI. It is renowned for its high-quality creative writing and front-end design skills, as well as its ability to deploy many sub-agents for complex tasks through Agent Swarms. But how does it compare to Qwen 3.7?

‍

Benchmark	Qwen 3.7 Max	Kimi K2.6
SWE-Bench Verified	80.4%	80.2%
SWE-Bench Pro	60.6%	58.6%
GPQA Diamond	92.4%	90.5%
LiveCodeBench	91.6%	89.6%
Terminal-Bench 2.0	69.7%	66.7%
Input / Output per 1M	$2.50 / $7.50	$0.95 / $4.00

‍

Qwen 3.7 Max wins on every benchmark in this set, but just barely — within 3 points on most of them.

The Kimi K2.6 is also roughly a third of the price and is open-weight. The biggest difference emerges when it comes to harder reasoning tasks, where Qwen 3.7 Max pulls ahead.

‍

Qwen 3.7 Max vs Gemini 3.5 Flash

Gemini 3.5 Flash is Google's new model, and it’s very interesting because Google designed it to be simultaneously fast, cheap, and extremely powerful. It matches flagship level performance at 4x the speed, and even beats Gemini 3.1 Pro on most benchmarks.

‍

Benchmark	Qwen 3.7 Max	Gemini 3.5 Flash
SWE-Bench Pro	60.6%	55.1%
AA Intelligence Index	56.6	55
Humanity's Last Exam (no tools)	40.2%	41.4%
Input / Output per 1M	$2.50 / $7.50	$1.50 / $9.00

‍

Gemini 3.5 Flash is the only model on this list priced near Qwen 3.7 Max. Qwen 3.7 Max leads on SWE-Bench Pro and Intelligence Index, but Gemini 3.5 Flash runs roughly 4x faster on output tokens per second and ships natively with multimodal support.

‍

Qwen 3.7 Max Pricing

Here’s how much Qwen 3.7 usage will cost you if you plan to deploy it via the API, per one million tokens:

‍

Standard Input: $2.50/1M tokens
Cahced Input: $0.25/1M tokens
Ouput: $7.50/1M tokens

‍

To put that into perspective, compared to Claude Opus 4.8, for example, Qwen 3.7 Max is roughly half the price on input and less than a third on output. Compared to GPT-5.5, the gap is even wider.

‍

There is one caveat: Qwen 3.7 Max is a reasoning model that thinks heavily before answering. In Artificial Analysis testing, the model generated about 97 million tokens to complete the Intelligence Index evaluation — roughly 4x the 24 million-token average for other models on the benchmark, so usage will be more expensive than flat token prices suggest.

‍

To offset the per-task cost, Qwen 3.7 Max supports prompt caching at $0.25 per million tokens — a 90% discount on cached input. For workloads that re-use long system prompts or reference documents across turns, that drops the effective input cost dramatically.

‍

Or, if you want to skip the API math entirely, you can chat with Qwen 3.7 Max on Overchat AI as part of a single subscription that also includes Claude Opus 4.8, GPT-5.5, Gemini 3.5 Flash, Kimi K2.6, and more.

‍

Qwen 3.7 Limitations

There are two limitations worth knowing about before you choose Qwen 3.7 models.

‍

Qwen 3.7 Max is text-only. Qwen 3.7 Max does not handle image or audio input. For multimodal workloads, you need to switch to Qwen 3.7 Plus.

‍

Somewhat high abstention rate for a frontier model. This means that when the model is unsure about something, instead of hallucinating an answer, it will stop from doing the task, which lowers the overall accuracy because sometimes the model refuses to continue.

‍

To explain further, Qwen 3.7 Max's hallucination rate on the AA-Omniscience benchmark dropped 21.3 percentage points compared to Qwen 3.6 Max, which is good. But raw accuracy on the same benchmark also dropped 7.6 points, and the model's attempt rate fell from 67.3% to 48.0%. In real life, you’ll get answers like "I don't know" more often than with other models, and it will refuse to answer some edge cases where other AI will at least try. Whether this will impact you materially is very hard to say until you try working with it.

‍

When Should You Use Qwen 3.7?

Reach for Qwen 3.7 Max when:

‍

You're doing math or scientific reasoning, where Qwen 3.7 Max is among the strangers models in 2026.
You're working on complex, multi-step problems that benefit from very long reasoning chains.
You're already on Claude Code and want a cheaper backend without changing your stack.

‍

Reach for something else when:

You want to chat with images, videos, or audio files. We’d suggest using Gemini 3.5 Flash instead.
You need the absolute top of the leaderboard model disregarding the cost — use Claude Opus 4.8 or GPT-5.5.
You need open weights for self-hosting or fine-tuning — use Kimi K2.6 or DeepSeek V4 Pro.
You like chatting with AI and want to get answers fast — the heavy thinking on every answer will quickly start feeling sluggish. Use a non thinking model, such as a non-thinking version of DeepSeek, Sonnet, Kimi, or GPT.

‍

FAQ

When was Qwen 3.7 Max released?

Qwen 3.7 Max was formally announced on May 20, 2026 at the Alibaba Cloud Summit in Hangzhou. The preview version first appeared on the LM Arena leaderboard around May 14, and API access opened on May 19.

‍

How much does Qwen 3.7 Max cost?

API pricing is $2.50 per million input tokens and $7.50 per million output tokens, with cached input dropping to $0.25 per million. That makes Qwen 3.7 Max roughly half the price of Claude Opus 4.8 on input and less than a third on output.

‍

Is Qwen 3.7 Max open source?

No. Qwen 3.7 Max is proprietary and closed-weight, available only through partner platforms like Overchat AI, and Alibaba Cloud Model Studio.

‍

What's the difference between Qwen 3.7 Max and Qwen 3.7 Plus?

Qwen 3.7 Max is a more powerful reasoning model of the two, but it is text-only. Qwen 3.7 can also understand images and videos, and it’s designed to balance power and speed, better suited to everyday tasks like chatting.

‍

Which is better, Qwen 3.7 vs GPT-5.5?

GPT-5.5 is a better model for most tasks, although it is also twice as expensive per token. For example, in the SWE-Bench Verified test, which measures coding performance, GPT-5.5 scores 88.7%, while Qwen-3.7 Max scores 80.4%. This percentage shows how many coding tasks the model was able to solve. GPT-5.5 typically outperforms Qwen 3.7 Max on almost all benchmarks, except for SWE-Bench Pro, where Qwen 3.7 Max slightly outperforms GPT-5.5 with 60.6% vs 58.6% of tasks solved. SWE-Bench Pro is a collection of very challenging coding tasks.

‍

Bottom Line

Qwen 3.7 Max is Alibaba's state-of-the-art reasoning model, which, upon its release, became the highest-ranked Chinese model on the Artificial Analysis Intelligence Index. While it doesn’t outperform western flagships on most tasks, it comes very close to the performance levels of Claude Opus 4.8 and GPT-5.5, while costing half to a quarter as much. It also adds a 1M context window, sharper reasoning and native Claude Code compatibility.

‍

To test Qwen 3.7 Max on your own workflows today, head to Overchat AI and start chatting with Qwen 3.7 Max.