Alibaba's Qwen 3 235B scores 93.2% on MATH-500 and 92.8% on HumanEval — both above GPT-4o — at $0.35 per million input tokens. Here's what it's best at and how to use it on Nova.
Qwen 3 235B is the flagship model from Alibaba's Qwen team, and it arrived in early 2025 with benchmark scores that matched or exceeded GPT-4o and Claude Sonnet on general language tasks at 14× lower cost. Outside the developer circles that follow open-source model releases, it is rarely discussed, which means most teams paying frontier prices have not evaluated it yet. That is a significant oversight.
Architecture
Qwen 3 235B uses a Mixture-of-Experts architecture with 235 billion total parameters and 22 billion active per forward pass. That design gives it the knowledge capacity of a massive model at the inference cost of a much smaller one. Alibaba trained it on 36 trillion tokens with strong representation of code, mathematics, scientific papers, and text in over 100 languages.
The result is a model that is unusually strong on structured reasoning — significantly better than GPT-4o on math benchmarks — while maintaining the instruction-following quality that makes proprietary models useful in production.
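To make the sparse-activation idea concrete, here is a toy NumPy sketch of top-k expert routing. It illustrates the general Mixture-of-Experts pattern, not Qwen's actual router: the dimensions, the tanh experts, and the softmax-over-selected-experts gating are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

# Each "expert" here is a tiny weight matrix; a real MoE uses large MLP blocks.
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate = rng.normal(size=(d, n_experts))

def moe_forward(x):
    logits = x @ gate                        # one router score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k best-scoring experts
    probs = np.exp(logits[top_k] - logits[top_k].max())
    probs /= probs.sum()                     # softmax over the selected experts only
    # Only k of n experts execute, so active compute is roughly k/n of total
    # capacity -- the same principle behind 22B active out of 235B total.
    return sum(p * np.tanh(x @ expert_weights[i]) for p, i in zip(probs, top_k))

print(moe_forward(rng.normal(size=d)).shape)  # (16,)
```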
Performance
| Benchmark | Qwen 3 235B | GPT-4o | Claude Sonnet 4.5 | DeepSeek R1 |
|---|---|---|---|---|
| MMLU | 88.9% | 87.2% | 88.3% | 90.8% |
| MATH-500 | 93.2% | 76.6% | 78.3% | 97.3% |
| HumanEval | 92.8% | 90.2% | 88.9% | 92.9% |
| LiveCodeBench | 70.1% | 34.2% | 38.1% | 65.9% |
| GPQA Diamond | 65.0% | 53.6% | 65.0% | 71.5% |
The MATH-500 score of 93.2% is the headline. That places Qwen 3 235B within about four points of DeepSeek R1, the top-scoring reasoning model in the table, at roughly 65% of R1's cost. For technical workloads where you need strong math but do not require R1's full chain-of-thought reasoning, Qwen 3 235B is often the most efficient choice.
LiveCodeBench is also striking: 70.1% versus GPT-4o's 34.2%. On real competitive programming problems, Qwen 3 235B is roughly twice as capable as GPT-4o at one-fourteenth the price.
Pricing Across the Model Family
| Model | Input | Output | Best for |
|---|---|---|---|
| Qwen 3 235B | $0.35/M | $1.40/M | Complex reasoning, code, technical writing |
| Qwen 3 72B | $0.18/M | $0.72/M | General chat, summarization, translation |
| Qwen 3 32B | $0.09/M | $0.36/M | Fast responses, simple Q&A |
| Qwen 3 8B | $0.06/M | $0.24/M | Classification, extraction, bulk tasks |
The full model family gives you flexibility to optimize cost by task complexity. Start with 235B for quality evaluation, then systematically test smaller sizes. Many tasks that feel like they require 235B work fine with 72B at roughly half the cost.
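To see what the tiering means in dollars, the sketch below estimates the cost of one million requests at each tier using the prices from the table above. The per-request token counts are illustrative assumptions, not measurements of any real workload.

```python
# Prices from the table above: (input $/M tokens, output $/M tokens).
PRICES = {
    "qwen3-235b": (0.35, 1.40),
    "qwen3-72b":  (0.18, 0.72),
    "qwen3-32b":  (0.09, 0.36),
    "qwen3-8b":   (0.06, 0.24),
}

def workload_cost(model, requests, in_tokens=800, out_tokens=400):
    """Total cost in dollars; token counts per request are assumptions."""
    p_in, p_out = PRICES[model]
    return requests * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${workload_cost(model, 1_000_000):,.2f} per 1M requests")
```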
Best Use Cases
Technical documentation and code generation: Qwen 3's HumanEval score of 92.8% makes it one of the top models for code. It handles complex multi-file refactors, writes accurate unit tests, and explains technical concepts precisely.
Mathematics and quantitative reasoning: MATH-500 at 93.2% is near-specialist level. Use it for financial modeling, statistical analysis, scientific computation, and any multi-step arithmetic problem.
Multilingual applications: Trained on 100+ languages with particular depth in Chinese, Japanese, Korean, and European languages. For products serving non-English markets, Qwen 3's multilingual capability is unmatched at this price point.
High-volume workloads at the 8B tier: At $0.06/M input tokens, the 8B variant processes enormous volumes at very low cost. Classification, intent detection, and entity extraction, tasks that do not need frontier reasoning, are ideal candidates; a tier-selection sketch follows this list.
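One way to operationalize this tiering is a small dispatch helper that maps task type to the cheapest tier likely to handle it. The mapping below is a heuristic sketch, not a rule, and the `qwen/qwen3-*` identifiers anticipate the Nova naming shown in the next section.

```python
def pick_model(task: str) -> str:
    """Map a task type to the cheapest tier likely to handle it (heuristic)."""
    tiers = {
        "classification": "qwen/qwen3-8b",
        "extraction":     "qwen/qwen3-8b",
        "summarization":  "qwen/qwen3-72b",
        "translation":    "qwen/qwen3-72b",
        "code":           "qwen/qwen3-235b",
        "math":           "qwen/qwen3-235b",
    }
    return tiers.get(task, "qwen/qwen3-235b")  # default to the flagship
```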
Getting Started on Nova
```python
from openai import OpenAI

# Point the standard OpenAI client at Nova's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="your_nova_api_key",
    base_url="https://api.nova.ai/v1",
)

response = client.chat.completions.create(
    model="qwen/qwen3-235b",
    messages=[{
        "role": "user",
        "content": "Write a Python function implementing binary search with full error handling.",
    }],
)

print(response.choices[0].message.content)
```

The model identifier on Nova is `qwen/qwen3-235b`. Use `qwen/qwen3-8b` for high-volume tasks where cost matters most and complexity is low.
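For the high-volume tier, the same client works unchanged. The sketch below runs a simple intent-classification task against `qwen/qwen3-8b`; the labels, prompt wording, and temperature setting are illustrative assumptions, not a prescribed recipe.

```python
# Reuses the `client` from the snippet above. Labels and prompt are illustrative.
LABELS = ["billing", "bug_report", "feature_request", "other"]

def classify(ticket: str) -> str:
    response = client.chat.completions.create(
        model="qwen/qwen3-8b",            # cheapest tier; fine for simple labeling
        messages=[{
            "role": "user",
            "content": f"Classify this support ticket as one of {LABELS}. "
                       f"Reply with the label only.\n\nTicket: {ticket}",
        }],
        temperature=0,                    # deterministic output for classification
    )
    return response.choices[0].message.content.strip()

print(classify("I was charged twice for my subscription this month."))
```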
Why It Is Underrated
Qwen 3 has not received the same coverage as DeepSeek in Western developer circles, but the benchmarks are unambiguous. It outperforms GPT-4o on every benchmark in the table above, and even on general chat quality the MT-Bench gap is under 0.2 points. For technical applications it is measurably better and costs 14× less.
If you are building anything involving code, math, or multilingual text, benchmark Qwen 3 235B against your current model this week. It will almost certainly pass your quality bar at a fraction of the cost.
Nova Team
Editorial Team at Nova