DeepSeek R1 matches or beats OpenAI o1 on MATH-500, AIME 2024, and Codeforces at 27× lower cost. We break down the architecture, benchmark data, and exact pricing for reasoning workloads.
DeepSeek R1 changed the economics of AI reasoning when it launched in January 2025. A model that matches OpenAI o1 on most benchmarks at roughly 3% of the price is not an incremental improvement — it is a structural shift in how we think about the cost of intelligence. Twelve months later, R1 remains the default choice for any application that needs serious reasoning capability without a serious budget.
How DeepSeek R1 Works
R1 is a 671 billion parameter Mixture-of-Experts model that activates only 37 billion parameters per forward pass. That architecture matters for cost: inference scales with active parameters, not total model size. The 671B figure is its knowledge capacity; the 37B is what you actually pay for at inference time.
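The ratio is worth making explicit. A back-of-the-envelope sketch (scaling intuition only, not a full serving-cost model):

```python
TOTAL_PARAMS = 671e9   # full expert pool: the model's knowledge capacity
ACTIVE_PARAMS = 37e9   # parameters actually used per forward pass

# Fraction of the network exercised on each token
print(f"Active per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # 5.5%
```

Only about one-eighteenth of the network runs on any given token, which is the intuition behind R1's per-token prices landing far below what a dense 671B model would cost to serve.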
The defining feature is its extended thinking process. Unlike a standard language model that produces output in a single autoregressive pass, R1 generates a long internal chain-of-thought before producing its final answer. This reasoning scratchpad can run to thousands of tokens — visible in the reasoning_content field of the API response. The final answer is typically concise; the thinking is verbose. You pay for both, which is why per-task cost is higher than simple models even at R1's low per-token rate.
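Because the thinking tokens are billed as output, this shows up directly in the usage block of each response. A minimal sketch, assuming the gateway returns the standard chat-completions usage fields (the `response` object comes from the Getting Started example later in this post):

```python
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")  # chain-of-thought + final answer

# Rough per-request cost at R1's list prices ($0.55 / $2.19 per 1M tokens)
cost = (usage.prompt_tokens * 0.55 + usage.completion_tokens * 2.19) / 1_000_000
print(f"cost: ${cost:.4f}")
```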
DeepSeek trained R1 with large-scale reinforcement learning, rewarding the model on final-answer correctness and output format rather than relying on a learned reward model. Trained this way, the model discovered on its own that longer, more careful reasoning chains earn higher reward. The result is dramatically better performance on problems requiring multi-step reasoning: mathematical proofs, algorithm design, financial modeling, and anything where getting the intermediate steps right is as important as the final conclusion.
Benchmark Performance
Here is how R1 compares to the major frontier models on standardized reasoning benchmarks:
| Benchmark | DeepSeek R1 | OpenAI o1 | GPT-4o | Claude Sonnet 4.5 |
|---|---|---|---|---|
| MATH-500 | 97.3% | 96.4% | 76.6% | 78.3% |
| AIME 2024 | 79.8% | 79.2% | 9.3% | 16.0% |
| Codeforces (Elo) | 2029 | 1891 | 759 | 717 |
| GPQA Diamond | 71.5% | 75.7% | 53.6% | 65.0% |
| LiveCodeBench | 65.9% | 63.4% | 34.2% | 38.1% |
| SWE-bench Verified | 49.2% | 48.9% | 38.4% | 53.7% |
On MATH-500 and AIME 2024 (competition mathematics), R1 outperforms o1. On Codeforces (competitive programming), R1's Elo rating of 2029 puts it around the 96th percentile of competitive programmers, ahead of o1's 1891. GPQA Diamond (graduate-level science questions) is the one category where o1 holds a clear lead. SWE-bench Verified (real GitHub issue resolution) is competitive across all frontier models, with Claude Sonnet 4.5 leading.
The practical implication: for math, code, and multi-step logical reasoning, R1 is the best-value model available. For pure scientific knowledge recall, o1 has a narrow edge.
Pricing Comparison
This is where R1's value becomes undeniable.
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) | vs DeepSeek R1 |
|---|---|---|---|---|
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | baseline |
| DeepSeek V3 | DeepSeek | $0.27 | $1.10 | 0.5× cost |
| Qwen 3 235B | Alibaba | $0.35 | $1.40 | 0.7× cost |
| Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | 7× cost |
| GPT-4o | OpenAI | $5.00 | $15.00 | 9× cost |
| OpenAI o1 | OpenAI | $15.00 | $60.00 | 27× cost |
Consider a production reasoning pipeline processing 100 tasks per day, with each task using 2,000 input tokens and 6,000 output tokens (accounting for chain-of-thought):
- DeepSeek R1 via Nova: (2k × $0.55 + 6k × $2.19) / 1M × 100 = $1.42/day = $42.60/month
- OpenAI o1: (2k × $15 + 6k × $60) / 1M × 100 = $39/day = $1,170/month
That is a 27× cost difference for statistically equivalent performance on most reasoning tasks. At 1,000 tasks per day, the annual difference exceeds $130,000.
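If you want to rerun these numbers for your own traffic profile, the arithmetic is a few lines of Python (prices as listed above; adjust the token counts to your workload):

```python
def monthly_cost(in_price, out_price, in_tokens=2_000, out_tokens=6_000,
                 tasks_per_day=100, days=30):
    """Projected monthly spend given per-1M-token prices in USD."""
    per_task = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return per_task * tasks_per_day * days

print(f"R1: ${monthly_cost(0.55, 2.19):,.2f}")    # $42.72
print(f"o1: ${monthly_cost(15.00, 60.00):,.2f}")  # $1,170.00
```

(The $42.60 figure above comes from rounding the daily cost to $1.42 before multiplying.)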
When to Use R1
- Mathematical computation: MATH-500 at 97.3% means near-perfect accuracy on problems that stump most AI models. Use it for algebra, calculus, statistics, financial modeling, and formal mathematical proofs.
- Competitive programming and algorithm design: A Codeforces Elo of 2029 is professional-level. For writing efficient algorithms, solving hard LeetCode problems, or reviewing code for subtle correctness issues, R1 is the most capable model at any price.
- Multi-step reasoning chains: Legal analysis, scientific research, structured argument generation, complex debugging. Any task where the intermediate steps need to be correct, not just the conclusion.
- Complex code review: R1 reasons through code behavior rather than pattern-matching, catching bugs and architectural issues that standard models miss (see the sketch after this list).
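As a concrete sketch of the code-review case: the toy function below quietly divides short trailing windows by the full window size, deflating the last few averages. It uses the `client` from the Getting Started section later in this post; the prompt and buggy function are illustrative.

```python
buggy_code = '''
def moving_average(xs, window):
    return [sum(xs[i:i + window]) / window for i in range(len(xs))]
'''

# assumes `client` is the Nova-configured OpenAI client shown in Getting Started
review = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{
        "role": "user",
        "content": "Review this function for subtle correctness issues, "
                   "reasoning through its behavior on edge cases:\n" + buggy_code,
    }],
)
print(review.choices[0].message.content)
```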
When Not to Use R1
R1 is slower than standard models. Its chain-of-thought generation adds latency — expect 10 to 30 seconds for complex problems, sometimes longer. For tasks needing sub-second responses (autocomplete, chat suggestions, simple Q&A), use DeepSeek V3 or Qwen 3 8B instead.
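One mitigation is streaming: surface the reasoning tokens as they are generated instead of waiting for the complete response. A minimal sketch, assuming the gateway passes through DeepSeek's streaming `reasoning_content` delta field alongside the standard `content` field:

```python
# assumes `client` is the Nova-configured OpenAI client shown in Getting Started
stream = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[{"role": "user", "content": "Is 2^31 - 1 prime? Think carefully."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content streams first, then content; either may be absent
    piece = getattr(delta, "reasoning_content", None) or delta.content
    if piece:
        print(piece, end="", flush=True)
```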
R1 is also text-only: it cannot process images directly. If your workflow involves images or scanned documents, pair R1 with a multimodal model for the visual extraction step.
Getting Started on Nova
```python
from openai import OpenAI

client = OpenAI(
    api_key="your_nova_api_key",
    base_url="https://api.nova.ai/v1",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-r1",
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)

# R1 returns the chain-of-thought and the final answer as separate fields
message = response.choices[0].message
reasoning = message.reasoning_content  # full thinking trace
answer = message.content               # concise final answer

print(f"Reasoning ({len(reasoning)} chars): {reasoning[:300]}...")
print(f"\nAnswer: {answer}")
```

The reasoning_content field contains the full chain-of-thought. For most applications you only need content. For debugging or explainability use cases, reasoning_content provides a complete audit trail of how the model reached its answer.
Conclusion
DeepSeek R1 is not a cheaper o1 — it is a model that matches or beats o1 on most reasoning benchmarks at 27× lower cost. For any team running reasoning workloads at scale, switching from o1 to R1 on Nova pays for itself within the first week. The engineering effort is two lines of code. The monthly savings are real and compounding.
Nova Team
Editorial Team at Nova