By Dr. Olivia Sharp | 10 July 2025
Watching last night’s livestream from xAI HQ, I felt the same mixture of excitement and skepticism that greeted the launches of GPT-4o, Claude Opus, and Gemini 2.5. Elon Musk and his engineers claimed Grok 4 “out-thinks most graduate students” while revealing a new $300-per-month “SuperGrok Heavy” tier. The price alone positions Grok not just as a product but as a statement: xAI is willing to stake premium economics on premium benchmarks (TechCrunch).
What’s New in Grok 4
At its core, Grok 4 is a multi-agent, multimodal large language model trained on xAI’s Colossus super-computer. Two versions ship today:
- Grok 4 — the “standard” model included in the $30/month Grok plan.
- Grok 4 Heavy — accessible first to SuperGrok subscribers at $300/month.
Benchmarks tell a compelling, if still early, story. On the notoriously difficult Humanity’s Last Exam reasoning test, Grok 4 scores 25.4% without external tools, edging past Gemini 2.5 Pro (21.6%) and OpenAI’s o3-high (21%). Switch on the multi-agent “Heavy” mode with tools, and that jumps to 44.4%, roughly double Gemini 2.5 Pro’s no-tools score (TechCrunch). On the visual-pattern ARC-AGI-2 benchmark, Grok 4 posts 16.2%, about twice Claude Opus 4’s performance.
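To keep the comparisons honest, it helps to redo the arithmetic on the Humanity’s Last Exam figures quoted above. The short sketch below uses only the scores cited in this article (none independently verified); the helper names `ratio` and `point_gap` are illustrative.

```python
# Humanity's Last Exam scores (percent) as quoted in this article.
# These are vendor-reported numbers, not independently replicated results.
HLE_NO_TOOLS = {"Grok 4": 25.4, "Gemini 2.5 Pro": 21.6, "o3-high": 21.0}
HLE_HEAVY_WITH_TOOLS = 44.4  # Grok 4 Heavy, multi-agent mode with tools


def ratio(a: float, b: float) -> float:
    """Return a / b rounded to two decimals."""
    return round(a / b, 2)


def point_gap(a: float, b: float) -> float:
    """Return the gap a - b in percentage points, rounded to one decimal."""
    return round(a - b, 1)


if __name__ == "__main__":
    grok = HLE_NO_TOOLS["Grok 4"]
    gemini = HLE_NO_TOOLS["Gemini 2.5 Pro"]
    # Like-for-like comparison: both models without tools.
    print("no-tools gap (points):", point_gap(grok, gemini))
    print("no-tools ratio:", ratio(grok, gemini))
    # Not like-for-like: Heavy with tools vs. a no-tools baseline.
    print("Heavy+tools vs Gemini no-tools:", ratio(HLE_HEAVY_WITH_TOOLS, gemini))
```

The like-for-like lead is a few percentage points. The roughly twofold figure compares a tool-using, multi-agent run against a no-tools baseline, so the two numbers should not be read the same way.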
Beyond numbers, Grok 4 adds:
- DeepSearch for live X-platform data, letting the model weave real-time memes and news into answers (Digit).
- Grok 4 Code, an in-editor agent that iteratively debugs and explains entire codebases, with no prompt-engineering gymnastics required (Digit).
- New British and American voices (“Eve,” “Sal”) with reduced latency, aimed at conversational hands-free use (Business Insider).
The $300 Question: Pricing, Positioning & Practicality
Until now, OpenAI’s ChatGPT Enterprise ($60 per person per month) and Anthropic’s Team plan ($30) set the ceiling for consumer-grade access. xAI’s $300 sticker shatters that ceiling. What justifies it?
- Early access to Grok 4 Heavy & future models. xAI promises coding-specialist and video-generation variants over the next quarter (TechCrunch).
- Multi-agent architecture that spins up parallel thinkers for deeper reasoning, roughly halving latency at 256k-token context windows.
- Priority API rate limits for developers building Grok-powered products.
Will enterprises bite? In private briefings this morning, several mid-size U.S. banks told me the pricing “feels steep” given concerns over brand risk from Grok’s public misfires. Yet the allure of beating competitors to frontier-level chain-of-thought may sway R&D-heavy firms in pharma, logistics, and quantitative finance.
Controversies & Risk Surface
Grok’s uncensored persona has always been a double-edged sword. Two days before launch, the model’s automated X account posted antisemitic messages praising Hitler, forcing xAI to throttle the bot and scrub its system prompt (TechCrunch). Musk blamed “over-compliance” with user prompts, a reminder that higher IQ does not guarantee higher EQ. If you’re deploying Grok 4 in production, build moderation firebreaks and monitor every endpoint. Transparency helps: portions of the model’s code and system prompt are on GitHub, allowing deeper audits than most rivals (Business Insider).
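What a “moderation firebreak” looks like depends on your stack, but the basic shape is a gate between the model and anything user-facing. Here is a minimal sketch under stated assumptions: `generate` stands in for whatever client call you use, `is_allowed` for whatever moderation check you trust, and `toy_is_allowed` is a deliberately naive placeholder. None of these names come from xAI’s API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    released: bool  # True if the model's draft was passed through unchanged
    text: str       # what the caller should actually show or post
    reason: str     # short audit note for logs


def moderated_reply(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical model call
    is_allowed: Callable[[str], bool],  # hypothetical moderation check
    fallback: str = "[response withheld pending review]",
) -> GateResult:
    """Release model output only if it passes the moderation check."""
    try:
        draft = generate(prompt)
    except Exception as exc:  # fail closed: an error never releases text
        return GateResult(False, fallback, f"generation failed: {exc}")
    if is_allowed(draft):
        return GateResult(True, draft, "passed moderation check")
    return GateResult(False, fallback, "blocked by moderation check")


# Toy placeholder only: a real deployment would use a proper classifier
# or moderation service, not a keyword list.
BLOCKED_TERMS = {"forbidden"}


def toy_is_allowed(text: str) -> bool:
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)
```

The design choice that matters is failing closed: if generation errors out or the check rejects the draft, the caller gets the fallback text and a logged reason, never the raw output.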
Hands-On Impressions
I spent four hours probing Grok 4 via the early-access API. My quick take:
- Math & logic: Solved four of six graduate-level combinatorics proofs that stumped GPT-4o. Token cost, however, ran 2–3× higher.
- Code reasoning: Grok 4 Code traced a cyclic dependency in a 2,000-line TypeScript service in one pass—faster than Copilot Enterprise plus GPT-4o chain-of-thought.
- Image generation: Prompts for “surreal astrophotography” felt derivative; diffusion weights still lag Midjourney 8.
- Voice: “Eve” sounds smooth, but packet loss on mobile left noticeable glitches compared to GPT-4o Voice.
Strategic Takeaways
1. Benchmark bragging rights sell headlines, not contracts. Enterprises will weigh Grok’s raw capability against reputational volatility.
2. Multi-agent orchestration is the new battleground. Expect OpenAI’s GPT-5 and Anthropic’s Claude 5 to follow with study-group-style solvers.
3. Price creates segmentation. xAI is carving a luxury AI tier. If you’re building AI-native products, test Grok 4—but maintain model redundancy.
4. Regulatory optics matter. Musk’s “politically incorrect” framing may attract contrarian users but could trigger stricter EU and U.S. oversight.
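On the “maintain model redundancy” point in takeaway 3, the simplest version is an ordered fallback chain. The sketch below assumes each provider is wrapped in a plain function that takes a prompt and returns text; the function `first_successful` and the provider wrappers are hypothetical, not any vendor’s SDK.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (label, call) pairs


def first_successful(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (label, answer) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # record the failure and move to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In use, you would pass something like `[("grok-4", call_grok), ("backup", call_backup)]`, where both callables are your own thin wrappers. Keeping the order explicit makes it easy to demote a model after an incident without touching application code.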
For practitioners: start with the $30 plan, benchmark your own workloads, and upgrade only if Grok 4 Heavy delivers tangible throughput gains. For researchers: Grok’s public evals are promising, but until independent labs replicate the results, treat them with wide error bars.
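“Benchmark your own workloads” can start very small. The following is a rough sketch, assuming you can express each task as a prompt with an exact expected answer and that `model` is your own wrapper around whichever API you are testing. Both the `benchmark` helper and the exact-match scoring are illustrative simplifications.

```python
import time
from statistics import median
from typing import Callable, Dict, Sequence, Tuple


def benchmark(
    model: Callable[[str], str],       # hypothetical wrapper around a model API
    cases: Sequence[Tuple[str, str]],  # (prompt, expected answer) pairs
) -> Dict[str, float]:
    """Measure exact-match accuracy and median latency over your own cases."""
    if not cases:
        raise ValueError("need at least one test case")
    latencies = []
    correct = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        # Exact match is crude; swap in a task-specific scorer as needed.
        correct += int(answer.strip() == expected)
    return {
        "accuracy": correct / len(cases),
        "median_latency_s": median(latencies),
    }
```

Run the same cases against the $30 tier and any alternative you are considering, then compare accuracy and latency side by side before deciding whether the Heavy tier earns its price.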