Digestly

Feb 28, 2025

GPT 4.5 - not so much wow

AI Explained - GPT 4.5 - not so much wow

GPT-4.5 was expected to be a significant advance in AI, scaling up the base model with more parameters and data. Instead it underperformed on science, mathematics, and coding benchmarks relative to smaller models such as Claude 3.7. Its emotional intelligence was also tested, revealing a tendency to over-sympathize with users and sometimes miss clear cues of inappropriate behavior; Claude 3.7, by contrast, demonstrated higher emotional intelligence by setting boundaries and recognizing fictional scenarios. GPT-4.5's creative writing and humor likewise fell short of Claude 3.7's more nuanced and engaging outputs. Despite these shortcomings, GPT-4.5 is positioned as a foundation for future reasoning models, which OpenAI plans to build on through reinforcement learning. Its high cost, however, raises questions about long-term viability, especially against more cost-effective models like Claude 3.7.

Key Points:

  • GPT-4.5 underperforms Claude 3.7 on benchmarks, especially in science and coding.
  • The model's emotional intelligence is questionable; it often fails to recognize inappropriate scenarios.
  • Creative writing and humor are less effective in GPT-4.5 than in Claude 3.7.
  • Despite high costs, GPT-4.5 is intended as a foundation for future reasoning models.
  • OpenAI plans to improve GPT-4.5 through reinforcement learning, but its current cost-effectiveness is debated.

Details:

1. 🔮 The Evolution and Promise of Language Models

1.1. GPT 4.5 and Scaling Challenges

1.2. Implications of 'Extended Thinking Time'

2. 💡 GPT-4.5: Features, Costs, and Access

2.1. GPT-4.5 Features and Access

2.2. Performance and Benchmark Comparison

2.3. Pricing and Feature Accessibility

3. 🤔 Emotional Intelligence and Ethical Implications

  • GPT-4.5 struggles to detect spousal abuse masked as humor, initially failing to address the harmful behavior directly.
  • Claude 3.7 responds with more nuance, identifying the harmful behavior and offering relationship-support resources.
  • GPT-4.5 tends to over-sympathize with users even in morally questionable scenarios, while Claude recognizes fictional or harmful elements and sets boundaries.
  • Emotional intelligence in AI should include setting boundaries and recognizing when a user is testing the system.
  • OpenAI emphasizes emotional intelligence as a GPT-4.5 strength, with initial access behind the $200-per-month Pro tier, implying it is aimed at deep-research use cases.

4. ✍️ Creativity and Humor: A Comparative Analysis

  • GPT-4.5 tends to tell rather than show, offering descriptions such as 'gentle yet spirited' without demonstrating them through action, whereas Claude 3.7 shows rather than tells, describing scenes vividly, like a 'sky heavy with the promise of rain.'
  • In creative writing, Claude 3.7 has a slight edge over GPT-4.5 because it conveys character and mood by showing rather than telling.
  • For humor, GPT-4.5 offers a scenario of a YouTuber being outperformed by AI, with video views skyrocketing, but the joke is narrated rather than shown.
  • Claude's humor elicited a laugh and, by showing rather than telling, delivered a more engaging experience.

5. 💰 Economic Viability and Model Efficiency

5.1. GPT Model Cost Analysis

5.2. Future Considerations and Potential Efficiencies
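The cost analysis in this section can be made concrete with a quick per-request estimate. A minimal sketch only: the per-million-token prices below are assumptions based on launch-time API pricing, not figures from the source, so verify current rates before relying on them.

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one request, given prices per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Assumed launch-time prices per million tokens (verify current rates):
# GPT-4.5 roughly $75 input / $150 output; Claude 3.7 Sonnet $3 / $15.
gpt45_cost = request_cost(2_000, 1_000, 75, 150)
claude_cost = request_cost(2_000, 1_000, 3, 15)
print(f"GPT-4.5: ${gpt45_cost:.2f} per request, Claude 3.7: ${claude_cost:.3f}")
```

At these assumed rates the per-request gap is more than an order of magnitude, which is the crux of the cost-effectiveness debate.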

6. 🔍 Comprehensive Performance Benchmarks

  • GPT-4.5 scored approximately 35% on Simple Bench, a notable improvement over GPT-4 Turbo's 25% and GPT-4's 18%.
  • Simple Bench comprises hundreds of questions, and each model is run five times to damp natural run-to-run fluctuation.
  • Claude 3.7 with Extended Thinking reached 48% early in testing, indicating strong potential on reasoning tasks.
  • Anthropic's models are gaining recognition for coding usability and emotional intelligence, suggesting room to expand their reasoning capabilities.
  • The stronger GPT-4.5 base model is expected to yield better reasoning models, much as individuals with higher IQs benefit more from prolonged thinking.
  • OpenAI's strategic aim with GPT-4.5 is to lay a robust foundation for advanced reasoning and tool-using agents in future versions.
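The five-run averaging protocol mentioned above can be sketched as simple aggregation. This is a minimal illustration, not Simple Bench's actual harness, and the per-run scores are invented:

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over repeated benchmark runs.

    Averaging several runs, as Simple Bench reportedly does with five,
    damps the natural run-to-run fluctuation of any single score.
    """
    return round(statistics.mean(scores), 1), round(statistics.stdev(scores), 1)

# Hypothetical per-run percentages for one model (illustrative only).
mean, spread = summarize_runs([34.0, 36.0, 35.5, 33.5, 36.0])
print(f"score = {mean}% ± {spread}%")  # score = 35.0% ± 1.2%
```

Reporting the spread alongside the mean is what lets small deltas between models, like the 4-7% gaps discussed later, be judged against run-to-run noise.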

7. 🔄 Future Directions in Model Development

  • Pre-training is no longer the optimal use of computational resources, as stated by the former Chief Research Officer at OpenAI.
  • Reasoning is identified as the next area of focus for 2025, with potential for higher returns compared to pre-training.
  • Increasing the base model size, such as with GPT-4.5, requires 10 times more compute for marginal intelligence gains.
  • Reasoning, especially using reinforcement learning (RL) and chains of thought, provides significantly higher returns.
  • There's an acknowledgment that reasoning might eventually face diminishing returns, similar to pre-training.
  • The potential limits of reasoning in terms of returns may become evident by the end of the current year.
  • An OpenAI employee mentioned that reasoning represents the end of an era, highlighting its significance in future model development.
  • Implementing reasoning involves challenges such as ensuring models can effectively simulate human-like thought processes.
  • OpenAI is exploring reinforcement learning to enhance reasoning capabilities, aiming for models that better understand context and deliver more accurate predictions.
  • There's a strategic shift from scaling models to enhancing their cognitive abilities, reflecting a fundamental change in AI development priorities.
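The "10 times more compute for marginal gains" point above is the usual log-linear scaling picture. A toy sketch, assuming each 10x of pre-training compute buys one fixed unit of capability; the unit and the log-linear form are illustrative assumptions, not figures from the source:

```python
import math

def capability_gain(compute_multiplier, gain_per_10x=1.0):
    """Toy log-linear scaling: each 10x of compute adds a fixed gain."""
    return gain_per_10x * math.log10(compute_multiplier)

# 10x compute buys one unit of gain; 100x buys only two, not ten.
print(capability_gain(10))   # 1.0
print(capability_gain(100))  # 2.0
```

Under this assumed curve, the same multiplicative spend keeps buying the same additive gain, which is why RL on reasoning, with its reportedly steeper returns, looks more attractive.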

8. 📈 The Challenges of Scaling and Limitations

8.1. Scaling Challenges and Model Limitations

8.2. Unexpected Success of Reasoning Models

9. 📝 Insights from the System Card

  • GPT-4.5 relied on automated evaluations instead of human red teaming, owing to prior performance issues — a strategic shift toward automation in safety assessments.
  • GPT-4.5 could persuade GPT-4o to donate money, often by requesting small amounts, making it effective at securing frequent but smaller donations.
  • On research engineer interview questions, GPT-4.5 showed only a 6% improvement over GPT-4o, indicating marginal gains on complex reasoning tasks.
  • On SWE-bench, GPT-4.5 scored 4-7% higher than GPT-4o, suggesting limited advances on that benchmark.
  • GPT-4.5 improved by 6% over GPT-4o on autonomous agentic tasks, still short of the projected 2025 performance levels.
  • On MLE-bench, GPT-4.5 achieved 11% versus GPT-4o's 8%, indicating some progress in model self-improvement capabilities.
  • Its pull-request performance edged out GPT-4o's by 1%, yet was far outstripped by Deep Research's 42% success rate.
  • Despite claims of broader world knowledge, GPT-4.5 was outperformed on language tasks by the o-series models, challenging assumptions about its comprehensive capabilities.

10. 🔍 Reflecting on GPT-4.5 and Industry Trends

  • Andrej Karpathy highlighted five examples where GPT-4.5 surpassed GPT-4, but in his poll voters preferred GPT-4's output in four of the five, indicating mixed reactions to the new model.
  • The overhyping of AI advancements serves as a cautionary tale: technical improvements matter, but user perception and acceptance matter just as much.
  • Despite the mixed reactions, GPT-4.5 is a significant step forward on many benchmarks, with potential for further gains once reinforcement learning is layered on top.
  • CEOs are shifting focus from scaling pre-training to getting a better handle on the data mixture, with Anthropic noted as potentially having an edge over OpenAI in this respect.