Digestly

Feb 3, 2025

Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research

AI Explained

OpenAI's Deep Research, powered by their o3 model, is designed to handle complex and obscure knowledge tasks. It performed well on benchmarks like 'Humanity's Last Exam,' showing significant improvement over previous models. However, it struggles with basic reasoning and common-sense tasks, often asking multiple clarifying questions instead of answering directly. The model is particularly useful for finding specific information in large datasets, as demonstrated in a test involving newsletter posts. Despite its strengths, it sometimes hallucinates or provides incorrect information, especially in practical tasks like price-history research. Its performance highlights the rapid advance of AI capabilities, though it still falls short of human-level understanding in many areas.

Key Points:

  • Deep Research excels in handling obscure knowledge tasks, outperforming previous models significantly.
  • The model struggles with common sense reasoning, often asking multiple clarifying questions.
  • It is useful for finding specific information in large datasets, saving time in research tasks.
  • Despite improvements, the model can hallucinate or provide incorrect information in practical applications.
  • AI advancements are rapid, but human-level understanding remains superior in many areas.

Details:

1. 🔍 Introduction to Deep Research by OpenAI

  • OpenAI launched Deep Research, a powerful language-model-based research system, only 12 hours before the video was recorded, intended for a wide range of applications.
  • The product functions as an agent, having been tested on 20 distinct use cases to evaluate its versatility and effectiveness.
  • In competitive benchmarks, Deep Research was compared against DeepSeek R1 and Google's Deep Research, highlighting its capabilities.
  • Access requires a $200 monthly subscription and is geographically restricted in Europe, necessitating a VPN for access.
  • Initial feedback has been positive, though significant caveats exist regarding its application scope.
  • The economic value of tasks performed by Deep Research remains uncertain, suggesting an area for further analysis and exploration.

2. 🧠 Exploring and Testing OpenAI's Deep Research

2.1. OpenAI's Deep Research Model

2.2. Humanity's Last Exam Benchmark

2.3. GAIA Benchmark Insights

2.4. Performance Insights

3. 🧪 Benchmark Comparisons and Initial Performance

  • The Deep Research model often asks multiple clarifying questions (four or five on average) instead of answering directly, which can be read either as a flaw or as a sign of more deliberate, AGI-like reasoning.
  • Performance in spatial reasoning and common sense tests showed little to no improvement, with the model failing to provide satisfactory answers in simple benchmark tests.
  • The model resorted to citing obscure websites during reasoning tasks, leading to unsatisfactory and indirect problem-solving.
  • A practical tip for users encountering model indecisiveness is to refresh and select another model, which can clear the log jam and allow continued testing.

4. 📊 Deep Dive into Performance Metrics

  • DeepSeek R1 was effective at quickly identifying two posts with a dice rating of five or above from a newsletter with fewer than 10,000 readers, demonstrating its time-saving capability.
  • R1 in Perplexity Pro showed limitations by failing to find entries with the desired dice rating, indicating room for improvement in its search functionality.
  • Deep Research is preferred for complex queries because of its comprehensive results; despite its cost, the Pro tier's 100 queries per month make it suitable for extensive research needs.
  • The free tier of Deep Research is impractical for frequent use because of its very limited number of queries per month.
  • Gemini Advanced's Deep Research tool was ineffective, failing to locate the dice ratings, and was excluded from further tests.
  • Despite frequent hallucinations, Deep Research consistently outperformed DeepSeek R1, providing more reliable results, which suggests its utility in detailed analysis tasks.
  • Efforts to customize models to prevent clarifying questions were unsuccessful, highlighting a limitation in model adaptability.
  • Identifying benchmarks where humans outperform current LLMs by at least a factor of two points to where these models most need improvement.
  • The tool demonstrated its ability to find less-recognized benchmarks, such as SimpleBench, showcasing its potential in uncovering obscure performance metrics.

5. 🔗 Real-world Applications and Model Limitations

  • Human coders still significantly outperform current models, though models like o3-mini achieve high performance in specific contexts, reaching the 90th percentile among participants.
  • Deep Research saves time by effectively distinguishing relevant from irrelevant information, improving information-retrieval efficiency.
  • The DeepSeek R1 model's poor performance on these benchmarks illustrates the remaining gap between human and AI capabilities, especially on complex tasks.
  • The Halo benchmark evaluations revealed hallucination issues in AI, with human evaluators achieving 85% accuracy; initial reports of GPT-4 Turbo achieving 40% accuracy were later contested.
  • Deep Research excels on obscure language questions, scoring 88% without prior data and outperforming GPT-4o, which scored 82% despite having direct access to the source material.
  • Smaller models struggle with large contexts, but Deep Research applies more compute to improve accuracy in answering questions.
  • A prototype for article enhancement with research directions was quickly outpaced by advanced deep research models.
  • The OpenAI presentation lacked detail on deep research browsing capabilities, limiting understanding of its full potential.

6. 🛒 Evaluating Deep Research for Consumer Advice

  • The system effectively identified the correct video where OpenAI's valuation was predicted to double by sourcing quotes from external platforms rather than directly searching YouTube, demonstrating its ability to triangulate data from various sources.
  • Despite this success, the system's inability to directly access YouTube for precise timestamps is a notable limitation, impacting its capability to verify video content accurately.
  • This limitation suggests the need for improved integration with video platforms to enhance accuracy in sourcing and timestamping content.

7. 🤖 AI Hallucinations and Future Outlook

  • The AI was tasked with finding a highly-rated toothbrush in the UK with a battery life of over 2 months, using a specified site to verify price history.
  • Despite being given the specific website (camelcamelcamel.com) for price history, the AI provided links that did not correspond to the site and falsely claimed to have used it.
  • The AI quoted a toothbrush price inaccurately, stating it had been £66 when it was actually £63, showing unreliability in its research claims.
  • The AI claimed the toothbrush's battery life was 70 days when it was actually 30 to 35 days, a significant hallucination in the data provided.
  • The AI presented a hypothetical historical low price of £40 for the toothbrush without verifying it against the actual site, misleadingly stating it as fact in its summaries.
  • These hallucinations highlight the risks of relying on AI for accurate information retrieval, emphasizing the need for robust verification processes.
  • AI hallucinations can undermine user trust and have broader implications for the adoption of AI technologies in critical applications.
  • To mitigate these risks, implementing rigorous testing, validation, and cross-referencing protocols can enhance AI reliability in information retrieval tasks.
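The cross-referencing protocol suggested above can be sketched in a few lines. This is a minimal illustration, not anything from the video: the `verify_claims` helper and the field names are hypothetical, and in practice the "sourced" values would come from independently fetching the cited pages (e.g. the camelcamelcamel listing) rather than from a hard-coded dict.

```python
def verify_claims(claimed: dict, source: dict, tolerance: float = 0.0) -> dict:
    """Compare each model-claimed field against an independently sourced value.

    Returns a dict mapping field name -> True (verified) or False (unverified
    or contradicted). Numeric claims may pass within an optional tolerance.
    """
    results = {}
    for field, claimed_value in claimed.items():
        source_value = source.get(field)
        if source_value is None:
            # A claim with no independently sourced value cannot be trusted.
            results[field] = False
        elif isinstance(claimed_value, (int, float)) and isinstance(source_value, (int, float)):
            results[field] = abs(claimed_value - source_value) <= tolerance
        else:
            results[field] = claimed_value == source_value
    return results


# The toothbrush test as data: the model claimed £66 and 70 days of battery
# life, while the actual figures were £63 and roughly 35 days.
claimed = {"price_gbp": 66, "battery_days": 70}
sourced = {"price_gbp": 63, "battery_days": 35}
print(verify_claims(claimed, sourced))  # both claims fail verification
```

Any field flagged `False` would be held back from the final summary or re-checked, which is exactly the kind of gate the toothbrush example shows is missing today.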

8. 🔮 The Rapid Advancement of AI and Its Implications

8.1. AI's Impact on White-Collar Jobs

8.2. AI in Media and Content Creation

8.3. Challenges in AI Information Processing
