Digestly

Jan 19, 2025

Latent Space: The AI Engineer Podcast - Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)
DeepSeek V3, released by the Chinese lab DeepSeek, is a 671-billion-parameter mixture-of-experts model with 256 experts, trained with native FP8 mixed precision and a multi-token prediction objective on roughly 15 trillion tokens, including synthetic reasoning data. Ranked 7th globally on the LM Arena leaderboard, it is the top open-weights model as of January 2025. The model's size and complexity make it hard to serve, requiring infrastructure like Baseten's H200 clusters to run inference workloads efficiently. Baseten's platform supports mission-critical AI applications with dedicated resources and advanced scaling capabilities, targeting high performance and compliance. The episode also covers SGLang, the framework used to serve the model, and the role of speculative decoding and quantization in making inference efficient.

Key Points:

  • DeepSeek V3 is a 671-billion-parameter model and the leading open-weights model as of January 2025.
  • Baseten uses H200 clusters to serve large models like DeepSeek V3 efficiently.
  • The SGLang framework delivers strong performance and usability, supporting advanced serving features.
  • Speculative decoding and quantization techniques improve inference efficiency.
  • Baseten provides dedicated resources for mission-critical AI applications, meeting compliance and performance requirements.

Details:

1. 🌟 DeepSeek V3: A New AI Powerhouse

  • DeepSeek V3 is a 671-billion-parameter mixture-of-experts model with 256 experts per MoE layer, trained with native FP8 mixed precision and using multi-head latent attention (a routing sketch follows this list).
  • It adds a multi-token prediction training objective and was trained on roughly 15 trillion tokens, including synthetic reasoning data from DeepSeek R1.
  • Currently ranked 7th on the LM Arena leaderboard with a score of 1319, it is the best open-weights model globally as of January 2025.
  • It continues a trend of Chinese labs releasing large open-weights models, such as Tencent's Hunyuan-Large and MiniMax's MiniMax-Text-01, both in the 400-billion-parameter range.
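To make the expert-routing idea concrete, here is a minimal top-k mixture-of-experts sketch in PyTorch. It is a generic illustration, not DeepSeek's actual gating (V3 uses its own affinity scoring, shared experts, and load balancing); the dimensions in the demo are made up, though V3 itself routes each token to 8 of its 256 experts.

```python
import torch

def moe_layer(hidden, gate_weight, experts, top_k=8):
    """Generic top-k MoE routing sketch (not DeepSeek's exact scheme).

    hidden:      [tokens, d_model] activations
    gate_weight: [d_model, n_experts] router projection
    experts:     list of n_experts feed-forward modules
    Only top_k experts run per token, which is how a 671B-parameter
    model activates only a fraction of its weights per token.
    """
    scores = torch.softmax(hidden @ gate_weight, dim=-1)   # [tokens, n_experts]
    weights, idx = scores.topk(top_k, dim=-1)              # pick k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
    out = torch.zeros_like(hidden)
    for t in range(hidden.size(0)):                        # loops kept for clarity;
        for k in range(top_k):                             # real kernels batch by expert
            out[t] += weights[t, k] * experts[idx[t, k]](hidden[t])
    return out

# Toy usage: 16 experts of width 64, route each token to 2 of them.
experts = [torch.nn.Linear(64, 64) for _ in range(16)]
y = moe_layer(torch.randn(5, 64), torch.randn(64, 16), experts, top_k=2)
```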

2. ⚙️ Overcoming Model Deployment Challenges

  • Deploying extra-large language models efficiently is a significant challenge because of their sheer size and complexity.
  • Baseten was among the first to deploy DeepSeek V3, leveraging the high memory bandwidth (4.8 TB/s per GPU) of its H200 clusters for efficient inference.
  • Collaboration with neocloud startups was crucial for Baseten, providing the shared capacity and expertise needed to bring DeepSeek V3 up quickly.
  • A single node of 8 H200s was identified as sufficient to run DeepSeek V3 inference in FP8 while still accommodating the model's KV cache (see the back-of-envelope math below).
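The arithmetic behind that sizing decision is simple enough to sketch. The numbers below are nominal: they ignore activation memory, CUDA context overhead, and the exact expert layout.

```python
# Back-of-envelope memory math for DeepSeek V3 inference on one 8xH200 node.
params = 671e9                       # total parameters
weights_fp8_gb = params * 1 / 1e9    # FP8  = 1 byte/param  -> ~671 GB
weights_fp16_gb = params * 2 / 1e9   # FP16 = 2 bytes/param -> ~1342 GB

node_gb = 8 * 141                    # 8 H200s x 141 GB HBM3e = 1128 GB

print(f"FP8 weights:  ~{weights_fp8_gb:.0f} GB of {node_gb} GB")
print(f"KV-cache headroom in FP8: ~{node_gb - weights_fp8_gb:.0f} GB")
print(f"FP16 weights: ~{weights_fp16_gb:.0f} GB (does not fit in one node)")
```

FP16 weights alone would overflow the node, which is why FP8 inference is what makes the single-node deployment feasible.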

3. 🤝 Base10's Inference Expertise

  • Baseten has been involved since the inception of Latent Space, supporting the first Demo Day in San Francisco, which led to the creation of the current podcast.
  • Philip Kiely led a well-attended workshop on TensorRT-LLM at the 2024 AI Engineer World's Fair, showcasing Baseten's expertise in model serving.
  • Baseten's guests, Amir Haghighat and lead model performance engineer Yineng Zhang, have extensive experience with DeepSeek and SGLang, running mission-critical inference workloads at scale for major AI products.

4. 📅 Exciting AI Events and Announcements

4.1. AI Events Overview

4.2. AI Engineer Summit Details

5. 🔍 Insights into DeepSeek V3 and SGLang

5.1. DeepSeek V3 Overview

5.2. Challenges with DeepSeek V3

5.3. Technical Requirements for Running DeepSeek V3

5.4. Model Adoption and Usage

5.5. Reasons for Adopting DeepSeek V3

5.6. Quantization Trends in Model Training

5.7. FP8 Training and Kernel Support

5.8. Future of MoE Models

5.9. Pricing and Resource Management

5.10. Running Models with SGLang and Truss

5.11. SGLang Performance and Features

5.12. SGLang Techniques

5.13. Constrained Decoding in SGLang

5.14. XGrammar vs. Outlines

5.15. API Speculative Execution in SGLang

6. 📈 Essentials of Mission-Critical Inference

  • Traditional fine-tuning remains relevant and widely used, even as complex reasoning models advance.
  • Engaging with customers on their current problems is key to anticipating future challenges and building solutions proactively.
  • Three pillars of mission-critical inference workloads are identified: model performance, infrastructure for horizontal scaling, and workflow enablement.
  • Model performance work centers on speculative decoding (sketched after this list), fine-tuning draft models, and robust crash recovery.
  • Infrastructure must support rapid horizontal scaling, often requiring multi-region and multi-cloud setups to meet latency targets.
  • Workflow enablement means low-latency, multi-step, multi-model inference pipelines, which mission-critical tasks depend on.
  • Advances across these pillars over the past year have significantly improved mission-critical inference workloads.
  • Mission-critical workloads must meet latency and throughput SLAs and support geo-aware routing to satisfy customer requirements and compliance standards such as HIPAA.
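Because speculative decoding is central to the model-performance pillar, here is a minimal greedy draft-and-verify loop. The `propose` and `greedy_continuations` interfaces are hypothetical stand-ins for illustration, and production systems (including SGLang's speculative decoding) use probabilistic acceptance rules rather than the exact-match check shown here.

```python
def speculative_decode(target_model, draft_model, prompt, n_draft=4, max_new=64):
    """Greedy draft-and-verify speculative decoding sketch.

    A small draft model proposes n_draft tokens; the large target model
    scores all of them in one batched forward pass and keeps the longest
    prefix it agrees with. One expensive target pass can thus emit several
    tokens, cutting latency whenever the draft model guesses well.
    Both model interfaces here are hypothetical stand-ins.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        draft = draft_model.propose(tokens, n_draft)  # cheap autoregressive guesses
        # Target's greedy choice at each of the n_draft + 1 positions,
        # computed in a single forward pass over tokens + draft.
        verified = target_model.greedy_continuations(tokens, draft)
        accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            accepted += 1
        tokens.extend(draft[:accepted])    # keep the agreed prefix
        tokens.append(verified[accepted])  # plus the target's own next token
    return tokens
```

The payoff is that latency scales with the number of target-model passes, not the number of tokens, so a draft model with a high acceptance rate can emit several tokens per pass.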

7. 🎉 Final Thoughts and Community Impact

7.1. Creating a Strategic Manifesto

7.2. Examples of Successful Initiatives and Community Engagement
