Digestly

Jan 13, 2025

Evolving MLOps Platforms for Generative AI and Agents with Abhijit Bose

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) - Evolving MLOps Platforms for Generative AI and Agents with Abhijit Bose

Capital One is embedding AI across its operations, using proprietary solutions to improve fraud detection and customer service for over 100 million customers. The company has developed a robust AI and ML platform on AWS, enabling data scientists to build and deploy models efficiently. This platform supports both traditional machine learning and generative AI, with a focus on observability and governance. Capital One's approach includes fine-tuning open-source models like Llama for specific tasks, ensuring data security by hosting models within their AWS environment. The company is also exploring agentic workflows to automate routine tasks, enhancing efficiency in areas like customer service and software development. They emphasize the importance of a platform-centric approach to optimize resources and maintain governance, while also fostering a strong data culture. Capital One is committed to continuous innovation, leveraging both internal and open-source tools to stay at the forefront of AI advancements.

Key Points:

  • Capital One uses AI to improve fraud detection and customer service, impacting over 100 million customers.
  • The company has built a comprehensive AI and ML platform on AWS, supporting both traditional and generative AI.
  • They fine-tune open-source models like Llama for specific tasks, ensuring data security by hosting them internally.
  • Agentic workflows are being developed to automate routine tasks, enhancing efficiency in customer service and software development.
  • Capital One emphasizes a platform-centric approach for resource optimization and governance, fostering a strong data culture.

Details:

1. 🔍 Exploring AI Innovations at Capital One

1.1. AI Integration and Applications

1.2. Impact on Customer Experience

1.3. Strategic AI Investments

2. 🎙️ Meet Abhijit Bose: Capital One's AI Leader

  • Creating a feedback loop from the IVR system to data models can become complex when deploying Generative AI, highlighting the need for sophisticated integration strategies.
  • Deployment of Generative AI involves chaining events, a challenging task often overlooked, requiring careful planning and execution.
  • Abhijit Bose, as the leader of enterprise AI and ML platforms at Capital One, emphasizes the importance of addressing these challenges to improve AI deployment efficiency.
  • Capital One's strategy includes refining feedback loops and event chaining to enhance model accuracy and customer experience, demonstrating a commitment to innovative AI solutions.

3. 🏗️ Building AI Platforms: Challenges and Achievements

  • Capital One has developed core AI/ML infrastructure that supports classic models like GBMs and neural networks, as well as Gen AI applications such as LLMs.
  • The focus is on building and deploying AI/ML infrastructure to support enterprise-level operations.
  • Recent advancements include the integration of generative AI applications involving large language models (LLMs) and agentic workflows.
  • Capital One has addressed significant challenges related to scalability and data privacy in AI deployment.
  • The development process incorporates cutting-edge technologies, such as cloud computing and containerization, to enhance efficiency and flexibility.
  • Specific tools used include Kubernetes for orchestration and TensorFlow for model development, which have streamlined the development cycle and improved deployment efficiency by 30%.

4. 🔄 Rebuilding ML Stack and Operational Responsibilities

  • Over four years, the ML stack was rebuilt from the ground up in AWS, now serving almost all data scientists and engineers at the company, indicating a comprehensive infrastructure overhaul.
  • The ML platform is critical, supporting functions such as credit approval and fraud detection, showing its operational significance and integration into core business processes.
  • Building MLOps capabilities was a strategic focus, enhancing operational efficiency and reliability, which is crucial for maintaining competitive advantage.
  • The introduction of Gen AI dramatically expands the scope and possibilities of ML applications, providing opportunities for innovation and improved decision-making across the company.

5. 🔍 Comparing AI Experiences: Capital One vs Facebook AI Research

  • Managed large, distributed teams across Facebook AI Research's East Coast labs in New York, Montreal, and Pittsburgh, indicating proficiency in handling complex, multi-location projects.
  • Founded multiple engineering teams at Facebook AI Research, demonstrating strong leadership and organizational skills crucial for building and scaling innovative teams.
  • The difference in scale between Facebook AI Research and Capital One implies varying resource allocation and project execution strategies, with Facebook operating on a significantly larger scale.
  • Specific metrics or examples of successful projects or methodologies implemented at these organizations would provide deeper insights and practical takeaways.
  • Understanding the strategic differences in AI research approaches between these entities can inform better resource management and team structuring in similar contexts.

6. 🏦 Capital One: A Tech Company in Banking

  • Capital One employs a research-to-production path similar to academic research in AI, focusing on areas like computer vision, NLP, and robotics.
  • Prototype systems developed in research stages are scaled up to company-wide use, indicating an efficient transition from research to practical application.
  • Capital One positions itself as a platform company, emphasizing the importance of centralized, enterprise-scale platforms, drawing parallels with tech companies like Facebook.
  • There is a strong emphasis on platform development and a robust data culture, reinforcing Capital One's identity as a tech company that operates in the banking sector.
  • The company's investment in platform development, particularly in Gen AI platforms, highlights its commitment to leveraging technology for enhanced operational capabilities.

7. 📊 The Power of Centralized Platforms

  • Centralized platforms allow for building AI capabilities that the entire company can leverage, optimizing the use of expensive resources like GPUs for high-leverage use cases.
  • Investing in centralized platforms helps apply governance and controls in a unified manner, reducing the complexity of federated setups across the company.
  • Capital One's machine learning platform has achieved the highest NPS score among all of its internal platforms, indicating high user satisfaction and successful addressing of user pain points.
  • Thousands of users within the company benefit from the centralized platform, highlighting its broad impact and effectiveness in solving challenges centrally.

8. 🧑‍💻 User Personas and Platform Adaptations

  • The platform supports various user levels, allowing interaction at a notebook or developer level and enabling direct application consumption like forecasting.
  • It accommodates different user personas, including power users who seek control over their machine learning workflows by using SDKs to code models and features independently.
  • The platform facilitates a self-service environment, allowing sophisticated users to deploy models without needing platform engineers.
  • Efforts have been made to ease the distribution of compute resources across CPUs and GPUs, including for large language models (LLMs).
  • The platform also caters to users preferring a low-code or no-code environment, where they can connect to data and add steps graphically.
  • Behind the scenes, these low-code or no-code interactions are automatically converted into code and executed on the platform.
  • The platform addresses the needs of a range of users by providing both technical depth and simplified interfaces.
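The low-code-to-code conversion described above can be sketched as compiling a declarative spec into a callable pipeline. The step registry and spec format below are invented for illustration; they are not Capital One's actual interface.

```python
def compile_pipeline(spec: list):
    """Compile a declarative, low-code pipeline spec into a callable.

    Each spec entry is a (step_name, params) pair; the registry maps
    step names to functions, mirroring how a graphical builder might
    translate boxes and arrows into executable code.
    """
    registry = {
        "load": lambda data, params: data,
        "filter": lambda data, params: [x for x in data if x >= params["min"]],
        "scale": lambda data, params: [x * params["factor"] for x in data],
    }
    steps = [(registry[name], params) for name, params in spec]

    def pipeline(data):
        for fn, params in steps:
            data = fn(data, params)
        return data

    return pipeline

# A spec a non-coder might assemble graphically
spec = [("load", {}), ("filter", {"min": 10}), ("scale", {"factor": 2})]
run = compile_pipeline(spec)
result = run([5, 10, 20])
```

The same registry could back both the SDK path (power users call the functions directly) and the graphical path (the builder emits the spec).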

9. 🌐 AWS Integration and Platform Flexibility

  • The platform is being migrated into AWS, utilizing services like SageMaker while also building proprietary solutions on top of AWS infrastructure.
  • The platform's control plane is designed on Kubernetes, allowing integration of services from AWS, open source, and proprietary solutions, enhancing flexibility and innovation.
  • Unique IP is added in areas like automating governance steps, tailored to specific processes and model risk frameworks, which are not available in public platforms.

10. ⚙️ Extending ML Platforms for Gen AI Use Cases

10.1. Extending ML Platforms for Gen AI Use Cases

10.2. Technical Aspects of ML Platform Extension

10.3. Strategic Decisions and User Feedback

11. 🔍 Balancing Traditional ML and Gen AI

  • Capital One strategically invests in both traditional machine learning and generative AI, recognizing the importance of applying the right method for the right task.
  • Traditional machine learning remains critical for tasks such as predictive modeling and risk assessment, where established methods are reliable and switching approaches would require significant effort.
  • Generative AI is increasingly utilized in areas like customer service and fraud detection, where its ability to generate responses and patterns offers new efficiencies and insights.
  • Capital One's approach involves leveraging generative AI for its advanced capabilities while continuing to rely on traditional ML where it excels, ensuring that both technologies are used to their strengths for optimal results.

12. 🔬 Observability and Anomaly Detection in Gen AI

  • Observability in Generative AI is distinct from traditional machine learning due to challenges like LLM hallucinations, necessitating new guardrails and comprehensive logging of both inputs and responses.
  • To ensure effective agentic workflows, it's essential to monitor tool execution, including proper logging, governance, and execution tracking.
  • The complexity of Gen AI observability demands infrastructure beyond traditional anomaly detection and model monitoring, highlighting the need for advanced solutions.
  • Existing anomaly detection platforms are being adapted for Gen AI, enabling automatic alerts from diverse data sources with minimal configuration.
  • Gen AI observability is evolving, leveraging platforms to manage new data types beyond structured formats, a critical ongoing development area.
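The logging-and-guardrails idea above can be sketched in a few lines of Python. Everything here is illustrative: `fake_model` stands in for a real model client, and the keyword guardrail is a placeholder check, not Capital One's actual implementation.

```python
import json
import time

def check_guardrails(response: str, banned_terms=("ssn", "password")) -> bool:
    """Placeholder guardrail: flag responses containing sensitive terms."""
    lowered = response.lower()
    return not any(term in lowered for term in banned_terms)

def observed_call(model_fn, prompt: str, log: list) -> str:
    """Call the model, logging both the input prompt and the response."""
    start = time.time()
    response = model_fn(prompt)
    record = {
        "prompt": prompt,
        "response": response,
        "latency_s": round(time.time() - start, 4),
        "guardrail_pass": check_guardrails(response),
    }
    log.append(json.dumps(record))  # structured log line for downstream monitoring
    return response

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM client."""
    return f"echo: {prompt}"

log_lines = []
reply = observed_call(fake_model, "summarize this call", log_lines)
```

Logging both sides of every call is what makes downstream anomaly detection and hallucination review possible, which is why the episode treats it as a prerequisite rather than an afterthought.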

13. 🤖 Model Selection and Fine-Tuning Strategies

  • The strategy involves utilizing open-source models like Llama as a foundational base, which are then customized using proprietary data to meet specific regulatory and governance requirements.
  • Models undergo rigorous testing for hallucination rates and accuracy, tailored to their intended deployment tasks.
  • Models are hosted within the organization's AWS environment, ensuring data does not exit their secure perimeter, thereby enhancing data security.
  • There is a continuous benchmarking process against third-party hosted models to identify the most suitable model for specific use cases.
  • Hosting models internally is prioritized to protect data while maintaining the necessary accuracy for targeted applications.
  • The fine-tuning process involves specific adjustments to model parameters based on unique use case requirements, enhancing model performance and compliance with internal standards.
  • Benchmarking includes not only accuracy but also performance metrics such as latency and computational efficiency, ensuring models are optimized for real-world applications.
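A minimal benchmarking harness along these lines might look as follows. The model names and eval set are hypothetical stand-ins, and a real harness would also track hallucination rate and computational cost.

```python
import time

def benchmark(models: dict, eval_set: list) -> dict:
    """Score each candidate model on accuracy and mean latency.

    `models` maps a name to a callable prompt -> answer; `eval_set`
    is a list of (prompt, expected_answer) pairs.
    """
    results = {}
    for name, model_fn in models.items():
        correct, total_latency = 0, 0.0
        for prompt, expected in eval_set:
            start = time.perf_counter()
            answer = model_fn(prompt)
            total_latency += time.perf_counter() - start
            correct += int(answer == expected)
        results[name] = {
            "accuracy": correct / len(eval_set),
            "mean_latency_s": total_latency / len(eval_set),
        }
    return results

# Illustrative stand-ins for an internally hosted vs. third-party model
eval_set = [("2+2", "4"), ("capital of France", "Paris")]
models = {
    "hosted-llama": lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, ""),
    "third-party": lambda p: {"2+2": "4"}.get(p, ""),
}
scores = benchmark(models, eval_set)
```

Running the same eval set continuously against both internal and third-party endpoints is what turns model selection from a one-off decision into the ongoing process described above.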

14. 🔍 Improving Customer Service with Fine-Tuned LLMs

14.1. Exploration of Fine-Tuning

14.2. Focus on Summarization and Customization

14.3. Use Case with Llama

14.4. Traditional Customer Service Approach

14.5. Implementation of New System

15. 🔗 Complexities in Deploying Gen AI Systems

  • Human agents are integral to the feedback loop, as they manage conversations and provide real-time feedback, enhancing the AI system's accuracy. This involves approximately 20,000 agents, emphasizing the need for scalability in AI solutions.
  • The system incorporates feedback from human agents on responses generated by large language models (LLMs) and retrieval-augmented generation (RAG) components, which is critical for ongoing model refinement.
  • Creating an automated feedback loop from Interactive Voice Response (IVR) systems back to the AI model is complex, requiring careful data management and automation to maintain efficiency.
  • Automation is essential for compiling annotated datasets necessary for retraining models, highlighting the need for robust data handling systems. This includes automating the collection and integration of feedback into the model training process.
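Compiling agent feedback into an annotated retraining dataset could be sketched like this. The event format and rating labels are assumptions for illustration, not Capital One's schema.

```python
import csv
import io

def compile_feedback(feedback_events: list) -> str:
    """Turn raw agent feedback events into an annotated CSV for retraining.

    Each event is (prompt, model_response, agent_rating); only rated
    events are kept, since unrated ones carry no training signal.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["prompt", "response", "label"])
    for prompt, response, rating in feedback_events:
        if rating is not None:
            writer.writerow([prompt, response, rating])
    return buf.getvalue()

events = [
    ("reset my card", "Here is how to reset...", "helpful"),
    ("close account", "I cannot help with that", "unhelpful"),
    ("balance?", "Your balance is...", None),  # agent gave no feedback
]
dataset_csv = compile_feedback(events)
```

At the scale mentioned above (roughly 20,000 agents), the hard part is not this transformation but automating collection, deduplication, and routing of these records into the training pipeline.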

16. 🚀 HPC for Gen AI: Building Efficient Training Systems

  • High-performance computing (HPC) clusters are essential for training large language models (LLMs), requiring tightly interconnected file systems, high-speed networking, and multiple GPUs working together for extended periods.
  • Implementing checkpoint and restart capabilities is crucial to avoid wasting expensive hardware time due to GPU failures, which are common during the weeks-long training processes.
  • Capital One built an in-house HPC environment on AWS, using AWS-hosted GPUs with a custom fine-tuning stack layered on top.
  • Kubernetes is used as a base for the infrastructure, with involvement in the open-source Kubeflow project to support this setup.
  • There is a need for more community-developed tools for job scheduling and workload management to reduce the necessity of custom-built solutions.
  • Partnerships with services like AWS's Elastic Kubernetes Service (EKS) have simplified processes, yet additional custom layers are required for usability by scientists and engineers.
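The checkpoint-and-restart pattern can be illustrated with a toy training loop. A real HPC job would checkpoint model and optimizer state to shared storage far less frequently; this sketch only shows the resume logic after a simulated hardware failure.

```python
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step: int, state: dict) -> None:
    with open(CKPT, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint() -> tuple:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

def train(total_steps: int, fail_at: int = -1) -> int:
    """Resume from the last checkpoint; checkpoint every step for clarity."""
    step, state = load_checkpoint()
    while step < total_steps:
        if step == fail_at:
            raise RuntimeError("simulated GPU failure")
        state["loss"] = 1.0 / (step + 1)  # stand-in for a training step
        step += 1
        save_checkpoint(step, state)
    return step

if os.path.exists(CKPT):
    os.remove(CKPT)
try:
    train(10, fail_at=6)  # job dies mid-run
except RuntimeError:
    pass
resumed_from, _ = load_checkpoint()
final_step = train(10)  # restart picks up where the failed run left off
```

The economics follow directly: without the checkpoint, the GPU-hours spent on steps 0 through 5 would be lost on every failure, which is why the episode calls this capability crucial for weeks-long runs.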

17. 🛠️ Data Annotation: Traditional ML vs Gen AI

  • In traditional machine learning, like fraud detection, data annotation involves labeling datasets manually or automatically, dividing them into training, validation, and test sets. This process is straightforward compared to generative AI models.
  • Generative AI models, especially large language models (LLMs), require a complex data annotation process, including crafting and evaluating input prompts and using LLMs as judges to score outputs.
  • LLM data annotation involves both human and automated evaluations, which adds layers of complexity not present in traditional ML annotation.
  • For example, while traditional ML may simply label a transaction as 'fraudulent' or 'non-fraudulent', LLMs need nuanced prompt responses and scoring mechanisms to train effectively.
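The contrast between the two annotation styles can be made concrete. Both functions below are simplified stand-ins: the fraud rule and the judge callable are placeholders, not production logic.

```python
def label_transaction(amount: float, country_mismatch: bool) -> str:
    """Traditional ML annotation: a simple binary fraud label."""
    return "fraudulent" if amount > 5000 and country_mismatch else "non-fraudulent"

def judge_response(prompt: str, response: str, judge_fn) -> dict:
    """Gen AI annotation: an LLM judge scores a free-form response.

    `judge_fn` stands in for a call to a judge model returning a
    1-5 score; real systems combine this with human review.
    """
    score = judge_fn(prompt, response)
    return {
        "prompt": prompt,
        "response": response,
        "score": score,
        "accepted": score >= 4,
    }

binary_label = label_transaction(9000.0, True)
annotation = judge_response(
    "Summarize the customer's issue",
    "Customer reports a duplicate charge on their card.",
    judge_fn=lambda p, r: 5,  # stand-in judge model
)
```

The extra structure in the second record (score, acceptance threshold, the judged prompt itself) is exactly the added complexity the bullets above describe.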

18. 🔄 Automating Evaluation and Scoring

  • Automating evaluation and scoring processes enhances efficiency and accuracy in generative AI platforms.
  • Incorporating both market and open-source annotation tools is crucial for comprehensive evaluations.
  • A strategic focus is placed on automating these processes from production to training systems.
  • Embedding these automated tools within platforms can streamline workflows and improve scalability.
  • Examples of successful implementations include reducing manual effort by integrating automation tools, leading to a 30% increase in processing speed.

19. 🖥️ Optimizing Inference for Gen AI

  • Utilizing GPUs is essential for performing complex reasoning tasks effectively during inference.
  • Platforms are optimized using techniques such as caching, which stores frequently accessed data for quick retrieval, and speculative decoding, which anticipates possible future data needs to reduce latency.
  • Interdisciplinary collaboration between science and engineering teams is crucial for seamless model training to deployment processes.
  • Optimization efforts focus on reducing cost per token and latency from the project's inception, ensuring efficiency at scale.
  • Key performance indicators (KPIs) include minimizing cost per token and latency while expanding use cases and user base.
  • A specialized team of scientists and engineers is tasked with decreasing inference costs and latency, ensuring optimal performance.
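Of the techniques mentioned, caching is the easiest to sketch. Below is a plain exact-match LRU cache keyed on a prompt hash; semantic caching and speculative decoding are considerably more involved and are not shown.

```python
import hashlib
from collections import OrderedDict

class PromptCache:
    """LRU cache for model responses, keyed on a hash of the prompt.

    Exact-match caching only helps for repeated prompts; it is a
    simplified illustration, not a production serving component.
    """

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get_or_compute(self, prompt: str, model_fn):
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        self.misses += 1
        response = model_fn(prompt)
        self._store[key] = response
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return response

cache = PromptCache(capacity=2)
model = lambda p: f"answer to: {p}"
first = cache.get_or_compute("what is my balance", model)
second = cache.get_or_compute("what is my balance", model)  # served from cache
```

Every cache hit avoids a GPU inference pass entirely, which is why caching appears alongside cost-per-token and latency in the KPIs listed above.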

20. 🔍 Exploring GPU Alternatives and Cloud Benefits

  • Effective utilization of GPUs in production systems involves maximizing memory and processing throughput, which is crucial for enterprise efficiency.
  • NVIDIA GPUs are predominantly used across enterprises; however, there is active exploration of alternatives like Google's TPUs and AWS's Trainium and Inferentia.
  • AWS's Trainium and Inferentia are promising alternatives to GPUs, though NVIDIA remains the primary hardware due to its established performance.
  • Being entirely cloud-based on AWS since 2017, enterprises benefit from enhanced testing and learning from different architectures, offering flexibility and innovation over traditional on-prem setups.

21. 🔍 Innovation in Reasoning Models

  • The exploration of reasoning-oriented models involves examining open-source models, such as Llama, and assessing their maturity and fit for purpose.
  • Key strategic decisions include whether to leverage existing open-source models or develop proprietary solutions based on current capabilities.
  • Continuous experimentation and assessment of trade-offs are emphasized to determine the most effective approach.
  • Criteria for evaluation include model performance, scalability, and integration capabilities.
  • Findings from experimentation indicate potential advantages in flexibility and innovation when using open-source models, while proprietary development may offer better alignment with specific organizational goals.

22. 🤖 Agentic Workflows: The Next Frontier

  • Agentic and multi-agentic workflows are poised to revolutionize task management by enhancing capabilities in intent understanding, planning, and execution, presenting a substantial technological opportunity.
  • Current research and platform development are focused on orchestrating agents to manage tasks such as back-office banking operations, contract document review, and software development, aiming to improve efficiency and accuracy.
  • Technologies like LangChain and LangGraph are being integrated into platforms to address existing gaps in agentic frameworks, suggesting a trend toward more sophisticated and reliable solutions.
  • Agentic frameworks are anticipated to automate various aspects of the software lifecycle and other routine enterprise tasks, offering significant improvements in productivity and operational efficiency across sectors.
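A bare-bones agentic loop with the kind of tool-execution logging discussed earlier might look like this. The tools and the hard-coded plan are hypothetical; real frameworks such as LangChain or LangGraph derive the plan from an LLM planner rather than accepting it as input.

```python
def run_agent(task: str, tools: dict, plan: list, audit_log: list) -> list:
    """Execute a plan step by step, logging every tool call for governance.

    `tools` maps tool names to callables; `plan` is a list of
    (tool_name, argument) pairs.
    """
    audit_log.append(f"TASK {task}")
    results = []
    for tool_name, arg in plan:
        if tool_name not in tools:
            audit_log.append(f"SKIP unknown tool: {tool_name}")
            continue
        output = tools[tool_name](arg)
        audit_log.append(f"CALL {tool_name}({arg!r}) -> {output!r}")
        results.append(output)
    return results

# Hypothetical tools for a contract-review workflow
tools = {
    "lookup_contract": lambda cid: f"contract {cid}: auto-renew clause found",
    "draft_summary": lambda text: f"summary of: {text[:20]}",
}
audit = []
outputs = run_agent(
    task="review contract C-42",
    tools=tools,
    plan=[("lookup_contract", "C-42"), ("draft_summary", "auto-renew terms...")],
    audit_log=audit,
)
```

The audit log is the governance hook: every tool invocation and its result is recorded, which mirrors the execution-tracking requirement raised in the observability section.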

23. 🔄 Open Source Frameworks and Agentic Solutions

  • Capital One strategically aligns with open source frameworks such as Kubeflow, Kubernetes, and OpenTelemetry to leverage community-driven enhancements for monitoring systems, demonstrating the power of open collaboration.
  • Engineers at Capital One actively contribute to open source projects like LangChain and LangGraph, fostering a strong, reciprocal relationship with the open source community, enhancing innovation and resource sharing.
  • The adoption of open source frameworks is not absolute; Capital One customizes these frameworks to meet specific scalability challenges and regulatory requirements, ensuring suitability for its operational context.
  • Open source tools enable the orchestration of multiple agents, significantly improving scalability and the management of complex operations.
  • For instance, Capital One has modified Kubernetes to better handle its specific scalability needs, ensuring regulatory compliance and operational efficiency.
  • These adaptations highlight how open source solutions can be tailored to fit enterprise needs without sacrificing the benefits of community collaboration.

24. 💡 Early Results in Agentic Workflows

24.1. Efficiency Gains through Agentic Workflows

24.2. Role of CodeGen in Automating Tasks

25. 🛠️ Code Generation Tools and Governance

  • A standardized stack is employed for Code Generation (CodeGen), which prioritizes a structured methodology over the selection of individual tools, ensuring consistency and efficiency in development processes.
  • Investment in governance processes for AI and generative AI has been substantial, with a focus on evaluating third-party tools and establishing robust guardrails, including data handling and server communication protocols.
  • An evaluation framework is crucial before adopting new tools, covering aspects such as the deployment of code in production environments and CI/CD processes, which ensures quality and compliance with organizational standards.
  • The standardization strategy combines homegrown and off-the-shelf tools, allowing for both flexibility and control, tailored to meet specific task requirements effectively.
  • Fine-tuning models using proprietary data is a common practice to enhance task-specific accuracy and relevance, aligning with industry standards in large companies. This approach maximizes the performance and applicability of AI models.

26. 🌟 Future of Gen AI at Capital One

  • Capital One is currently in an exploratory phase regarding Gen AI tools and has not committed to a specific tool, as the market lacks a clear leader.
  • Agentic workflows are emphasized as crucial for automating back-office processes, software engineering, and document verification, indicating a focus on integrating business process expertise with technology.
  • Capital One has introduced two new roles: AI Engineer and Applied AI Researcher, to support Gen AI initiatives, highlighting the need for talent that combines software engineering, business process understanding, and prompt engineering skills.
  • The company stresses the importance of agility and adaptability in talent to solve diverse problems and drive transformation as AI and Gen AI reshape processes and systems.
  • Capital One is preparing for significant transformations in its processes and systems through Gen AI, requiring a workforce that can envision and implement new solutions.

27. 🔮 Excitement and Challenges in AI Innovation

  • Inference costs have dropped significantly, with a 100x reduction reported by a16z over the last two years, encouraging more companies and open-source developers to innovate.
  • Rapid reduction in inference costs per token is making AI research more accessible and less expensive for freelance innovators.
  • The decline in costs is expected to spur the development of open-source models incorporating advanced reasoning, which could solve complex, multimodal problems, particularly in enterprise applications.
  • The fast-paced evolution of AI technology presents a challenge in staying updated, with new developments emerging weekly.
  • Companies like OpenAI have leveraged cost reductions to improve their models significantly, demonstrating the practical impact of these advancements.
  • Startups are increasingly able to compete with larger firms due to lower barriers to entry, fostering a more dynamic and competitive market.
  • The rapid innovation driven by cost reductions also brings challenges such as maintaining quality and security in new AI models.