No Priors: AI, Machine Learning, Tech, & Startups - No Priors Ep. 103 | With Vevo Therapeutics and the Arc Institute
The Tahoe 100 dataset is a groundbreaking release in the field of biology and AI, providing the largest single-cell RNA sequencing dataset ever created. This dataset enables a wide range of machine learning applications, including the development of virtual cell models and drug discovery. The dataset is significant because it shifts the focus from protein structure prediction to understanding cellular dynamics and interactions in different biological contexts. This is crucial for developing treatments for complex diseases. The dataset includes 100 million single-cell data points, significantly increasing the available data for training AI models. This allows for more accurate modeling of cellular responses to drugs and genetic perturbations, which is essential for advancing drug discovery. The discussion also highlights the importance of open-sourcing this data to accelerate scientific progress and foster collaboration across the scientific community. The open-source approach allows for a broader range of researchers to contribute to and benefit from the dataset, potentially leading to faster and more innovative breakthroughs in the field.
Key Points:
- Tahoe 100 is the largest single-cell RNA sequencing dataset, crucial for AI applications in biology.
- The dataset enables virtual cell modeling, shifting focus from protein structures to cellular dynamics.
- Open-sourcing the data fosters collaboration and accelerates scientific progress.
- The dataset includes 100 million data points, enhancing the training of AI models for drug discovery.
- Virtual cell models can predict drug interactions, potentially revolutionizing treatment development.
Details:
1. Welcome and Introductions
1.1. Introduction of Key Figures
1.2. Tahoe 100 Dataset and AI in Biology
2. Tahoe 100: Revolutionizing Drug Discovery
- The Tahoe 100 project is leveraging the largest single-cell RNA sequencing dataset to enable numerous machine learning applications, including drug discovery.
- The dataset marks the beginning of a new approach to drug discovery by integrating AI and machine learning into understanding and building medicines.
- Over the past 20 years, there has been significant accumulation of data on protein structures and interactions with drug molecules, but less on cellular behavior and gene function in different contexts.
- Tahoe 100 allows for the measurement of drug interactions with cells from various patient models, facilitating the construction of models similar to protein language models but in cellular contexts.
- Historical AI advancements have been driven by landmark datasets, like ImageNet in 2009 for machine vision; Tahoe 100 aims to do the same for cellular-level modeling.
- This dataset is foundational for training models that predict cellular dynamics, much as earlier datasets enabled protein structure prediction, offering insights into how biology responds in health and disease.
- The initiative seeks to enhance the study of biological systems at higher abstraction levels, beyond individual molecular machines to entire cell operations.
3. Virtual Cell Models vs Protein Prediction
3.1. Virtual Cell Models and Data Limitations
3.2. Data Advancements and Practical Applications
4. The Power of Perturbational Data
4.1. From Correlation to Causation
4.2. Generalizable Models
4.3. Data Availability and Expansion
4.4. Training Across Diverse Cell Types
4.5. Scaling Data for Models
4.6. Perturbation Toolkit for Cancer
4.7. Mosaic Platform Innovation
5. Scaling AI and Open Source in Biotech
5.1. Data Generation and Hypothesis-Free Research
5.2. Cost Reduction in Biotech Research
5.3. Platform Efficiency and Patient Variation
5.4. Changing Science through Scale
5.5. Role of Human Intuition and Scaling Laws
5.6. Cross-Domain Learning and Application
6. Building a Platform, Not Just a Hypothesis
- Vevo generated 100 million single-cell data points and open-sourced them to encourage community engagement and to scale well beyond traditional dataset sizes.
- Open sourcing was strategic for maintaining a small, talented team by leveraging community involvement, thus avoiding the need for extensive in-house hiring.
- The company aims to eliminate data bottlenecks by providing high-quality curated datasets, which are crucial for advancing virtual cell modeling.
- A new resource combines the 100 million single-cell data points with 230 million curated observational data points, yielding a 330-million-cell dataset that expands research opportunities.
- Automation and AI agents are emphasized as future directions, aiming to automate traditionally labor-intensive dry lab workflows.
- Vevo's approach reduced batch effects and data biases through systematic data collection and processing, enhancing the accuracy of data analysis.
- The dataset includes 60,000 drug-cell line interactions, generated by a small team in a short time, showcasing high efficiency and a reduced risk of data contamination.
- Current predictive models for cell response have a low accuracy of about 10%, highlighting the need for improved benchmarks and model structures.
7. Virtual Cells: The Future of Biotech
7.1. Introduction to Virtual Cells
7.2. Benefits and Challenges of Virtual Cell Models
8. Global Competition in Biotech Innovation
8.1. Single Hypothesis vs. Platform Companies
8.2. Emergence of Chinese Biotechs
8.3. Challenges in Biotech Operations
8.4. Innovation in Biotech
8.5. Strategic Shifts for Success
9. AI's Role in Drug Discovery's Future
- AI efficiently summarizes clinical trial documents, cutting costs in regulatory filings.
- Virtual cell models improve target identification, reducing discovery costs and increasing accuracy.
- AI in biotech is at an inflection point, similar to the 'ImageNet moment', driven by data and sophisticated models.
- Single-cell sequencing allows time-resolved analysis at single-cell resolution, a new capability for the field.
- The Evo2 model, trained on 9.3 trillion nucleotides, predicts pathogenicity with 94% accuracy, showcasing AI's genetic analysis potential.
- AI in biology is currently between the stages of GPT-1 and GPT-2, with protein models being more advanced.
- AI platforms can increase drug discovery success rates from 10% to 30%, despite a 10-year development cycle.
- The shift to AI in drug discovery is heavily reliant on data, marking a major methodological change.