Digestly

Mar 24, 2025

Differentially Private Synthetic Data without Training

Microsoft Research - Differentially Private Synthetic Data without Training

Zinan Lin, an expert in privacy for generative models, introduces Private Evolution, a method for generating differentially private synthetic data. The approach leverages foundation models through inference APIs alone, avoiding any need for access to model weights or for training, and thus keeps user data private. It uses random and variation APIs to generate synthetic data that closely resembles a private dataset while satisfying differential privacy: synthetic samples are iteratively selected and refined according to their similarity to the private data, with Gaussian noise added to the selection step to maintain the privacy guarantee. The approach is computationally efficient, scalable, and can outperform traditional DP fine-tuning in certain scenarios. Lin also discusses limitations of Private Evolution, such as its dependency on the availability of a suitable foundation model, along with potential solutions and extensions, including combining multiple models and using non-neural-network tools for data synthesis.

Key Points:

  • Private evolution generates differentially private synthetic data using foundation models without training, ensuring privacy and data utility.
  • The method uses random and variation APIs to iteratively refine synthetic data, maintaining privacy through Gaussian noise.
  • It is computationally efficient, scalable, and can outperform traditional DP fine-tuning methods in some cases.
  • Limitations include dependency on suitable foundation models; solutions include combining multiple models and using non-neural network tools.
  • Private evolution is open-source and encourages community contributions to expand its capabilities.
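The iterative loop described above (random/variation APIs plus noisy nearest-neighbor voting) can be sketched in a few lines of Python. This is a toy illustration on 2-D points, not the library's actual API: `random_api`, `variation_api`, and `dp_nn_histogram` are placeholder names standing in for a foundation model's unconditional-generation and variation endpoints and for the DP nearest-neighbor histogram step.

```python
import numpy as np

def random_api(n):
    """Stand-in for a foundation model's RANDOM API: draw n unconditional
    samples. Here we sample 2-D points; a real deployment would call an
    image or text generation endpoint."""
    return np.random.randn(n, 2)

def variation_api(samples, scale=0.3):
    """Stand-in for a VARIATION API: return a perturbed variant of each
    sample (a real API would ask the model for 'similar' samples)."""
    return samples + scale * np.random.randn(*samples.shape)

def dp_nn_histogram(private_data, candidates, sigma):
    """Each private point votes for its nearest synthetic candidate;
    Gaussian noise on the vote counts provides the DP guarantee."""
    votes = np.zeros(len(candidates))
    for x in private_data:
        votes[np.argmin(np.linalg.norm(candidates - x, axis=1))] += 1
    return votes + np.random.normal(0, sigma, size=len(candidates))

def private_evolution(private_data, n_synthetic=50, iterations=5, sigma=1.0):
    # Start from unconditional samples, then evolve them toward the
    # private distribution using only noisy aggregate feedback.
    population = random_api(n_synthetic)
    for _ in range(iterations):
        hist = dp_nn_histogram(private_data, population, sigma)
        probs = np.clip(hist, 0, None)
        probs = (probs / probs.sum() if probs.sum() > 0
                 else np.full(len(hist), 1.0 / len(hist)))
        # Resample candidates in proportion to their noisy vote counts,
        # then ask the variation API for perturbed versions of them.
        parents = population[np.random.choice(len(population),
                                              n_synthetic, p=probs)]
        population = variation_api(parents)
    return population
```

Note that the private data only ever influences the noisy histogram; the model itself is never trained or updated, which is the core idea the talk emphasizes.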

Details:

1. Welcome to the Cryptography Talk Series ๐ŸŽ™๏ธ

  • Zinan Lin is a leading expert in privacy for generative models, having graduated from CMU in 2023.
  • The focus of the talk is Private Evolution, a method in privacy for generative models.

2. Exploring Differential Privacy in Generative Models ๐Ÿ”

  • Differential Privacy (DP) is used to protect privacy in data generation with generative models, offering a balance between data utility and privacy protection.
  • Private Evolution is highlighted as a particularly elegant use of DP, allowing models to generate data without compromising individual data points.
  • DP techniques are applied to generative models to prevent the leakage of sensitive information, ensuring compliance with privacy regulations while maintaining model performance.
  • The implementation of DP in generative models can prevent adversarial attacks by ensuring that individual data points cannot be reverse-engineered from the generated data.
  • Examples include the use of DP in training datasets where sensitive information needs to be protected, ensuring that the output data remains useful for analysis without revealing private details.
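To make the DP mechanism behind these guarantees concrete, here is a minimal sketch of the standard Gaussian mechanism applied to a counting query. The calibration formula (sigma = sensitivity · sqrt(2 ln(1.25/delta)) / epsilon) is the textbook one for (epsilon, delta)-DP; the dataset and query below are illustrative assumptions, not from the talk.

```python
import numpy as np

def gaussian_mechanism(true_value, sensitivity, epsilon, delta):
    """Release a statistic with (epsilon, delta)-DP by adding Gaussian noise
    calibrated as sigma = sensitivity * sqrt(2 * ln(1.25/delta)) / epsilon."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return true_value + np.random.normal(0, sigma)

# Counting query over a private dataset: adding or removing one person
# changes the count by at most 1, so the sensitivity is 1.
ages = [34, 29, 51, 42, 38, 27]
noisy_count = gaussian_mechanism(sum(a > 30 for a in ages),
                                 sensitivity=1, epsilon=1.0, delta=1e-5)
```

The noise masks any single individual's contribution, which is exactly the property that prevents generated outputs from being traced back to one data point.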

3. The Value and Challenges of Data Privacy ๐Ÿ”’

  • Differentially private data training at Microsoft involves collaborative efforts across multiple teams and interns, showcasing the importance of teamwork in tackling data privacy challenges.
  • Data is described as 'the new oil,' underlining its essential role in driving technologies like foundation models that depend on large datasets for training.
  • Beyond major technological models, data is crucial in everyday workflows, highlighting its pervasive influence in both organizational and personal processes.
  • A significant challenge in data privacy is balancing the need for extensive data in training models with the necessity of protecting individual privacy.
  • Microsoft's approach includes implementing differentially private data training to ensure data privacy while maintaining the efficacy of large-scale models.
  • A case study from Microsoft demonstrates how cross-functional collaboration led to innovative solutions in data privacy, providing a model for other organizations facing similar challenges.

4. Understanding Differential Privacy Mechanisms ๐Ÿ›ก๏ธ

4.1. Overview of Differential Privacy

4.2. Risks in Data Handling and Privacy

4.3. Synthetic Data and Privacy Guarantees

5. Differential Privacy in ML: Mechanisms and Benefits ๐Ÿค–

5.1. Mechanisms of Differential Privacy

5.2. Benefits of Differential Privacy

6. Challenges of Current Differential Privacy Techniques โš ๏ธ

6.1. Traditional Differential Privacy Techniques

6.2. Modern Approaches and Current Challenges

7. Introducing the Private Evolution Algorithm ๐ŸŒฑ

  • Stronger pre-trained models yield better synthetic data quality, but the strongest models are often API-based and offer no way to fine-tune them with DP.
  • OpenAI models provide fine-tuning APIs but lack DP guarantees, unsuitable for privacy-sensitive tasks.
  • Sending private data to third-party services for DP fine-tuning poses privacy risks, particularly for sensitive information like medical records.
  • Open-source models allow self-tuning with DP but often don't match the quality of closed-source API models.
  • Even if open-source models match performance, DP fine-tuning is computationally expensive, raising costs and resource needs.
  • The rise of models like ChatGPT in 2023 emphasizes the importance of advanced machine learning models.

8. Mechanics of Private Evolution ๐Ÿ”„

8.1. Inference API and Data Privacy

8.2. Quality and Performance

8.3. DP Guarantee and Potential Integration

8.4. Research and Publications

8.5. Workflow of Private Evolution

9. In-depth Look at Private Evolution's Functionality ๐Ÿงฉ

9.1. Overview of APIs Used

9.2. Implementing APIs for Text

9.3. Foundation Models and Data Distribution

9.4. Algorithm Steps for Data Synthesis

9.5. Concrete Example and Iterative Process

9.6. Real-world Applications and Results

10. Discussion on DP Guarantees and Processes โ“

10.1. Understanding DP Guarantees

10.2. Processes in DP-based Synthetic Data Generation

11. Comparative Results and Insights ๐Ÿ“Š

  • The new algorithm achieves superior image quality at a better privacy-utility trade-off, evidenced by lower FID scores than Google's 2023 state-of-the-art DP fine-tuning method.
  • For a target FID of 8 or 9, the new method needs an epsilon smaller than one, a far stronger privacy guarantee than the previous method's epsilon of 32.
  • On text data, the algorithm raises accuracy from 32% to 37%, outperforming the previous state-of-the-art methods from 2023, a substantial performance improvement.
  • The older DP fine-tuning method won an honorable mention at ACL 2023, even though it has since been surpassed and carries notable implementation complexity and communication costs.

12. Efficiency and Computational Costs ๐Ÿ’ป

  • Text processing achieves up to a 65% speed increase when the same model is used for both text and images, versus no speed increase for images alone.
  • Using the same model for text and images also reduces computational costs, making operations cheaper overall.
  • DP fine-tuning consumes mostly GPU hours, whereas some of Private Evolution's steps, such as voting and aggregation, run on CPU.
  • A fair efficiency comparison therefore adds up both GPU and CPU hours; because CPU hours are much cheaper, the overall cost still drops even when the CPU work is included.
  • Employing a unified model for both text and images significantly reduces the time and resources needed for processing, enhancing overall computational efficiency.

13. Limitations and Areas for Improvement in Private Evolution ๐Ÿšง

  • Private evolution requires foundation models and does not involve training them, leading to significant challenges when adapting to new data distributions that differ from pre-training datasets.
  • A state-of-the-art DP fine-tuning approach achieves 97% accuracy on the simple MNIST dataset, with recognizable handwritten-digit samples, demonstrating its effectiveness in adapting to new distributions.
  • In contrast, private evolution using a foundation model pre-trained on ImageNet results in poor quality samples and only 30% accuracy, due to its inability to adapt the model to new data distributions.
  • Private evolution does not modify the weights of the foundation model, limiting its ability to generate quality samples if the foundational model lacks exposure to similar data during training.
  • DP fine-tuning adapts the weights of the foundation model, allowing it to significantly alter the generated distribution and improve performance.

14. Advancements and Extensions in Private Evolution ๐Ÿ“ˆ

14.1. Utilizing Multiple Foundation Models

14.2. Benefits of Combined Model Use

14.3. Expansion Beyond Foundation Models

15. Cross-Technology Integration Opportunities ๐Ÿ”—

15.1. Parameter-Based Face Image Generation

15.2. Randomized Parameter Perturbations

15.3. Improving Classification Accuracy with Data Synthesis

15.4. API Integration for Enhanced Generation

15.5. Two-Stage Image Generation Process

16. Applications in Federated Learning ๐ŸŒ

  • A paper by Hou et al. from CMU and Meta applied Private Evolution to federated learning and was selected for an oral presentation at ICML.
  • Traditional federated learning involves a central server broadcasting a model to each device, which then trains it with local data and sends updates back, facing challenges like high communication costs and resource limitations on mobile devices due to large models.
  • The proposed approach replaces the model broadcast with private evolution to generate synthetic data sent to clients, reducing communication costs and maintaining data privacy.
  • Clients run Private Evolution locally, voting for the synthetic samples that best match their local data; the votes are sent back to the server, which aggregates them in a way that ensures Differential Privacy (DP).
  • The result is DP synthetic data representing all clients, which can then be used for standard fine-tuning, with the DP guarantee applied at the client level to protect each client's data.
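The client-vote / server-aggregate pattern described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: `client_vote` and `server_aggregate` are hypothetical names, the candidates are plain 2-D points standing in for server-generated synthetic samples, and plain Gaussian noise stands in for whatever aggregation mechanism the actual system uses.

```python
import numpy as np

def client_vote(local_data, candidates):
    """On-device step: each local record casts one vote for the nearest
    synthetic candidate. Raw data never leaves the client."""
    votes = np.zeros(len(candidates))
    for x in local_data:
        votes[np.argmin(np.linalg.norm(candidates - x, axis=1))] += 1
    return votes

def server_aggregate(all_client_votes, sigma):
    """Server-side step: sum the per-client vote vectors and add Gaussian
    noise, so only a noisy aggregate (not raw data or model updates)
    is ever revealed."""
    total = np.sum(all_client_votes, axis=0)
    return total + np.random.normal(0, sigma, size=total.shape)
```

Because each client transmits only a short vote vector rather than full model weights, this also captures the communication-cost saving the talk attributes to the approach.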

17. Concluding Remarks and Future Directions ๐Ÿ”ฎ

  • The Private Evolution framework unlocks the value of user data while ensuring privacy protection, without requiring training or access to weights, making it compatible with open-source models, API-based models, and non-neural-network tools.
  • The framework is computationally inexpensive and performs well across multiple domains, matching or outperforming DP fine-tuning in some cases.
  • An open-source Python library for private evolution is available on GitHub, intended to consolidate advances in this field into one accessible location.
  • The design of the library is modular, facilitating easy combination of different components and advancements.
  • Community contributions to the library are encouraged; the repository also includes a curated list of relevant papers on Private Evolution.
  • Since Private Evolution was published, community-driven follow-up work has grown beyond the original contributions, with the works discussed in the talk and more research ongoing.