Streamlining Databricks: CI/CD your Notebooks with DevOps Pipelines and orchestrate via Azure Data Factory (Series)
In this series I'm going to show you how to provision your Databricks infrastructure with BICEP and how to connect your workspace to Azure's Entra ID to manage users & groups. Furthermore, I'll show you how to deploy your notebooks across environments with YAML pipelines and orchestrate them with ADF.
Introduction
I am thrilled to announce my participation as a speaker at the upcoming Databricks Oslo User Group MeetUp on March 5th, 2024, hosted at Microsoft Norway. With just one hour to cover an extensive topic, I've devised a solution: a comprehensive article series. This series dives deep into the intricacies of Databricks, focusing on efficient workspace provisioning, user management, notebook deployment with DevOps pipelines, and orchestration via Azure Data Factory. MeetUp attendees and enthusiasts alike will benefit from detailed explanations and accompanying code, ensuring a thorough understanding of these essential concepts.
Where to start?
Is this the first time you've come across any of my articles, or are you new to these topics? If so, I'd strongly suggest you take a look at these two articles:
If not, then I'll assume you don't need any introduction to these topics, as I'll dive almost straight into the code 🤞
Part 1. Provision your Workspace infrastructure with BICEP
Discover the power of BICEP in provisioning a robust Databricks workspace infrastructure, including Azure Data Factory, a Key Vault and a Storage account. I'll use some PowerShell to configure the necessary RBAC permissions between resources, keeping data management and processing efficient and scalable... plus tips & tricks as usual 😉
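To give you a taste of what "infrastructure as code" means here, below is a minimal Python sketch of pushing a compiled Bicep template (the ARM JSON produced by `az bicep build`) into a resource group with the Azure SDK. Part 1 itself uses Bicep plus PowerShell; the subscription ID, resource group, file and parameter names below are placeholders, not the series' actual values.

```python
# Minimal sketch: deploying a compiled Bicep template (ARM JSON) with the Azure Python SDK.
# All names are placeholders; Part 1 of the series does this with Bicep + PowerShell instead.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "rg-databricks-dev"         # hypothetical resource group

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# main.json is the output of `az bicep build --file main.bicep`
with open("main.json") as f:
    template = json.load(f)

deployment = client.deployments.begin_create_or_update(
    resource_group,
    "databricks-workspace-deployment",
    {
        "properties": {
            "template": template,
            "parameters": {"workspaceName": {"value": "dbw-demo-dev"}},  # hypothetical parameter
            "mode": "Incremental",
        }
    },
).result()

print(f"Provisioning state: {deployment.properties.provisioning_state}")
```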
Part 2. Manage Users with Azure's Entra ID
Did you know that users with the Contributor role on a Resource Group automatically become administrators of any Databricks workspace deployed in it the moment they log in? Needless to say, this is not good practice from either a security or a governance perspective. The solution is to configure the SCIM provisioning connector and use Azure's Entra ID security groups to separate users from administrators; I'll show you how in this part!
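The SCIM connector itself is configured in Entra ID, so there's no code to write for the provisioning step. Still, a quick way to check that your Entra ID groups actually landed in the workspace is to query the Databricks SCIM 2.0 Groups API, as in this small sketch (the workspace URL and token environment variable are assumptions):

```python
# Minimal sketch: listing workspace groups via the Databricks SCIM 2.0 API to verify
# that Entra ID security groups were provisioned. URL and token are placeholders.
import os

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]  # assumes a token is available in an environment variable

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Groups",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for group in resp.json().get("Resources", []):
    members = [m.get("display") for m in group.get("members", [])]
    print(f"{group['displayName']}: {members}")
```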
Part 3. Deploying Notebooks across environments with DevOps YAML Pipelines
Learn to harness the power of DevOps YAML pipelines to create artifacts from collections of Python files and effortlessly deploy them to Databricks workspaces. Gain insights into optimizing your workflow for efficiency and reliability, ensuring smooth transitions between development, testing, and production environments.
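The heart of such a pipeline is the deployment step, and it helps to picture what that step actually does. As a rough sketch, and under the assumption that the artifact is simply a folder of .py files, the step boils down to pushing each file to the workspace through the Databricks Workspace import API; the host, token and folder names below are placeholders, not the pipeline variables used later in the series.

```python
# Minimal sketch of what a YAML pipeline's deploy step could do: push every .py file in
# an artifact folder to a Databricks workspace via the Workspace import API.
import base64
import os
from pathlib import Path

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-....azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]           # injected by the pipeline, e.g. from Key Vault
ARTIFACT_DIR = Path("notebooks")                 # hypothetical artifact folder
TARGET_FOLDER = "/Shared/etl"                    # hypothetical workspace folder

headers = {"Authorization": f"Bearer {TOKEN}"}

for notebook in sorted(ARTIFACT_DIR.glob("*.py")):
    payload = {
        "path": f"{TARGET_FOLDER}/{notebook.stem}",
        "language": "PYTHON",
        "format": "SOURCE",
        "overwrite": True,
        "content": base64.b64encode(notebook.read_bytes()).decode("utf-8"),
    }
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/workspace/import",
        headers=headers,
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Deployed {notebook.name} -> {payload['path']}")
```

Running the same script against the dev, test and prod workspaces with different variable values is essentially what the multi-stage YAML pipeline automates.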
Part 4. Orchestrate your Notebooks via Azure Data Factory
Integrating Azure Databricks with Azure Data Factory (ADF) allows you to seamlessly orchestrate data workflows and execute Databricks notebooks within your data pipelines. This integration lets you leverage the strengths of both: Databricks provides scalable analytics and machine learning capabilities, while ADF complements this by enabling you to schedule, trigger, and manage the data movements and transformations. This unified approach simplifies development, monitoring, and maintenance.
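In ADF terms, "running a notebook" means a pipeline containing a Databricks Notebook activity that points at a Databricks linked service. The series wires this up in ADF itself; purely as an illustration, here is a hedged Python sketch of the same idea using the Data Factory management SDK, with placeholder resource, factory, pipeline and linked service names:

```python
# Minimal sketch: defining an ADF pipeline with a Databricks Notebook activity and
# triggering an on-demand run via the Python management SDK. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

subscription_id = "<your-subscription-id>"  # placeholder
resource_group = "rg-databricks-dev"        # hypothetical
factory_name = "adf-databricks-demo"        # hypothetical

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# One activity that runs a notebook through an existing Databricks linked service.
run_notebook = DatabricksNotebookActivity(
    name="RunEtlNotebook",
    notebook_path="/Shared/etl/transform",  # hypothetical notebook path
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLinkedService"
    ),
)

client.pipelines.create_or_update(
    resource_group,
    factory_name,
    "pl-run-databricks-notebook",
    PipelineResource(activities=[run_notebook]),
)

# Trigger a one-off run; in practice you would schedule this with ADF triggers.
run = client.pipelines.create_run(resource_group, factory_name, "pl-run-databricks-notebook")
print(f"Started pipeline run: {run.run_id}")
```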