Streamlining Databricks: CI/CD your Notebooks with DevOps Pipelines and orchestrate via Azure Data Factory (Series)

In this series I'm going to show you how to provision your Databricks infrastructure with Bicep and how to connect your workspace to Azure's Entra ID to manage users & groups. Furthermore, I'll show you how to deploy your notebooks across environments with YAML pipelines and orchestrate them with ADF.

Introduction

I am thrilled to announce my participation as a speaker at the upcoming Databricks Oslo User Group MeetUp on March 5th, 2024, hosted at Microsoft Norway. With just one hour to cover an extensive topic, I've devised a solution: a comprehensive article series. This series dives deep into the intricacies of Databricks, focusing on efficient workspace provisioning, user management, notebook deployment with DevOps pipelines, and orchestration via Azure Data Factory. MeetUp attendees and enthusiasts alike will benefit from detailed explanations and accompanying code, ensuring a thorough understanding of these essential concepts.


Where to start?

Is this the first time you've come across any of my articles, or are you new to these topics? If so, I'd strongly suggest you take a look at these two articles:

Infrastructure-as-code made easy with Bicep language
In this article, we will explore what Bicep is, how it works, and how it can help you simplify and automate your Azure deployments.
Resilient Azure DevOps YAML Pipeline
Embark on a journey from Classic to YAML pipelines in Azure DevOps with me. I transitioned, faced challenges, and found it all worthwhile. Follow to learn practical configurations and detailed insights, bypassing the debate on YAML vs Classic. Your guide in mastering Azure DevOps.

If not, I'll assume you don't need any introduction to these topics, as I'll dive almost right into the code 🤞


Part 1. Provision your Workspace infrastructure with Bicep

Discover the power of Bicep in provisioning a robust Databricks workspace infrastructure, including Azure Data Factory, Key Vault and a Storage account. I'll use some PowerShell to configure the necessary RBAC permissions between resources for enhanced efficiency and scalability in data management and processing... plus tips & tricks as usual 😉
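To give a flavour of what this part covers, here is a minimal PowerShell sketch of the deployment flow. The template file, resource group and resource names are assumptions for illustration; the article itself will walk through the full Bicep template and a more complete RBAC setup.

```powershell
# Sketch only: deploy a (hypothetical) main.bicep template and wire up one RBAC assignment.
# Assumes the Az PowerShell module and the Bicep CLI are installed and you are signed in (Connect-AzAccount).

$resourceGroup = 'rg-databricks-demo'   # assumed name
New-AzResourceGroup -Name $resourceGroup -Location 'norwayeast' -Force

# Deploy the Bicep template declaring the Databricks workspace, Data Factory, Key Vault and Storage account
New-AzResourceGroupDeployment `
    -ResourceGroupName $resourceGroup `
    -TemplateFile './main.bicep' `
    -TemplateParameterFile './main.parameters.json'

# Grant the Data Factory's system-assigned managed identity access to the Storage account
$adf     = Get-AzDataFactoryV2 -ResourceGroupName $resourceGroup -Name 'adf-databricks-demo'   # assumed name
$storage = Get-AzStorageAccount -ResourceGroupName $resourceGroup -Name 'stdatabricksdemo'     # assumed name

New-AzRoleAssignment `
    -ObjectId $adf.Identity.PrincipalId `
    -RoleDefinitionName 'Storage Blob Data Contributor' `
    -Scope $storage.Id
```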

Part 2. Manage Users with Azure's Entra ID

Did you know that users with the Contributor role on the Resource Group automatically become administrators of any deployed Databricks workspace the moment they log in? Needless to say, this is not good practice from a security or governance perspective. The solution is to configure the SCIM provisioning connector and use Azure's Entra ID security groups to separate users from administrators; I'll show you how in this part!
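As a taste of the approach, here is a hedged PowerShell sketch that creates the two Entra ID security groups SCIM would later sync into the workspace. The group names and user principal names are assumptions; the SCIM provisioning connector itself is configured in the Entra ID portal, which the article will cover step by step.

```powershell
# Sketch only: create the Entra ID security groups that SCIM provisioning will push into the workspace.
# Group names and UPNs are assumptions for illustration.

$workspaceName = 'dbw-databricks-demo'   # assumed workspace name

$adminsGroup = New-AzADGroup `
    -DisplayName "sg-$workspaceName-admins" `
    -MailNickname "sg-$workspaceName-admins" `
    -Description 'Databricks workspace administrators'

$usersGroup = New-AzADGroup `
    -DisplayName "sg-$workspaceName-users" `
    -MailNickname "sg-$workspaceName-users" `
    -Description 'Databricks workspace users'

# Add members; SCIM provisioning then keeps the workspace in sync with these groups
Add-AzADGroupMember -TargetGroupObjectId $adminsGroup.Id -MemberUserPrincipalName 'admin@contoso.com'  # assumed UPN
Add-AzADGroupMember -TargetGroupObjectId $usersGroup.Id  -MemberUserPrincipalName 'dev@contoso.com'    # assumed UPN
```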

Part 3. Deploying Notebooks across environments with DevOps YAML Pipelines 

Learn to harness the power of DevOps YAML pipelines to create artifacts from collections of Python files and effortlessly deploy them to Databricks workspaces. Gain insights into optimizing your workflow for efficiency and reliability, ensuring smooth transitions between development, testing, and production environments.
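To preview the deployment step, here is a hedged PowerShell sketch of the logic a pipeline task could run to push the artifact's Python files into a workspace folder via the Databricks Workspace REST API. The environment variables, folder names and paths are assumptions; in the article this sits inside a YAML pipeline with proper stages and environments.

```powershell
# Sketch only: import the artifact's .py files into the workspace as source notebooks.
# Assumes $env:DATABRICKS_HOST (e.g. https://adb-xxxx.azuredatabricks.net) and $env:DATABRICKS_TOKEN
# are supplied by the pipeline, e.g. from Key Vault. Paths and folder names are assumptions.

$headers = @{ Authorization = "Bearer $($env:DATABRICKS_TOKEN)" }
$targetFolder = '/Shared/etl'   # assumed target folder in the workspace

# Make sure the target folder exists
Invoke-RestMethod -Method Post -Uri "$($env:DATABRICKS_HOST)/api/2.0/workspace/mkdirs" `
    -Headers $headers -ContentType 'application/json' `
    -Body (@{ path = $targetFolder } | ConvertTo-Json)

# Import every .py file from the extracted artifact, overwriting existing notebooks
Get-ChildItem -Path './notebooks' -Filter '*.py' | ForEach-Object {
    $body = @{
        path      = "$targetFolder/$($_.BaseName)"
        format    = 'SOURCE'
        language  = 'PYTHON'
        content   = [Convert]::ToBase64String([IO.File]::ReadAllBytes($_.FullName))
        overwrite = $true
    } | ConvertTo-Json

    Invoke-RestMethod -Method Post -Uri "$($env:DATABRICKS_HOST)/api/2.0/workspace/import" `
        -Headers $headers -ContentType 'application/json' -Body $body
}
```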

Part 4. Orchestrate your Notebooks via Azure Data Factory

Integrating Azure Databricks with Azure Data Factory (ADF) allows you to seamlessly orchestrate data workflows and execute Databricks notebooks within your data pipelines. This integration lets you leverage the strengths of both: Databricks provides scalable analytics and machine learning capabilities, while ADF complements it by enabling you to schedule, trigger, and manage the data movements and transformations. This unified approach simplifies development, monitoring, and maintenance.
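As a quick preview, the PowerShell sketch below triggers an ADF pipeline containing a Databricks Notebook activity and polls it until it completes. The factory, resource group, pipeline and parameter names are assumptions for illustration; the article will show the ADF pipeline definition and triggers in detail.

```powershell
# Sketch only: trigger an ADF pipeline that runs a Databricks notebook and wait for the result.
# Resource names and the pipeline parameter are assumptions.

$resourceGroup = 'rg-databricks-demo'
$dataFactory   = 'adf-databricks-demo'

$runId = Invoke-AzDataFactoryV2Pipeline `
    -ResourceGroupName $resourceGroup `
    -DataFactoryName $dataFactory `
    -PipelineName 'pl-run-etl-notebook' `
    -Parameter @{ notebookPath = '/Shared/etl/transform' }   # assumed pipeline parameter

# Poll the run until it finishes
do {
    Start-Sleep -Seconds 30
    $run = Get-AzDataFactoryV2PipelineRun `
        -ResourceGroupName $resourceGroup `
        -DataFactoryName $dataFactory `
        -PipelineRunId $runId
    Write-Host "Pipeline status: $($run.Status)"
} while ($run.Status -in 'Queued', 'InProgress')
```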