Orchestrate your Notebooks via Azure Data Factory

ADF simplifies the orchestration of Azure Databricks notebooks, streamlining your data workflows and ensuring efficient data processing.


All good things must come to an end! This is part 4 (and the last) of the series...

Streamlining Databricks: CI/CD your Notebooks with DevOps Pipelines and orchestrate via Azure Data Factory (Series)
In this series I’m going to show you how to provision your Databricks infrastructure with BICEP and how to connect your workspace to Azure’s Entra ID to manage users & groups. Furthermore, I’ll show you how to deploy your notebooks across environments with YAML pipelines and orchestrate them with ADF.

Setting the stage

A discussion on whether or not to use Azure Data Factory and Databricks together is like debating whether to use a hammer or a screwdriver: they are simply different tools, each with unique capabilities, and you will achieve a superior result by playing to the strengths of both.

While both services offer connectivity to various data sources, Databricks is more versatile for analytics and machine learning workloads, whereas ADF excels in data movement and integration tasks. The goal, then, is to ...

🤓
Create a framework to streamline the interaction between Databricks and ADF, enabling efficient data processing, orchestration, and automation within your data workflows.

I have been publishing several articles that support the solution presented below; in particular, I recommend taking a look at Automating ADF's CI/CD with Selective Deployment (Generation 2) and, from my previous series...

Selective deployment of Azure Data Factory (ADF) components (Series)
Giving ADF the ability to selectively deploy components will significantly enhance the flexibility and efficiency of your data integration processes by allowing you to deploy specific components based on your business needs.

Solution

In the next part of the article, I'll explain only the most relevant parts of the code.

👨‍💻
REMEMBER TO DOWNLOAD THE CODE!!!! from this AzDO Git repository streamlining-databricks-with-devops-and-adf 😁🤞 and follow along.

As shown in Provision your Workspace infrastructure with BICEP, we output the workspace's Azure Resource ID and URL.

We are going to use this information to create parameterized ADF linked services and global parameters, using a slightly modified version of Automating ADF's CI/CD with Selective Deployment on Shared instances (Generation 3), where I create a "tokenized" version of my configuration files.

"This files are use for replacing all properties environment-related, hence, its expected to be one per environment" (ref. ADF Selective Deployment G2)

Parameterized linked services

I think it is a best practice for any ADF setup to create parameterized linked services; following this best practice, I create one for a New job cluster and another for an Existing interactive cluster, as shown below.

Additional information on these two options:

  • New job cluster. Designed for running specific jobs or tasks within your Databricks workspace; ideal for isolated, short-lived workloads. It incurs cost only during job execution.
  • Existing interactive cluster. Intended for interactive data exploration and collaboration using Databricks notebooks; use it when you want to reuse a cluster that’s already running. It incurs cost for as long as the cluster is running.
💡
Remember that using an existing cluster can be more cost-effective, especially for development purposes. However, be mindful of the costs associated with keeping an interactive cluster running continuously.
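To make the idea concrete, below is a minimal sketch of what the New job cluster linked service could look like. It assumes managed identity (MSI) authentication; the parameter prm_workspace_url matches the one used later in the walkthrough, while prm_workspace_resource_id, the runtime version and the node type are illustrative assumptions, so take the actual definitions from the repository. The Existing interactive cluster variant would carry an existingClusterId in typeProperties instead of the newCluster* settings.

```json
{
  "name": "ls_databricks_newjobcluster",
  "properties": {
    "type": "AzureDatabricks",
    "parameters": {
      "prm_workspace_url": { "type": "string" },
      "prm_workspace_resource_id": { "type": "string" }
    },
    "typeProperties": {
      "domain": "@{linkedService().prm_workspace_url}",
      "authentication": "MSI",
      "workspaceResourceId": "@{linkedService().prm_workspace_resource_id}",
      "newClusterVersion": "13.3.x-scala2.12",
      "newClusterNodeType": "Standard_DS3_v2",
      "newClusterNumOfWorker": "1"
    }
  }
}
```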

Parameterized Global parameters

From MSFT on Global parameters in Azure Data Factory

"[...] are constants across a data factory that can be consumed by a pipeline in any expression. [...] When promoting a data factory using the continuous integration and deployment process (CI/CD), you can override these parameters in each environment."

The idea is to make our shared Databricks workspace available as easily and consistently as possible across environments.
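As a sketch of the end state (the gp_* names and the placeholder values are mine, not necessarily the repo's), the factory would end up exposing global parameters along these lines, ready to be consumed from any pipeline expression:

```json
{
  "name": "<your-data-factory-name>",
  "properties": {
    "globalParameters": {
      "gp_databricks_workspace_url": {
        "type": "string",
        "value": "adb-<workspace-id>.<n>.azuredatabricks.net"
      },
      "gp_databricks_workspace_resource_id": {
        "type": "string",
        "value": "/subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Databricks/workspaces/<workspace-name>"
      }
    }
  }
}
```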

Tokenizing selective deployment Config files

As mentioned earlier, we are going to use a "tokenized" version of our selective deployment config files together with the Replace Tokens extension by Guillaume Rouchon to inject the values generated in Provision your Workspace infrastructure with BICEP.

I'm using default token's prefix #{ and suffix }# as documented by the extension.

Replacing tokenized values

Output values from the IaC stage are stored in the environment's Key Vault (1) to be re-used. In the deployment stage (2), we first call the AzureKeyVault@1 task to make all the secrets listed in the filter available as variables (3); we then copy our tokenized config files to the Build.ArtifactStagingDirectory (4), where their tokens are replaced with the values fetched from Key Vault by the Replace Tokens task (5).
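The corresponding steps look roughly like this. It's a minimal sketch assuming version 1 of the Key Vault task and version 5 of the Replace Tokens task; the service connection, Key Vault name, secret names and folder paths are placeholders, so take the exact values from the YAML in the repo.

```yaml
steps:
  # (3) Expose the IaC outputs stored as Key Vault secrets as pipeline variables
  - task: AzureKeyVault@1
    inputs:
      azureSubscription: '$(serviceConnection)'   # assumed variable
      KeyVaultName: '$(keyVaultName)'             # assumed variable
      SecretsFilter: 'databricks-workspace-url,databricks-workspace-resource-id'

  # (4) Copy the tokenized config files to the staging directory, leaving the originals untouched
  - task: CopyFiles@2
    inputs:
      SourceFolder: '$(Build.SourcesDirectory)/devops/config'   # assumed path
      Contents: '**/*.json'
      TargetFolder: '$(Build.ArtifactStagingDirectory)/config'

  # (5) Replace the #{...}# tokens (the extension's default pattern) with the fetched values
  - task: replacetokens@5
    inputs:
      rootDirectory: '$(Build.ArtifactStagingDirectory)/config'
      targetFiles: '**/*.json'
```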

Finally, we deploy the linked services and global parameters grouped as ADF's Core Infra (adfcoreinfra).

I explain more about this concept in my article Handling Infrastructure in Azure: Shared vs Data-Product. In a nutshell, ADF's key top-level components are Pipelines, Datasets, Linked services, Data flows, Integration Runtimes... I divide them into two categories:

  • Shared (aka adfcoreinfra). Integration runtimes, Linked services, Managed virtual network (if any) and Configuration/Global variables
  • Data-Product. Datasets, Pipelines, Dataflows and Triggers

Execute Core Infrastructure pipeline

Get the code from AzDO Git repository streamlining-databricks-with-devops-and-adf and deploy the whole infrastructure linked together 😁🤞!

👨‍💻
Create your AzDO pipeline from the YAML file located at devops/pipelines/coreinfra/coreinfra-cd-infrascode.yml.

Finally, let's call some Notebooks!

Assuming that you followed the series and Deployed Notebooks across environments with DevOps YAML Pipelines, I'll jump right to Azure Data Factory's Live Mode (1) and, from Factory Resources, create a new pipeline (2).

Add a new activity to your pipeline, use the Databricks Notebook activity (1) and go to the Azure Databricks tab (2).

Place yourself in the linked service property prm_workspace_url (1) and click Add dynamic content (2).

Then select the global parameter corresponding to the Databricks workspace URL.

Repeat the operation, associating each linked service parameter with its corresponding global parameter.

Configure the Databricks Notebook activity settings (1) with the path of any existing Notebook (2) deployed to the Workspace.
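Behind the UI clicks, the resulting activity boils down to JSON along these lines. This is only a sketch: the activity name, linked service name, parameter and global parameter names, and the notebook path are placeholders for whatever you created and deployed.

```json
{
  "name": "Run Databricks Notebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "ls_databricks_newjobcluster",
    "type": "LinkedServiceReference",
    "parameters": {
      "prm_workspace_url": "@pipeline().globalParameters.gp_databricks_workspace_url",
      "prm_workspace_resource_id": "@pipeline().globalParameters.gp_databricks_workspace_resource_id"
    }
  },
  "typeProperties": {
    "notebookPath": "/Shared/my_sample_notebook"
  }
}
```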

As expected, running the pipeline will create a Job Compute in the Databricks Workspace 😎👍

😁
From this point forward, it's all about using Azure Data Factory's orchestration capabilities to support your big data processing, analytics, and machine learning workloads in Databricks!!

Call to action

I explained the code; now it's your turn to ...

🧑‍💻
DOWNLOAD THE CODE!!!! from this AzDO Git repository streamlining-databricks-with-devops-and-adf 😁🤞 and follow along.

I'll create a bonus article to show you how to call Databricks jobs; don't miss it and subscribe 😉