Deploying Notebooks across environments with DevOps YAML Pipelines
Learn how to use DevOps YAML pipelines to create artifacts from collections of Python files and effortlessly deploy them to Databricks workspaces. Gain insights into optimizing your workflow for efficiency and reliability, ensuring smooth transitions between environments
Welcome back! This is Part 3 of the series ...
Setting the stage
I'm assuming that most readers interested in this article are already familiar with Azure Databricks and, from a development perspective, with the difference between writing notebooks in the Workspace vs. in Repos. Nonetheless, allow me to dive into the differences:
- Databricks Workspaces: A workspace in Databricks is an environment for accessing all of your Databricks assets; it organizes those assets into folders and provides access to data objects and computational resources
- Databricks Repos: Repos are Databricks folders whose contents are co-versioned by syncing them to a remote Git repository; with a Databricks repo, you can develop notebooks in Databricks while using the remote Git repository for collaboration and version control
So, do we need both? And if so, why? The answer is yes, because of the branching strategy I chose and explained in the article Resilient AzDO YAML Pipeline - Modern software development practices.
The codebase in this context is Databricks notebooks: their content (e.g. Python scripts) is version-controlled through Databricks Repos and eventually deployed from development branches to Databricks Workspaces in production-like environments.
Solution
In the next part of the article, I'll explain only the most relevant parts of the code.
Once again, I'm going to use the same stages and structure as the Resilient Azure DevOps YAML Pipeline.
Create the artifacts
In this stage of the deployment, we select and copy the files checked in on the YAML pipeline's context branch (1) (read more on steps.checkout: self) to be published as a named artifact.
Why the Copy Files task (2)? The reason is to filter only the file types accepted by Databricks; for instance, text files (*.txt) are not allowed in Databricks. We then publish the artifact databricksNotebooks (3).
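To make this concrete, here is a minimal sketch of what this stage could look like; the source folder, file patterns and task versions are assumptions, not the exact values from my repository:

```yaml
# Hypothetical sketch of the artifact stage; adjust folders and patterns to your repo.
stages:
- stage: Build
  jobs:
  - job: CreateArtifact
    steps:
    # (1) Check out the branch that triggered the pipeline
    - checkout: self

    # (2) Copy only file types Databricks accepts (e.g. *.txt is excluded)
    - task: CopyFiles@2
      inputs:
        SourceFolder: '$(Build.SourcesDirectory)/notebooks'   # assumed folder
        Contents: |
          **/*.py
          **/*.sql
          **/*.scala
          **/*.r
        TargetFolder: '$(Build.ArtifactStagingDirectory)'

    # (3) Publish the filtered files as the named artifact
    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: '$(Build.ArtifactStagingDirectory)'
        artifact: 'databricksNotebooks'
```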
Getting the agent ready
The following tasks prepare the agent to interact with a Databricks workspace.
First, we create an access token (1), then retrieve the workspace's URL (2), and use this information to create a [DEFAULT] configuration profile in the .databrickscfg file (3) so the rest of the commands can run quickly.
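As a rough sketch, the three steps can be bundled into a single Azure CLI task; the service connection name and the resourceGroupName/workspaceName variables below are assumptions:

```yaml
# Hypothetical sketch: the service connection and the resourceGroupName /
# workspaceName pipeline variables are assumptions.
- task: AzureCLI@2
  displayName: 'Configure Databricks CLI profile'
  inputs:
    azureSubscription: 'my-azure-service-connection'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      # (1) Get an AAD access token for the Azure Databricks resource
      #     (2ff814a6-... is the well-known Azure Databricks application ID)
      token=$(az account get-access-token \
        --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
        --query accessToken -o tsv)

      # (2) Retrieve the workspace URL (requires the 'databricks' az extension)
      url=$(az databricks workspace show \
        --resource-group "$(resourceGroupName)" \
        --name "$(workspaceName)" \
        --query workspaceUrl -o tsv)

      # (3) Write the [DEFAULT] profile used by the subsequent CLI commands
      cat > ~/.databrickscfg <<EOF
      [DEFAULT]
      host = https://$url
      token = $token
      EOF
```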
Deploy notebooks to workspace
We finally proceed with the actual deployment of the notebooks!
I'm basically using a truncate & load technique: I remove the old workspace and DBFS folders (1) and then import a new set of folders and content from our previously created artifact into the target folder path (2).
Some tasks are optional but commonly included in the pipelines I've seen, like deploying configuration files to DBFS (3) or downloading drivers and other dependencies (4).
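Put together, the deployment steps could look roughly like the sketch below; it assumes the legacy Databricks CLI (the databricks-cli pip package) configured with the profile from the previous stage, and the target paths are purely illustrative:

```yaml
# Hypothetical sketch using the legacy Databricks CLI; all paths are illustrative.
- download: current
  artifact: databricksNotebooks

- script: |
    # (1) Truncate: remove the old workspace folder and DBFS folder
    #     ('|| true' keeps the first deployment from failing when nothing exists yet)
    databricks workspace rm -r "/Shared/my-project" || true
    databricks fs rm -r "dbfs:/FileStore/my-project" || true

    # (2) Load: import the artifact's notebooks into the target folder path
    databricks workspace import_dir -o \
      "$(Pipeline.Workspace)/databricksNotebooks" "/Shared/my-project"

    # (3) Optional: deploy configuration files to DBFS
    databricks fs cp -r --overwrite \
      "$(Pipeline.Workspace)/databricksNotebooks/config" "dbfs:/FileStore/my-project/config"

    # (4) Optional: upload drivers or other dependencies
    databricks fs cp --overwrite \
      "$(Pipeline.Workspace)/drivers/my-driver.jar" "dbfs:/FileStore/jars/my-driver.jar"
  displayName: 'Deploy notebooks to workspace'
```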
Outcome
As expected, we can see our notebooks deployed to the workspace...
and the drivers in DBFS.
Execute the deployment
I've explained the code; now it's your turn ...
Then, check out the video at the end of the article below, where I show you how to create the Azure DevOps pipeline.
Conclusion
So, why not just connect the main branch directly to a production-like environment? Some organizations choose this design pattern; however, there are potential risks associated with this practice. For instance, if a bug is introduced into the main branch, it could immediately affect the production environment.
In conclusion, whether or not to connect a repository branch directly to a production environment is a decision that should be made based on your team's specific needs, workflows, and risk tolerance.