Handling Infrastructure in Azure: Shared vs Data-Product


Introduction

This article explores the innovative notion of Shared Infrastructure within the Azure ecosystem. Shared Infrastructure is not merely a set of resources; it's a carefully designed approach to resource management that promotes efficiency, reduces redundancy, and ensures a unified foundation for Enterprise-Scale Analytics Data-Products.


Setting the stage

In the context of software development in Azure:

Infrastructure refers to the underlying resources and services required to support software applications and workloads; in Azure, defining it involves configuring and managing a wide range of resources and services using various tools.

Let us now consider a simple ETL workload "A" where we need to fetch data from a source (Extract), perform cleansing and cross-referencing operations (Transform) and save the output data into a target (Load)... what type of infrastructure would be needed to support this workload in Azure? Most likely, we will need an Azure Data Factory to orchestrate the process, a logical SQL server with a database to store the output, and a Key Vault to store sensitive information, correct?
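
To make this concrete, below is a minimal sketch of what provisioning that infrastructure could look like with the Python Azure management SDKs; the subscription, resource group, region and resource names are illustrative assumptions, not prescriptions.

```python
# Minimal sketch: provisioning the core infrastructure for ETL workload "A".
# Requires: pip install azure-identity azure-mgmt-datafactory azure-mgmt-sql azure-mgmt-keyvault
# All names below (resource group, region, resources) are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.keyvault import KeyVaultManagementClient
from azure.mgmt.sql import SqlManagementClient

SUBSCRIPTION_ID = "<subscription-id>"       # placeholder
RG, LOCATION = "rg-workload-a-d", "eastus"  # hypothetical resource group / region
cred = DefaultAzureCredential()

# 1. Data Factory to orchestrate the Extract-Transform-Load process
adf = DataFactoryManagementClient(cred, SUBSCRIPTION_ID)
adf.factories.create_or_update(RG, "adf-workload-a-d", {"location": LOCATION})

# 2. Logical SQL server plus a database to store the output (Load target)
sql = SqlManagementClient(cred, SUBSCRIPTION_ID)
sql.servers.begin_create_or_update(
    RG, "sql-workload-a-d",
    {"location": LOCATION,
     "administrator_login": "sqladmin",                  # demo only; prefer Entra ID auth
     "administrator_login_password": "<fetch-securely>"},
).result()
sql.databases.begin_create_or_update(
    RG, "sql-workload-a-d", "sqldb-workload-a",
    {"location": LOCATION, "sku": {"name": "S0", "tier": "Standard"}},
).result()

# 3. Key Vault to store sensitive information (connection strings, credentials)
kv = KeyVaultManagementClient(cred, SUBSCRIPTION_ID)
kv.vaults.begin_create_or_update(
    RG, "kv-workload-a-d",
    {"location": LOCATION,
     "properties": {"tenant_id": "<tenant-id>",
                    "sku": {"family": "A", "name": "standard"},
                    "enable_rbac_authorization": True,
                    "access_policies": []}},
).result()
```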

Great! We create the Azure infrastructure to support workload "A" and the development work begins. Shortly thereafter, we need to accommodate another ETL workload "B" meant to do something similar to "A" but not exactly the same, and the question becomes: do we need to create a whole new data factory and logical SQL server for workload "B"? Isn't one already in place for workload "A"?

Azure Data Factory can host hundreds of pipelines and run up to 50 in parallel (depending on the service tier), and a single logical SQL server can host up to 5,000 databases!

So, wouldn't it be better for workloads "A" and "B" to share these? I'll commit myself to answering with an emphatic Yes for the great majority of cases, which leads to the concept of having a "Shared Infrastructure".


Shared Infrastructure

So, what is this "Shared Infrastructure" concept? It can be defined as a group of resources that are meant to be unique per deployment environment and that constitute a reasonable foundation to support most of today's Enterprise-Scale Analytics Data-Products [What is a data product?]; it should be designed to be reusable across different deployment environments, promoting consistency and efficiency in managing resources.

  • There should be an owner (a person or group) of this set of Shared Resources across all deployment environments, who will also be responsible for adding new resources if deemed necessary
  • The owner will define all applicable policies, naming conventions and restrictions
  • Each Data-Product is a "Tenant" within our Shared Infrastructure and is therefore obligated to adhere to the applicable policies, naming conventions and restrictions (see the sketch after this list)
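
To illustrate that last point, here is a tiny sketch of how a tenant-facing naming convention could be checked programmatically; the "<resource-type>-<data-product>-<env>" pattern is a hypothetical convention of my own for this example (the environment codes are explained in the next section).

```python
import re

# Hypothetical tenant naming convention: <resource-type>-<data-product>-<env>,
# e.g. "sqldb-sales-d" (environment codes are covered in the next section).
NAME_PATTERN = re.compile(r"^(kv|sql|sqldb|st|adf|dbw)-[a-z0-9]+-(d|t|ua|p)$")

def validate_resource_name(name: str) -> bool:
    """Return True if a tenant's resource name follows the shared convention."""
    return NAME_PATTERN.match(name) is not None

assert validate_resource_name("sqldb-sales-d")   # accepted: typed, scoped, env-tagged
assert not validate_resource_name("mydatabase")  # rejected: no type or environment segment
```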

Deployment environments

In the context of this article, an Azure Resource Group (RG) will constitute a Deployment Environment; this means there should be ONLY ONE of each resource listed in the Shared Resources section per Resource Group.

The number and nature of deployment environments won't be discussed in this article; nonetheless, I strongly recommend using consistent naming conventions... in my case, I normally use a short one- or two-character code as follows: d = development, t = integration testing, ua = user acceptance testing (aka pre-production) and p = production.
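
Here is a minimal sketch, using the Python Azure management SDK, of stamping out one Resource Group per deployment environment following that convention; the base name "analytics" and the region are assumptions for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
ENV_CODES = {"d": "development", "t": "integration testing",
             "ua": "user acceptance testing", "p": "production"}

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One Resource Group per deployment environment; the shared resources live inside it.
for code, description in ENV_CODES.items():
    client.resource_groups.create_or_update(
        f"rg-analytics-{code}",  # hypothetical base name "analytics"
        {"location": "eastus", "tags": {"environment": description}},
    )
```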

Shared Resources

For my Enterprise-Scale Analytics Data Products I selected the following Azure Resources to constitute my Shared Infrastructure:

  1. Key Vault. Make sure its permission model is set to Azure role-based access control (RBAC).
  2. Logical SQL server. The central administrative point for a collection of databases; I won't go as far as recommending disabling SQL Authentication, but do adhere to security best practices.
  3. Storage Account for Data Ingestion. A hierarchical-namespace-enabled storage account to receive raw/temp data for processing.
  4. Storage Account for Data Analytics. A hierarchical-namespace-enabled storage account to store long-retention/curated data for data analytics.
  5. Storage Account for Workload Support. A storage account without hierarchical namespace (plain blobs), mainly for supporting processes (e.g. Cloud Shell, logs, diagnostics, backups, etc.)
  6. Databricks. Preferably configured to provision identities to your Azure Databricks account from Microsoft Entra ID (formerly Azure Active Directory); this can be done through System for Cross-Domain Identity Management (SCIM) [Configure SCIM provisioning using Microsoft Entra ID].
  7. Azure Data Factory (Shared Components). According to MSFT documentation [What is Azure Data Factory?], ADF's key top-level components are pipelines, datasets, linked services, data flows and integration runtimes... but I consider private endpoints and global parameters to be two additional ones, and I divide them all into two categories: shared and data-product (shocking, isn't it!)
    1. Shared. Integration runtimes, linked services, managed virtual network (if any) and configuration/global parameters
    2. Data-Product. Datasets, pipelines, data flows and triggers
Why? In one word... reusability! This way I ensure linked services are generic through parameters and, ergo, reusable across workloads! I also ensure consistency and the correct pricing tier of the integration runtimes, and expose only the global parameters that ought to be shared.
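
To illustrate the reusability point, here is a sketch of how a generic, parameterized linked service could be created through the Python SDK; the factory, server and parameter names are illustrative assumptions, and the connection string is deliberately simplified.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService,
    LinkedServiceResource,
    ParameterSpecification,
)

SUBSCRIPTION_ID = "<subscription-id>"              # placeholder
RG, FACTORY = "rg-analytics-d", "adf-analytics-d"  # hypothetical names

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One generic linked service for the shared logical SQL server: each data product
# supplies its own database name at runtime, so a single shared definition
# serves every workload instead of one linked service per database.
generic_sql = AzureSqlDatabaseLinkedService(
    parameters={"databaseName": ParameterSpecification(type="String")},
    connection_string=(
        "Server=tcp:sql-analytics-d.database.windows.net,1433;"
        "Database=@{linkedService().databaseName};"  # resolved per invocation
    ),
)
adf.linked_services.create_or_update(
    RG, FACTORY, "ls_sqldb_generic", LinkedServiceResource(properties=generic_sql)
)
```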

Data-Product Infrastructure

As described in the MSFT article What is a data product?:

Data products are created specifically for analytical consumption. They have defined and agreed-upon shapes, consumption interfaces, and maintenance and refresh cycles, all of which are documented.

Services

As mentioned, each Data-Product will be a tenant of our shared infrastructure, entitled to use its services as follows:

  1. Secrets, Keys & Certificates from Key Vault. Azure RBAC for Key Vault also allows users to have separate permissions on individual keys, secrets and certificates.
  2. Azure SQL Database hosted on the logical SQL server. This reduces overall administrative overhead (see the sketch after this list).
  3. Containers, File Shares, Queues & Tables. Potentially from any of the shared storage accounts (Ingestion, Analytics & Support).
  4. Databricks clusters. Both job and all-purpose clusters will be data-product specific, and each data product will define its own capacity requirements.
  5. Azure Data Factory (Data Product Components). Unlike other resources where there is a clear boundary per tenant (e.g. a database in a SQL server or a container within a storage account), in ADF all components will be separated "logically" using naming conventions.
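
As a sketch of this tenancy in practice, the snippet below onboards a hypothetical data product "sales": it gets its own database on the shared logical SQL server and reads its own secret from the shared Key Vault. All names are illustrative assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.mgmt.sql import SqlManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RG = "rg-analytics-d"                  # hypothetical dev environment
cred = DefaultAzureCredential()

# Tenant boundary in SQL: the data product gets its own database
# on the shared logical SQL server.
sql = SqlManagementClient(cred, SUBSCRIPTION_ID)
sql.databases.begin_create_or_update(
    RG, "sql-analytics-d", "sqldb-sales-d",
    {"location": "eastus", "sku": {"name": "S0", "tier": "Standard"}},
).result()

# Tenant boundary in Key Vault: Azure RBAC can scope access down to a single secret.
secrets = SecretClient("https://kv-analytics-d.vault.azure.net", cred)
conn = secrets.get_secret("sales-source-connstring")  # hypothetical secret name
```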

Data-Product Resources

It could potentially be anything that is not part of our Shared Resources! The following are the ones I usually see as part of Data-Product solutions: Azure SQL Databases, Azure Web Apps & Functions, and Virtual Machines.


Conclusion

Overall, the article underscores the importance of planning and structuring Azure resources effectively to maximize efficiency, consistency, and reusability across different deployment environments. The concept of Shared Infrastructure aligns with best practices for managing resources in complex software development scenarios, where maintaining a balance between flexibility and resource management is crucial.


Call to Action

By embracing the Shared Infrastructure approach, you can simplify resource management, reduce complexity, and promote efficiency in your Azure software development projects. Start by identifying opportunities for shared resources and gradually implement this model to achieve greater consistency and reusability across your deployment environments.

Here are some actionable steps you can take:

  1. Evaluate Your Workloads: Assess your current and upcoming workloads to identify opportunities for shared resources. Look for patterns and similarities between different projects that can benefit from the Shared Infrastructure model.
  2. Define Naming Conventions: Establish clear and consistent naming conventions for your Azure Resource Groups (RGs) to represent different deployment environments. This will help you manage resources effectively and distinguish between environments.
  3. Select Shared Resources: Choose the appropriate shared resources for your Shared Infrastructure and ensure these resources are configured for reusability.
  4. Ownership and Policies: Appoint an owner or group responsible for managing the Shared Infrastructure across all deployment environments. Define policies, naming conventions, and restrictions to ensure consistent resource management.
  5. Implement Shared Infrastructure: Incorporate the Shared Infrastructure model into your Azure projects. Make use of shared services and resources whenever possible to reduce administrative overhead and enhance resource consistency.
  6. Continuously Review and Optimize: Regularly assess the effectiveness of your Shared Infrastructure model. Adapt it to evolving project requirements and expand its usage to maximize efficiency.