Driving Insights at Scale Using a Common Data Model (CDM) on Azure Data Lake

Why the Common Data Model is about to become the keystone of modern data insights architectures on Azure, and how to use it with Power BI dataflows together with Azure data services

Patrick Pichler
Creative Data

--

Common Data Model For Analytics (Image by Microsoft)

Introduction

To begin with, this article isn’t about comparing data warehouses, data lakes, data lakehouses, data oceans, data meshes or whatever other approach or name may exist. It is about the evolution of the Common Data Model (CDM) and why it is about to become the keystone of modern data insights architectures on Azure by closing some very important gaps¹.

For the overall understanding it is, however, necessary to mention that many of today’s data warehousing solutions still integrate relational databases. A very common pattern over the last years has been hybrid architectures, so-called Modern Data Warehouses, which leverage distributed technologies for preprocessing and staging data while relational databases serve the data and make it ready for analysis. The versatility and efficiency provided by new distributed technologies throughout the data integration process have been the main argument for using data lakes as a complement, especially for managing very large data warehouses that have become slow and inflexible over time.

On the other hand, driving proper reporting and critical business decisions entirely from data lakes has still been approached with caution, often for good reasons. For instance, it would be overkill to integrate a data lake into a data warehouse consisting of only small relational data; likewise, data lakes still lack some mature enterprise features that only relational databases provide. The technology is catching up rapidly, though, and technological advances are making data lakes increasingly attractive for analytics, so these feature gaps tend to disappear. Enough with this small excursion; let’s jump into the link between the CDM, Power BI dataflows and Azure data services, including the advances made over time.

Common Data Model

The CDM was initially rolled out in early 2016 as part of the introduction of PowerApps, with the goal of enabling data and application interoperability through unified and standardized data models. A CDM folder is lake-based, initially consisting of CSV files organized in one folder per table along with a single file called model.json that defines the metadata. This file acts as an entry point and enables applications and services to easily read and understand the actual data by providing semantic information such as table descriptions, column data types and data file locations. The CDM is highly integrated into most of Microsoft’s applications through different data synchronization and integration services, which are nowadays covered under the term Dataverse, formerly called Common Data Service (CDS)².
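To make the role of model.json more concrete, here is a minimal sketch in Python that reads such a file and prints the entities, columns and partition locations it describes. The field names follow the documented model.json format, while the folder path and entity contents are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical CDM folder, e.g. produced by a dataflow or sync service
cdm_folder = Path("/data/cdm/Sales")

# model.json is the single metadata entry point of a legacy CDM folder
with open(cdm_folder / "model.json", encoding="utf-8") as f:
    model = json.load(f)

print(f"Model: {model['name']}")

# Each entity describes one table: its schema and its data partitions
for entity in model.get("entities", []):
    print(f"\nEntity: {entity['name']}")
    for attribute in entity.get("attributes", []):
        print(f"  column {attribute['name']}: {attribute['dataType']}")
    for partition in entity.get("partitions", []):
        print(f"  partition file: {partition['location']}")
```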

In spring 2020, Microsoft expanded the Common Data Model metadata concept by introducing the backward-compatible manifest approach, making it much more flexible, scalable and performant. This manifest-oriented metadata concept is richer and allows for scenarios previously unavailable with model.json-oriented metadata. For instance³:

  • A Common Data Model can now contain more than one manifest and even sub-manifest files (*.manifest.cdm.json) describing different sets of tables and relationships, enabling different views of the data instead of one huge model.json file
  • A Common Data Model can now contain search patterns that describe partition files and dynamically discover data partition objects in collections, instead of requiring a specific URL for every new data partition added to the folder (see the sketch below)
  • A Common Data Model can now contain lists of known relationships expressed in the data through foreign key references aligning with the primary key values of other tables
  • A Common Data Model can now store data in the column-oriented Apache Parquet storage format in addition to the legacy CSV files, allowing much better partitioning, compression and query performance; the Delta format is also possible, please read on.

Model.json to Manifest (Image by Microsoft)
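The partition patterns in particular remove a lot of metadata maintenance. The following Python sketch only illustrates the idea behind pattern-based discovery; it is not the CDM SDK, and the folder layout and file naming are assumptions.

```python
from pathlib import Path

# Hypothetical folder of a 'Sales' entity with daily Parquet partitions
entity_root = Path("/data/cdm/Sales")

# model.json style: every partition is listed explicitly, so the metadata
# must be rewritten whenever a new file lands in the folder
explicit_partitions = [
    entity_root / "Sales-2021-04-10.parquet",
    entity_root / "Sales-2021-04-11.parquet",
]

# manifest style: a single glob-like pattern covers current and future
# partitions, so new files are discovered without touching the metadata
discovered_partitions = sorted(entity_root.glob("Sales-*.parquet"))

for partition in discovered_partitions:
    print(partition.name)
```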

Power BI dataflows

Power BI has always been an important part of the CDM sphere, as it enables a consolidated view of data across the enterprise. The introduction of Power BI dataflows two years later, in 2018, increased this importance even further, making Power BI a consumer and a producer at the same time. Since then, Power BI dataflows have enabled you to author data pipelines directly within the Power BI service through a Power Query Online interface, backed and executed by Apache Spark’s distributed in-memory processing engine. The result is written to a CDM-compliant folder structure (currently still the old-fashioned model.json format) in a data lake and lays the foundation for datasets, which constitute the data modelling layer in this setup, while dataflows constitute the data integration layer. This way, Power BI dataflows can meanwhile be used to build a complete star-schema data warehouse, including fact and dimension tables, in your own data lake storage mounted to the Power BI service. This gives you full control over the underlying CDM folders, which can be used for different needs and purposes. For analytical use cases, for example, data can either be imported into Power BI, leveraging the tabular engine for interactive analysis, or accessed via DirectQuery for real-time analysis, which currently requires a Premium license⁴.

Power BI dataflow (Image by Microsoft)
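Because a dataflow’s output is just a CDM folder in your own lake, any tool can pick it up. As a minimal sketch, assuming a dataflow has written headerless CSV partitions plus a model.json to a mounted Azure Data Lake path (all names below are hypothetical), the data could be loaded with pandas:

```python
import json

import pandas as pd

# Hypothetical mount point of the dataflow's CDM folder in ADLS Gen2
folder = "/mnt/datalake/powerbi/SalesWorkspace/SalesDataflow"

with open(f"{folder}/model.json", encoding="utf-8") as f:
    model = json.load(f)

# Pick one entity; its attributes provide the schema, since the CSV
# partitions written by dataflows carry no header row
entity = next(e for e in model["entities"] if e["name"] == "FactSales")
columns = [a["name"] for a in entity["attributes"]]

# Note: partition locations may be full ADLS URLs requiring credentials;
# locally mounted paths are assumed here for simplicity
frames = [
    pd.read_csv(p["location"], names=columns, header=None)
    for p in entity["partitions"]
]
fact_sales = pd.concat(frames, ignore_index=True)
print(fact_sales.head())
```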

To sum up, Power BI dataflows have brought much more flexibility to creating CDM folders, opening up many new possibilities, above all to break down data silos and unlock new insights through an enforced, standardized data model directly on the data lake. However, some parts are still missing before all the functionality usually found in data warehouses is available, such as historical analysis or the integration of other Azure data services, which so far were almost completely left out and could only act as consumers of the CDM.

Azure data services

That’s why a huge effort has been put into enabling data exchange and interoperability through the Common Data Model and Azure Data Lake Storage by using Power BI dataflows together with Azure data services, most of all to make landing data in a CDM-compliant format in the data lake as seamless as possible. This has been achieved through the development of a dedicated Apache Spark CDM connector. It is now possible not only to read CDM folders but actually to write to them from most of the Azure data services, such as Azure Data Factory (data flows), Azure Synapse Analytics and Azure Databricks; in fact, all Apache Spark-backed services. For instance, data can be ingested with Azure Data Factory from both cloud and on-premises sources and stored along with a CDM metadata description in Azure Data Lake. This staging data can be linked, for example, to Power BI staging dataflows. Azure Databricks can then be used for complex data transformation, cleaning and preparation tasks that can’t be done in Power BI dataflows, and the result can be linked to Power BI transformation dataflows providing a dimensional data model for final analysis or for further advanced analytics use cases. The new Apache Spark CDM connector also brings a cost-saving benefit, as it eliminates the need for an Apache Spark/Databricks cluster and its virtual tables on top of the data lake, which so far had served as a kind of bridge so that Power BI could read all the produced data properly⁴.
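As a rough sketch of what this looks like from a Databricks or Synapse notebook, assuming the Spark CDM connector library is attached and using hypothetical storage account, container and entity names (option names follow the connector’s documentation, but verify them against the version you use; authentication options are omitted):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Some transformed result, e.g. a cleaned dimension table
customers = spark.createDataFrame(
    [(1, "Contoso"), (2, "Fabrikam")],
    ["CustomerId", "CustomerName"],
)

# Write the DataFrame as a CDM entity; the connector creates or updates
# the manifest and stores the data as Parquet in ADLS Gen2
(customers.write
    .format("com.microsoft.cdm")
    .option("storage", "mystorageaccount.dfs.core.windows.net")
    .option("manifestPath", "cdm-container/Sales/default.manifest.cdm.json")
    .option("entity", "Customer")
    .option("format", "parquet")
    .mode("overwrite")
    .save())

# Reading the entity back works the same way
df = (spark.read
    .format("com.microsoft.cdm")
    .option("storage", "mystorageaccount.dfs.core.windows.net")
    .option("manifestPath", "cdm-container/Sales/default.manifest.cdm.json")
    .option("entity", "Customer")
    .load())
df.show()
```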

Now it gets even more exciting: thanks to another new dedicated CDM Delta Lake connector, the CDM now even supports the Delta Lake storage format, including most of the capabilities provided by the Databricks ecosystem and the so-called Data Lakehouse architecture, such as combining streaming and batch data, ACID transactions, time travel, schema evolution, etc. It can hence combine the best of both worlds, which makes it a perfect fit for modern data insights architectures⁵.

Common Data Model For Analytics (Image by Microsoft)
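To make the lakehouse capabilities mentioned above a bit more tangible, here is a small PySpark sketch of Delta Lake’s ACID writes and time travel. It uses the standard Delta Lake API rather than the ADF connector itself, and the table location is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/mnt/datalake/delta/sales"  # hypothetical Delta table location

# Batch write: an atomic, ACID-compliant transaction
orders = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["OrderId", "Amount"])
orders.write.format("delta").mode("overwrite").save(path)

# Appending later creates a new table version instead of mutating files
more_orders = spark.createDataFrame([(3, 75.0)], ["OrderId", "Amount"])
more_orders.write.format("delta").mode("append").save(path)

# Time travel: query the table as it looked at an earlier version
first_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
first_version.show()
```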

Conclusion

In the early days of adopting or complementing data lakes for data warehousing, valuable metadata from databases, such as data types, constraints or foreign key relationships, was lost when writing data to data lakes. This was often referred to as “de-relationalizing” data, and it took several years to slowly shed this reputation by increasingly catching up with many mature relational database features or even introducing new functionality on top. There is still a way to go, however, and it may also take some time to get used to working with data lakes and the way data is stored there. Nonetheless, the alignment of these services and technologies provides very exciting capabilities, and I look forward to seeing the combination of CDM and Data Lakehouse evolve further. Let’s see where it may lead in the coming months: standardized data lakehouses, maybe?

[1]: Darryll Petrancuri. 2020. Exciting Common Data Model Changes: Big Benefits for Data Insights Architectures. [ONLINE] Available at: https://www.linkedin.com/pulse/exciting-cdm-entity-changes-microsoft-power-platform-petrancuri/. [Accessed 12 April 2021].

[2]: Microsoft. 2020. Metadata format — Common Data Model. [ONLINE] Available at: https://docs.microsoft.com/en-us/common-data-model/model-json. [Accessed 12 April 2021].

[3]: Microsoft. 2020. Introducing Common Data Model metadata manifest — Common Data Model. [ONLINE] Available at: https://docs.microsoft.com/en-us/common-data-model/cdm-manifest. [Accessed 12 April 2021].

[4]: Microsoft Azure. 2020. New Common Data Model connector for Apache Spark in Azure Synapse Analytics & Azure Databricks (in preview). [ONLINE] Available at: https://azure.microsoft.com/en-us/updates/new-common-data-model-connector-for-apache-spark-in-azure-synapse-azure-databricks/. [Accessed 12 April 2021].

[5]: Microsoft Tech Community. 2020. ADF Data Flow Updates: CDM and Delta Lake Connectors Generally Available. [ONLINE] Available at: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-data-flow-updates-cdm-and-delta-lake-connectors-generally/ba-p/1968614. [Accessed 12 April 2021].
