Data Integration With Azure Databricks

Brief introduction to Databricks including its deployment options and the different ways of integrating data into Delta Lake

Patrick Pichler
Creative Data

--

Delta Lake at Scale on Azure

Introduction

Even before Databricks, Apache Spark had quickly replaced Hadoop’s MapReduce programming model as the number-one processing technique for distributed computing and crunching huge amounts of data in (near) real time. It is optimized for speed and computational efficiency by keeping most of the data in memory rather than on disk, whereas the MapReduce approach is geared more towards long-running batch jobs. Databricks nowadays is no longer just Apache Spark: it is a fully managed end-to-end data analytics platform in the cloud with collaboration, analytics, security and machine learning features, a data lake, and data integration capabilities. That means that, apart from data processing, it also includes everything needed to integrate data. With that said, there is no essential need for any other service or tool to ingest data and design proper data integration pipelines. Still, most architectural blueprints propose otherwise these days, especially when it comes to ingesting on-premises data into the cloud or data coming from SaaS applications and Web APIs. Before we look into the reasons behind this and go through the different ways of ingesting data and integrating it into the target schema, let’s first look at the available deployment options of Databricks and why running it on Azure has its advantages.

Databricks Deployment

At the time of this writing, Databricks is available on Microsoft Azure, Amazon AWS and, quite recently, Google Cloud, which is, by the way, the first fully container-based Databricks runtime among the cloud providers. However, while it integrates well with some AWS and Google Cloud services such as Redshift and BigQuery respectively, the integration on Azure goes much deeper. It integrates with nearly all Azure services, and not only that: using Azure Data Factory’s Data Flows (ADFDF), you can even design scaled-out data transformation pipelines through a drag-and-drop experience on top of Databricks clusters hosted on Azure. That is why Databricks on Azure is considered a first-party service and hence enables an overall more seamless and streamlined experience.

Azure Databricks Data Ingestion

When working with Databricks, data is usually stored using the open-source storage layer Delta Lake, which sits on top of the actual data lake storage, such as Azure Data Lake Storage, AWS S3, GCS or HDFS. It brings ACID transactions to Apache Spark and stores data in a highly compressed and efficient columnar fashion using Apache Parquet. Delta Lake is also the foundation of Databricks’ Data Lakehouse paradigm, including its bronze, silver and gold architecture, with the idea of combining the best elements of both data lakes and data warehouses.
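To make this a bit more concrete, here is a minimal PySpark sketch of landing raw files in a bronze Delta table and reading them back. The storage account, container and folder names are placeholders, not a prescribed layout.

```python
from pyspark.sql import SparkSession

# On Azure Databricks a `spark` session already exists; creating one here
# only keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 path -- adjust storage account, container and folder.
bronze_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/bronze/sales"

# Read raw CSV files and persist them as a Delta table (Parquet files plus a
# transaction log, which is what provides the ACID guarantees mentioned above).
raw_df = spark.read.option("header", "true").csv("/mnt/landing/sales/")
raw_df.write.format("delta").mode("append").save(bronze_path)

# The same data can later be read back as a Delta table for the silver layer.
bronze_df = spark.read.format("delta").load(bronze_path)
```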

Ok, but what are the options now to ingest data? Well, there are basically three different ways to get data into Databricks:

1. Apache Spark APIs

First of all, there are the native Apache Spark APIs, which allow you to connect to both cloud and on-premises data sources, provided that a secure virtual network has been set up. Once this is done, native Spark lets you directly connect to and merge data coming from databases, messaging systems and file sources regardless of their origin. For on-premises sources, this means connecting to and merging data from local databases, Kafka topics and even a locally hosted network file system (NFS) mounted to Databricks. Direct communication between the open Apache Spark APIs and other Azure services requires the installation of additional libraries; data can then also be ingested from Azure SQL databases, Azure file systems and streaming events coming from Azure Event/IoT Hubs. This way, Spark meanwhile connects to nearly everything, thanks to its great community, including Databricks itself. A short sketch of what this looks like in practice follows after the figure below.

Spark data sources (edited by author)
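As a rough illustration, the following PySpark snippet reads a table from an Azure SQL database over JDBC and opens a streaming read on a Kafka topic. Server name, credentials, broker address and topic are placeholders, and the JDBC driver and Kafka connector are assumed to be available on the cluster (they ship with the Databricks runtime).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch read from an illustrative Azure SQL database via JDBC.
# In practice, pull the password from a Databricks secret scope instead of
# hard-coding a placeholder like this.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .load()
)

# Streaming read from a Kafka topic -- the same API also works against the
# Kafka-compatible endpoint of Azure Event Hubs.
kafka_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders")
    .load()
)
```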

2. Databricks APIs

In addition to the native Spark APIs, Databricks has developed additional connectors and ways to ingest data into Delta Lake, making the data ingestion process even easier. They are especially designed to simplify the state management of file data sources, i.e. keeping track of which files have already been ingested and processed. Traditionally, this has been achieved by moving ingested files to another folder or by tracking the state in a separate database. However, the complexity of such pipelines often causes problems, leading to wrong data or even data loss if files are not processed at all. To keep things simple, Databricks introduced the COPY INTO batch command, which incrementally loads new files on a scheduled basis. They also introduced the Auto Loader functionality, which does the same thing as a continuous process by grabbing files as they arrive in a given folder. This not only reduces the complexity of the pipelines but also the latency of incoming data streams. Both approaches automatically track and manage state behind the scenes, which means files in the source location that have already been loaded are skipped; a short sketch of both follows below.

Data ingestion into Delta Lake
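Here is a rough sketch of both options, assuming the target Delta table already exists and using placeholder paths. COPY INTO is a SQL command, wrapped here in `spark.sql` to keep everything in Python; Auto Loader is the `cloudFiles` streaming source, which is specific to the Databricks runtime.

```python
# COPY INTO: idempotent, scheduled batch ingestion -- files that were already
# loaded into the target Delta table are skipped on the next run.
spark.sql("""
  COPY INTO delta.`/mnt/delta/bronze/events`
  FROM '/mnt/landing/events/'
  FILEFORMAT = JSON
""")

# Auto Loader: the continuous counterpart, picking up files as they arrive.
autoloader_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/delta/_schemas/events")
    .load("/mnt/landing/events/")
)

(autoloader_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/events")
    .start("/mnt/delta/bronze/events"))
```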

3. Data Integration Partners

Despite the endless flexibility to ingest data offered by the methods above, businesses often rely on data integration tools from providers Databricks is partnering with. Their low-code/no-code, easy-to-implement environments come with many built-in connectors to a wide range of sources, especially SaaS offerings and Web APIs. They allow you to copy and sync data from hundreds of data sources using different auto-load and auto-update capabilities without writing a single line of code. Simplifying the data ingestion process this way helps overcome the complexity and maintenance cost typically associated with pulling data together from many disparate sources. At the time of this writing, Fivetran, Qlik, Infoworks, StreamSets and Syncsort are available on Databricks to ingest data, apart from the longer-standing native integration with Azure Data Factory. For instance, Azure Data Factory’s Copy Activity supports copying data from any of its supported data sources directly into the Delta Lake format, and Azure Data Factory Data Flows make it possible to design, manage, maintain and monitor Databricks pipelines through a drag-and-drop browser user interface, as mentioned above.

Data ingestion partners including some of the popular data sources

Azure Databricks Data Integration

While data integration typically includes data ingestion, it involves additional processes to ensure the incoming data is compatible with the repository and the existing data. In practice, the data ingestion options mentioned above are often used in conjunction. For instance, Azure Data Factory provisions data in a landing zone and orchestrates the processes, the Databricks APIs apply idempotent operations to the new data coming in, and the open Apache Spark API for Kafka streams is used to consume streaming data coming from Azure Event/IoT Hubs. (There is also a dedicated connector for Azure messaging endpoints, which would actually be more appropriate in this case.) Archiving policies can further be set on the data lake to automatically move older files to another folder. Altogether, this approach greatly reduces code and simplifies the data integration process; a sketch of the idempotent part follows after the figure below.

Common data integration pattern on Azure
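The idempotent step typically boils down to a Delta Lake MERGE: re-running the same load with the same input does not create duplicates, it simply updates the matching rows again. A minimal sketch, assuming a hypothetical silver customer table keyed on `customer_id` and a batch of newly landed records:

```python
from delta.tables import DeltaTable

# Hypothetical silver table and a batch of new records, e.g. provisioned by
# Azure Data Factory into a landing zone and loaded via Auto Loader or COPY INTO.
silver = DeltaTable.forPath(spark, "/mnt/delta/silver/customers")
updates_df = spark.read.format("delta").load("/mnt/delta/bronze/customers_new")

# MERGE upserts the new data: matching rows are updated, new rows are inserted.
(silver.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```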

The next data integration challenge is today’s heterogeneous landscape of data sources and bringing together batch and streaming data. Thanks to the Structured Streaming API available from Spark 2.x onward, this has become much easier. It allows you to express a computation on streaming data in the same way you express a batch computation on static data, achieving maximum code reuse. Among other things, you can apply streaming aggregations, event-time windows or stream-to-batch joins. Since it builds on the Spark SQL and Dataset/DataFrame APIs, data can easily be transformed using any of the provided languages (SQL, Python, Scala, Java, R). The Spark SQL engine takes care of running the process in an incremental and continuous fashion and updating the final result as streaming data continues to arrive. The result is a unified pipeline that incrementally processes and stores data, moving towards the idea of a Kappa architecture.

Delta Lake Architecture

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
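The snippet below sketches such a unified computation: a windowed count over event time, written as a stream from a bronze Delta path to a silver one. Replacing `readStream` with `read` (and `writeStream` with `write`) would turn the very same transformation into a batch job. The paths and the `event_time`/`device_id` columns are illustrative assumptions.

```python
from pyspark.sql import functions as F

# Streaming source: an assumed bronze Delta table with an `event_time` timestamp column.
events = spark.readStream.format("delta").load("/mnt/delta/bronze/events")

# Event-time windowed aggregation; the watermark bounds how late data may arrive.
per_minute = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"), "device_id")
    .count()
)

# Continuously maintain the result in a silver Delta table.
(per_minute.writeStream
    .format("delta")
    .outputMode("append")  # append works here because of the watermark
    .option("checkpointLocation", "/mnt/delta/_checkpoints/per_minute")
    .start("/mnt/delta/silver/events_per_minute"))
```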

However, such a data integration approach with only a single code base for batch and streaming data also has its own challenges, above all a reliable back-fill process. This can be genuinely difficult to implement, especially when replaying multiple days or even weeks of heavy streaming input. In traditional Lambda architectures, this is solved by running the streaming pipeline in real time while an additional batch pipeline is scheduled at a delayed interval to reprocess all the data for the most accurate results. The implied tradeoff is having to reconcile business logic across streaming and batch code bases. Of course, the back-fill could also be implemented using Spark Structured Streaming’s batch mode, but that would probably overwhelm downstream systems and processes that are simply not designed for such heavy workloads. That means running such performance-intensive one-shot back-fill processes should be avoided whenever possible; instead, consider reading incrementally from the sources.
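One hedged way to do this with the pipeline above is to run the back-fill as a rate-limited streaming job that stops once it has caught up, rather than as one massive one-shot batch. The `availableNow` trigger requires a recent Spark/Databricks runtime; on older runtimes `trigger(once=True)` behaves similarly, though without the per-batch rate limiting. Paths are again placeholders.

```python
# Incremental back-fill: process everything currently available in the source
# in throttled micro-batches, then stop.
backfill = (
    spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 1000)   # throttle to protect downstream systems
    .load("/mnt/delta/bronze/events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/backfill")
    .trigger(availableNow=True)
    .start("/mnt/delta/silver/events")
)

backfill.awaitTermination()
```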

Conclusion

Either way, Databricks greatly simplifies and streamlines the setup and maintenance of Apache Spark clusters, adding data security and automatic cluster management features. From an operational standpoint, this eliminates the complexity and cost of spinning up and maintaining clusters for distributed computing on your own. Compared with self-hosted Apache Spark or Hadoop environments, Databricks generally achieves a faster time-to-value, and cloud-native features like auto-scaling help reduce compute usage and lower overall costs. All this enables companies to be more efficient and to focus more on their business, including their data. Another interesting announcement was the fully container-based Databricks runtime hosted on Google Cloud using Kubernetes, which makes it sound quite possible that a complete on-premises version of Databricks could arrive soon. Let’s see.

--

Patrick Pichler
Creative Data

Promoting sustainable data and AI strategies through open data architectures.