Data Ingestion – Data Sources and Ingestion

Figure 3.75 shows how data is ingested into Azure Databricks.

There are numerous ways to ingest data, including streaming it via Event Hubs, Stream Analytics, or Kafka, or copying it from a remote source using Azure Data Factory. You can also stream data directly from an IoT device and store it in a datastore. The ingested data may be stored, initially in a raw format, in an ADLS container. A technology called Delta Lake is built into Azure Databricks. Delta Lake is a software library that sits on top of your data lake to help manage the activities performed on the data. It is especially helpful for data landing zones (DLZ), which classify data into bronze, silver, and gold zones. The Azure Databricks features and capabilities described in this section are then used to transform the data from its initial raw state into something that can be analyzed, studied, and used for making business or scientific decisions. The model illustrated in Figure 3.75 is often referred to as the medallion architecture.

FIGURE 3.75 Azure Databricks data ingestion
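
The progression from raw to analysis-ready data can be illustrated with a small sketch. The following is plain Python rather than Azure Databricks or Delta Lake code, and everything in it is an assumption made for illustration: the sample device readings, the cleansing rules, and the helper names to_silver and to_gold are hypothetical. It only shows the idea of moving records through bronze, silver, and gold zones.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw (bronze) records, as they might land from a device stream.
BRONZE = [
    {"device": "sensor-1", "temp": "21.5"},
    {"device": "sensor-1", "temp": "22.5"},
    {"device": "sensor-2", "temp": "n/a"},  # reading that cannot be parsed
    {"device": "", "temp": "19.0"},         # missing device identifier
    {"device": "sensor-2", "temp": "18.0"},
]


def to_silver(bronze):
    """Cleanse raw records: keep rows with a device id and a numeric reading."""
    silver = []
    for row in bronze:
        device = str(row.get("device", "")).strip()
        try:
            temp = float(row.get("temp", ""))
        except (TypeError, ValueError):
            continue  # drop rows whose reading is not a number
        if device:
            silver.append({"device": device, "temp": temp})
    return silver


def to_gold(silver):
    """Aggregate cleansed records into a business-level view (mean per device)."""
    by_device = defaultdict(list)
    for row in silver:
        by_device[row["device"]].append(row["temp"])
    return {device: mean(temps) for device, temps in by_device.items()}


SILVER = to_silver(BRONZE)
GOLD = to_gold(SILVER)
```

In this sketch the bronze list is never modified. Each zone is derived from the one before it, which mirrors the way the raw data remains available while cleaner representations are built on top of it.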

Delta Lake on Databricks

Delta Lake is a storage and management software layer that provides numerous benefits for working with your data lake over the default features of Azure Databricks. A data lake is illustrated in Figure 3.76.

Data comes from many different sources. A data lake is a single location in which to store all this varied data. You might consider the data quality level (DQL) of the data in the data lake to be raw, or bronze. From that state, you implement something like the lambda architecture (refer to Figure 3.13) or use the capabilities in Delta Lake to progress the data's transformation to business-level, or gold, quality data. Some of the benefits you gain by implementing Delta Lake are the following:

  • ACID transactions
  • Streaming and batch unification
  • Time travel
  • Upserts and deletes

FIGURE 3.76 A data lake
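
To give a feel for what upserts, deletes, and time travel mean before you work with the real feature, here is a conceptual toy in plain Python. It is not Delta Lake's API and does not reflect how Delta Lake is implemented. The VersionedTable class and its methods are hypothetical names, and the sketch assumes every row is a dictionary with an "id" key. Each change builds a complete new snapshot and only then commits it as a new version, so a failed change leaves the table as it was.

```python
import copy


class VersionedTable:
    """Toy in-memory table that keeps one full snapshot per committed version."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    @property
    def version(self):
        """Number of the most recently committed version."""
        return len(self._snapshots) - 1

    def _commit(self, rows):
        # Appending the finished snapshot is the only step that changes the table.
        self._snapshots.append(rows)
        return self.version

    def upsert(self, rows):
        """Update rows whose id already exists and insert the rest."""
        new = copy.deepcopy(self._snapshots[-1])
        for row in rows:
            new[row["id"]] = dict(row)  # a KeyError here aborts before commit
        return self._commit(new)

    def delete(self, predicate):
        """Remove every row for which predicate(row) is true."""
        current = self._snapshots[-1]
        new = {k: dict(v) for k, v in current.items() if not predicate(v)}
        return self._commit(new)

    def read(self, version=None):
        """Return the rows as of a version (the latest when version is None)."""
        snapshot = self._snapshots[-1 if version is None else version]
        return sorted((dict(r) for r in snapshot.values()), key=lambda r: r["id"])


table = VersionedTable()
v1 = table.upsert([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
v2 = table.upsert([{"id": 2, "name": "B"}, {"id": 3, "name": "c"}])
v3 = table.delete(lambda row: row["id"] == 1)

latest = table.read()       # rows after the delete
as_of_v1 = table.read(v1)   # "time travel" back to the first commit
```

Reading an older version returns the rows exactly as they were at that commit, which is the idea behind time travel. Storing a full copy per version is a simplification chosen to keep the sketch short.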

Before proceeding to the details of those features, complete Exercise 3.15, where you will perform some activities with Delta Lake on the Azure Databricks service you created in Exercise 3.14. Delta Lake version 1.0.0 is installed by default with Databricks runtime version 9.1 LTS. Version 9.1 LTS is currently the default version and should be the one you selected when you provisioned the cluster in Exercise 3.14.
