Summary – Data Sources and Ingestion
This chapter covered the sources of data and the volume, variety, and velocity of ingestion. Azure Synapse Analytics contains ample features to manage ingestion at Big Data scale. If you currently have an HDInsight cluster or run your ingestion with Databricks, Azure provides managed services for running these at scale. There are also many different Azure products designed to house the ingested data. The product of choice depends on the variety of data ingested, such as blobs, files, documents, or relational data. Managing data size, age, and retention; pruning and archiving data; and storing it so that those management activities can be monitored are all necessary so that appropriate actions can be taken.
Properly partitioning data results in optimized query performance. You learned to partition files by using the df.write.partitionBy() function and to distribute table data using round‐robin, hash, and replicated distributions. Another partition‐like approach for managing files is to store them in a directory structure organized by date and time, such as YYYY/MM/DD/HH. You learned about star schemas, which are made up of dimension and fact tables, about managing dimension tables with SCD types, and about using temporal tables for historical analysis. The concepts of an analytical store, a metastore, and a data lake should also be clear in your head.
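The date‐and‐time directory layout described above can be sketched in plain Python. This is an illustrative helper, not part of any Azure SDK; the container and account names in the example are hypothetical:

```python
from datetime import datetime, timezone

def partition_path(base: str, ts: datetime) -> str:
    """Build a YYYY/MM/DD/HH directory path for an ingestion timestamp."""
    return f"{base}/{ts.year:04d}/{ts.month:02d}/{ts.day:02d}/{ts.hour:02d}"

# An event arriving at 2023-07-04 09:15 UTC lands under .../2023/07/04/09.
# In Spark, df.write.partitionBy("year", "month", "day", "hour") produces a
# comparable partitioned layout (with key=value directory names).
path = partition_path(
    "abfss://raw@mydatalake.dfs.core.windows.net/ingest",  # hypothetical ADLS path
    datetime(2023, 7, 4, 9, 15, tzinfo=timezone.utc),
)
```

Pruning and archiving become simple with this layout, because a whole hour, day, or month of data can be deleted or moved by removing a single directory subtree.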
In the final portion of this chapter, you provisioned a lot of Azure products, like Azure Synapse Analytics, Azure Data Factory, and Azure Databricks. You also learned about some streaming techniques and products like Event Hubs, IoT Hub, Kafka, and Azure Stream Analytics. After completing the exercises in this chapter, your knowledge regarding each of these products should be at a respectable skill level.
Exam Essentials
Design a data storage structure. Azure Data Lake Storage (ADLS) is the centerpiece of a Big Data solution running on Azure. Numerous other products can help in this capacity as well, like Azure Cosmos DB and Azure SQL. Each product is targeted at a specific type of data structure: files, documents, or relational data. Managing the ingestion of data is ongoing, and actions like pruning and archiving are necessary to keep this stage of the Big Data process in a healthy and performant state.
Design a partition strategy. A partition is a method for organizing your data in a way that results in better management, data discovery, and query performance. Optimizing the size of partitions based on groupings like arrival date and time or a hash distribution also makes ingestion and management more performant. Monitoring the skew of your data and reshuffling it for better performance is something you must do diligently.
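The difference between hash and round‐robin distribution, and why skew matters, can be sketched in a few lines of Python. This is a conceptual illustration only: dedicated SQL pools in Azure Synapse Analytics do spread data across 60 distributions, but the hash function below is not the one Synapse actually uses:

```python
import zlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # a dedicated SQL pool spreads data across 60 distributions

def hash_distribution(key: str, n: int = NUM_DISTRIBUTIONS) -> int:
    """Assign a row to a distribution by hashing its key.
    The same key always lands in the same distribution, which co-locates
    rows for joins -- but a few hot keys can cause skew."""
    return zlib.crc32(key.encode()) % n

def round_robin_distribution(row_index: int, n: int = NUM_DISTRIBUTIONS) -> int:
    """Round-robin simply cycles through the distributions, giving an even
    spread but no co-location guarantee."""
    return row_index % n

# Measuring skew: count how many rows each distribution receives.
keys = ["customer-1"] * 50 + ["customer-2"] * 5  # a skewed, hypothetical key set
counts = Counter(hash_distribution(k) for k in keys)
```

With a hot key like `customer-1` above, one distribution receives most of the rows; detecting that imbalance (for example, with `DBCC PDW_SHOWSPACEUSED` in Synapse) is the signal to pick a better distribution column or switch to round‐robin.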
Design a serving layer. The serving layer is one part of the lambda architecture. A hot and a cold data path can provide real‐time or near real‐time access to data. To support such a process, you learned the concepts of a star schema; slowly changing dimension (SCD) tables of Type 1, 2, 3, and 6; and temporal tables.
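The core mechanics of a Type 2 SCD, which preserves history by expiring the current row and inserting a new version, can be sketched as a minimal in‐memory example. The column names (start_date, end_date) are illustrative assumptions, not a fixed standard:

```python
from datetime import date

def scd2_update(dimension: list, key: str, new_attrs: dict, today: date) -> None:
    """Apply a Type 2 slowly changing dimension update: if the member's
    attributes changed, close out the current row (set its end_date) and
    append a new current row. A minimal sketch, not production code."""
    for row in dimension:
        if row["key"] == key and row["end_date"] is None:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return  # nothing changed; keep the current row as-is
            row["end_date"] = today  # expire the old version
    # Insert the new current version (also handles brand-new members).
    dimension.append({"key": key, **new_attrs,
                      "start_date": today, "end_date": None})

# A customer moves city: the old row is kept for historical analysis.
dim = [{"key": "C1", "city": "Oslo",
        "start_date": date(2020, 1, 1), "end_date": None}]
scd2_update(dim, "C1", {"city": "Bergen"}, date(2023, 5, 1))
```

By contrast, a Type 1 update would simply overwrite the city in place, losing the history; temporal tables in Azure SQL achieve a similar versioning effect automatically via a system‐managed history table.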
Data ingestion. Azure provides numerous products designed for the ingestion of data. Azure Synapse Analytics is the one Microsoft is driving customers toward and the product to which new features are being added. Azure Data Factory provides ingestion capabilities; however, most of its existing capabilities are, or will soon be, found in Azure Synapse Analytics. Customers who use Databricks and Apache Spark can migrate their existing workloads to Azure. Azure Databricks is a Microsoft‐managed deployment of the Databricks platform, which is built on open‐source Apache Spark.
Data streaming. Two Azure product groupings are optimized to support a streaming solution on Azure. Event Hubs/IoT Hub and Azure Stream Analytics are recommended to customers who want to use Microsoft products in the cloud. Customers who have an existing streaming solution based on Kafka and Apache Spark can provision and manage that workload on Azure.