Implement Physical Data Storage Structures – The Storage of Data
EXAM DP‐203 OBJECTIVES COVERED IN THIS CHAPTER:
- Implement a partition strategy
- Design and implement the data exploration layer
WHAT YOU WILL LEARN IN THIS CHAPTER:
- Storing raw data in Azure Databricks for transformation
- Storing data using Azure HDInsight
- Storing prepared, trained, and modeled data
The primary objective of Chapter 3, “Data Sources and Ingestion,” was to explain the sources of data and the Azure products for ingesting it. As soon as data exists, it is stored somewhere. The data can be on an IoT device, on a flash drive, in memory, or on the wire somewhere between where it was created and where it is going. The ingestion products discussed in Chapter 3 are the entry point for that data onto the Azure platform. Azure Synapse Analytics, Azure Data Factory, Event Hubs, IoT Hub, and Kafka are products optimized for ingestion. Azure products optimized for storing the just-ingested data include ADLS, Azure SQL, Azure Cosmos DB, Azure HDInsight, and Azure Databricks. This chapter discusses techniques for optimally storing a potentially vast variety and volume of incoming data.
Implement Physical Data Storage Structures
When you think of something as physical, you usually mean it is an object you can touch with your hands. The same is true when it comes to physical data storage. A physical data storage device is a disk drive connected to or mapped from a computer. The data structures placed onto the physical disk are the directory patterns in which the files containing your data are stored.
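As an illustration of a directory pattern, the following minimal sketch builds a date-based folder path for a data file. The layout (root, dataset, year, month, day) and the names `partition_path`, `raw`, `telemetry`, and `readings.csv` are assumptions made for this example only; they are not prescribed by any Azure product.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, dataset: str, day: date, filename: str) -> str:
    """Build a hypothetical root/dataset/YYYY/MM/DD/filename directory pattern."""
    # Zero-padded year, month, and day folders keep paths sortable by date.
    path = PurePosixPath(root) / dataset / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / filename
    return str(path)


print(partition_path("raw", "telemetry", date(2022, 1, 5), "readings.csv"))
# prints raw/telemetry/2022/01/05/readings.csv
```

Grouping files by date in this way is one common choice because a query for a single day then only needs to read one folder.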
Implement Compression
Processing large files can cause networking bottlenecks and increase the number of I/O operations. Compression reduces the size of files and can therefore have a positive impact on network and I/O latencies. Because a company is charged for the storage space it occupies and for ingress/egress data transfers, it makes sense to use as little of both as possible. Data compression makes the file in which the data is stored smaller; decompression reverts the data to its original form and size and is required before the content within the file can be queried. The approach for performing the compression/decompression of data, also known as encoding/decoding, begins with choosing the codec. Complete Exercise 4.1 to implement compression and learn more about what a codec is.
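Before working through the exercise, the encode/decode round trip can be sketched locally with codecs from the Python standard library. This is only an illustration under stated assumptions: the `CODECS` table, the `round_trip` helper, and the `SAMPLE` payload are hypothetical names invented here, and the three codecs shown are not necessarily the ones used in the exercise.

```python
import bz2
import gzip
import lzma

# Each codec is an (encode, decode) pair from the Python standard library.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

# Hypothetical sample data: repetitive CSV rows, which compress well.
SAMPLE = b"timestamp,sensor,value\n" + b"2022-01-01T00:00:00,alpha,0.5\n" * 1000


def round_trip(codec_name, data):
    """Compress data with the chosen codec, then decompress it again.

    Returns the compressed size in bytes and the restored data.
    """
    encode, decode = CODECS[codec_name]
    encoded = encode(data)
    return len(encoded), decode(encoded)


for name in CODECS:
    size, restored = round_trip(name, SAMPLE)
    # Decompression must give back exactly the original bytes.
    assert restored == SAMPLE
    print(f"{name}: {len(SAMPLE)} bytes -> {size} bytes")
```

The compressed sizes differ by codec and by data, which is why choosing the codec is the first step: some codecs favor smaller output, others favor faster encoding and decoding.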