Implement Compression – The Storage of Data

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the storage account you created in Exercise 3.1 ➢ choose the Containers menu item ➢ select a directory ➢ upload the two compressed CSV data files (one GZ, one ZIP) located in the Chapter04/Ch04Ex01 directory on GitHub at https://github.com/benperk/ADE ➢ navigate to the Azure Synapse Analytics workspace you created in Exercise 3.3 ➢ on the Overview blade, click the Open link in the Open Synapse Studio tile ➢ select the Data hub ➢ follow the instructions from Exercise 3.9 to create an integration dataset ➢ when configuring the dataset, select one of the two compressed brainjammer reading files ➢ check the First Row as Header check box ➢ and then select the None radio button for Import Schema. Figure 4.1 illustrates how this might look.

FIGURE 4.1 Azure Synapse Analytics compression

  2. Click the Compression Type drop‐down box ➢ select either gzip (.gz) or ZipDeflate (.zip), depending on the file you uploaded ➢ click the Test Connection link ➢ click the Preview Data link ➢ click the Commit button ➢ and then click the braces {} on the top right of Figure 4.1, which expose the JSON configuration for the integration dataset. Notice the compressionCodec and compressionLevel properties.
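To give a sense of where those two properties sit, the following is a minimal sketch of what a delimited-text integration dataset definition could look like, built as a Python dictionary and printed as JSON. The dataset name, linked service name, container, and file name are hypothetical placeholders, and the surrounding property layout is an assumption; the JSON that Synapse Studio shows for your dataset may differ.

```python
import json

# Hypothetical values: the dataset name, linked service reference,
# container (fileSystem), and file name are placeholders for illustration,
# not values taken from the exercise.
dataset = {
    "name": "ExampleCompressedCsv",
    "properties": {
        "linkedServiceName": {
            "referenceName": "example-linked-service",
            "type": "LinkedServiceReference",
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "example-reading.csv.gz",
                "fileSystem": "example-container",
            },
            "columnDelimiter": ",",
            # The two properties called out in step 2.
            "compressionCodec": "gzip",
            "compressionLevel": "Fastest",
            "firstRowAsHeader": True,
        },
        "schema": [],
    },
}

print(json.dumps(dataset, indent=2))
```

If you uploaded the ZIP file instead, the codec value would be the ZipDeflate option you chose in the drop‐down box.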

Table 4.1 shows the Azure Synapse Analytics file formats and their supported codecs.

TABLE 4.1 Supported codecs by file format

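| Format | bzip2 (.bz2) | gzip (.gz) | Deflate (.deflate) | ZipDeflate (.zip) | TarGZip (.tar/.tar.gz) | TAR (.tar) | Snappy | lz4 |
|---|---|---|---|---|---|---|---|---|
| Avro | No | No | Yes | No | No | No | Yes | No |
| Binary | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| Delimited | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Excel | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| JSON | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| ORC | No | No | No | No | No | No | Yes | No |
| Parquet | No | Yes | No | No | No | No | Yes | No |
| XML | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |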
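If you want to check a format and codec combination programmatically, Table 4.1 can be encoded as a small lookup. The sketch below is only a transcription of the table into Python; the names SUPPORTED and supports are hypothetical helpers, not part of any Azure SDK.

```python
# Codec names as they appear in the column headers of Table 4.1.
ALL_CODECS = frozenset(
    ["bzip2", "gzip", "Deflate", "ZipDeflate", "TarGZip", "TAR", "Snappy", "lz4"]
)

# Hypothetical lookup: each file format maps to the codecs marked "Yes".
SUPPORTED = {
    "Avro": frozenset(["Deflate", "Snappy"]),
    "Binary": ALL_CODECS - {"Snappy", "lz4"},
    "Delimited": ALL_CODECS,
    "Excel": ALL_CODECS,
    "JSON": ALL_CODECS,
    "ORC": frozenset(["Snappy"]),
    "Parquet": frozenset(["gzip", "Snappy"]),
    "XML": ALL_CODECS,
}


def supports(file_format: str, codec: str) -> bool:
    """Return True if Table 4.1 lists the codec as supported for the format."""
    return codec in SUPPORTED.get(file_format, frozenset())


print(supports("Parquet", "gzip"))  # True
print(supports("ORC", "gzip"))      # False
```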

In addition to the codec, a property called Level also affects file compression. The options for the Level property are Fastest and Optimal. If speed matters more than file size, choose Fastest as the Level value. If a smaller file size matters more than speed, choose Optimal. You configure this when creating an integration dataset in Azure Synapse Analytics. The same configuration is also possible in the JSON configuration for the integration dataset, similar to the following snippet:

"compressionCodec": "gzip",
"compressionLevel": "Fastest",
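The speed versus size trade-off behind Fastest and Optimal can be observed locally with Python's gzip module. This is a sketch under assumptions: treating Fastest as gzip level 1 and Optimal as gzip level 9 is an analogy chosen for illustration, not a documented mapping used by Azure Synapse Analytics, and the data is synthetic rather than the brainjammer readings.

```python
import gzip
import time

# Synthetic CSV-like payload (not the brainjammer data).
rows = [f"{i},{i % 5},{(i * 37) % 1000 / 10}\n" for i in range(200_000)]
data = ("id,electrode,value\n" + "".join(rows)).encode("utf-8")


def compress_with_level(payload: bytes, level: int) -> tuple[int, float]:
    """Compress payload with gzip and return (compressed size, seconds taken)."""
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    return len(compressed), time.perf_counter() - start


# Assumption: level 1 stands in for "Fastest", level 9 for "Optimal".
fast_size, fast_secs = compress_with_level(data, 1)
best_size, best_secs = compress_with_level(data, 9)

print(f"original: {len(data)} bytes")
print(f"level 1:  {fast_size} bytes in {fast_secs:.3f} s")
print(f"level 9:  {best_size} bytes in {best_secs:.3f} s")
```

The exact sizes and timings depend on the data and the machine, so measure with a representative sample of your own files before settling on a level.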

Two compressed files are located in the Chapter04/Ch04Ex01 directory on GitHub at https://github.com/benperk/ADE. Both are CSV files; one is compressed using the GZ codec, and the other is compressed using the ZIP codec. They contain the same data you used in some of the previous exercises, which is over 20 MB in a decompressed state. The compressed files are just over 2 MB, which represents a size reduction of roughly 10 to 1. Compressing your data files in this manner could therefore reduce the storage cost for that data by about a factor of 10.
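The ratio arithmetic, and the fact that a compressed CSV can still be read directly, can be sketched with the Python standard library. The data below is synthetic and the archive member name sample.csv is a placeholder; the 20 MB and 2 MB figures are the approximate sizes quoted above.

```python
import csv
import gzip
import io
import zipfile


def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Return how many times smaller the compressed data is."""
    return original_size / compressed_size


# Approximate sizes from the text: about 20 MB down to about 2 MB.
print(compression_ratio(20, 2))  # 10.0

# Synthetic CSV payload standing in for a real data file.
csv_bytes = (
    "id,value\n" + "".join(f"{i},{i % 7}\n" for i in range(50_000))
).encode("utf-8")

# GZ: compress the bytes directly.
gz_bytes = gzip.compress(csv_bytes)

# ZIP: write one deflate-compressed member into an in-memory archive.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("sample.csv", csv_bytes)
zip_bytes = zip_buffer.getvalue()

print(f"gzip ratio: {compression_ratio(len(csv_bytes), len(gz_bytes)):.1f}")
print(f"zip ratio:  {compression_ratio(len(csv_bytes), len(zip_bytes)):.1f}")

# Read the header row straight from the gzip-compressed bytes.
with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as handle:
    header = next(csv.reader(handle))
print(header)
```

The ratio you get depends heavily on how repetitive the data is, so the 10 to 1 figure applies to these particular files rather than to CSV data in general.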
