Configure Delta Lake – Data Sources and Ingestion

  1. Log in to the Azure portal at https://portal.azure.com ➢ navigate to the Azure Databricks service you created in Exercise 3.14 ➢ click the Launch Workspace button on the Overview blade ➢ choose the Compute menu item ➢ and then select the cluster you created previously. If the cluster is not running, select it and click Start.
  2. Download and decompress the file brainwavesMeditationAndPlayingGuitar.zip located in the Chapter03/Ch03Ex15 directory on GitHub at https://github.com/benperk/ADE ➢ choose the Data menu item and select Create Table ➢ drag and drop the brainwavesMeditation.csv file onto the Upload File tab ➢ click the Create Table with UI button ➢ select the cluster from the Cluster drop-down box ➢ click the Preview Table button ➢ select the First Row Is Header check box ➢ change STRING to DOUBLE in the drop-down box under the VALUE heading ➢ and then click Create Table.
  3. Select the Data menu item ➢ click the database you placed the table into ➢ click Create Table ➢ click the DBFS tab ➢ and then click FileStore. Notice the file that you just uploaded.
  4. Select the Workspace menu item ➢ select Users ➢ select the down arrow next to your account ➢ select Create ➢ select Notebook ➢ provide a name ➢ set Python as the default Language ➢ select the cluster ➢ click Create ➢ enter the following code snippet into the cell ➢ and then run the code.

df = spark.read \
    .option("header", "true") \
    .csv("/FileStore/tables/brainwavesMeditation.csv")
df.write.mode("overwrite").format("delta").save("/FileStore/data/2022/03/14")
brainwaves = spark.read.format("delta").load("/FileStore/data/2022/03/14")
display(brainwaves)
print(brainwaves.count())
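
If you want to verify what the delta write produced, the following optional sketch lists the target directory with dbutils.fs.ls, which is available in Databricks notebooks. You should see one or more Parquet data files alongside a _delta_log directory that holds the transaction log.

# Optional check: expect Parquet data files plus a _delta_log/ directory
# containing the JSON transaction log entries
display(dbutils.fs.ls("/FileStore/data/2022/03/14"))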

5. Run the following code snippet:

display(spark.sql("DROP TABLE IF EXISTS BRAINWAVES"))
display(spark \
    .sql("CREATE TABLE BRAINWAVES USING DELTA LOCATION '/FileStore/data/2022/03/14'"))
display(spark.table("BRAINWAVES").select("*").limit(10))
print(spark.table("BRAINWAVES").select("*").count())
6. Upload the brainwavesPlayingGuitar.csv file using the same process performed in step 2 ➢ navigate back to the workspace ➢ open the previous notebook ➢ and then execute the following code snippet:

df = spark.read \
    .option("header", "true") \
    .csv("/FileStore/tables/brainwavesPlayingGuitar.csv")
df.write.mode("append").format("delta").save("/FileStore/data/2022/03/14")
brainwaves = spark.read.format("delta").load("/FileStore/data/2022/03/14")
print(brainwaves.count())
display(spark.sql("DROP TABLE IF EXISTS BRAINWAVES"))
display(spark \
    .sql("CREATE TABLE BRAINWAVES USING DELTA LOCATION '/FileStore/data/2022/03/14'"))
print(spark.table("BRAINWAVES").select("*").count())

7. Run the following code snippet, and then perform some experiments with charting:

display(spark.sql("""
    SELECT SCENARIO, ELECTRODE,
           CAST(AVG(VALUE) AS DECIMAL(7,3)) AS AverageReading,
           PERCENTILE(VALUE, 0.5) AS MedianValue
    FROM BRAINWAVES
    GROUP BY SCENARIO, ELECTRODE"""))
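
If you prefer to build a chart in code rather than with the notebook's built-in chart controls, here is a minimal sketch that converts a similar aggregation to pandas and plots it with matplotlib, which is preinstalled on standard Databricks runtimes. The table and column names come from the exercise; the chart styling is illustrative.

# A sketch of charting the average readings with matplotlib instead of
# the notebook's built-in chart controls
import matplotlib.pyplot as plt

pdf = spark.sql(
    "SELECT SCENARIO, ELECTRODE, AVG(VALUE) AS AverageReading "
    "FROM BRAINWAVES GROUP BY SCENARIO, ELECTRODE"
).toPandas()

# One bar group per electrode, one bar per scenario
pivoted = pdf.pivot(index="ELECTRODE", columns="SCENARIO", values="AverageReading")
pivoted.plot(kind="bar", title="Average reading by electrode and scenario")
plt.ylabel("Average reading")
plt.show()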

The code snippets and importable notebook are available in the folder Chapter03/Ch03Ex15 on GitHub at https://github.com/benperk/ADE.

When you first uploaded the CSV file and created a table, it was not a Delta Lake table. That means the benefits, which you will read about later, would not be realized while performing your data transformations and analytics. The first code snippet in Exercise 3.15 loaded the uploaded CSV data into a DataFrame and then converted it into the delta format; the conversion was achieved with the .format("delta") method on the write. The second code snippet used the Parquet files and transaction log written by the first snippet to create a delta table, then queried the table and counted the total number of rows. In a real-world scenario you would expect to receive files at different intervals, which is what the third code snippet represents. That snippet loads another CSV file containing brain wave readings, converts it to delta format, and appends it to the existing dataset. The append is achieved with the .mode("append") method instead of, for example, the .mode("overwrite") method. The existing table is then dropped, re-created from the appended delta files, and queried again. The final code snippet performs a high-level review of the brain wave data.
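
To make one of those Delta Lake benefits concrete, the following is a minimal sketch that inspects the table's transaction log and reads an earlier version of the data (time travel). It assumes the cluster's Databricks runtime provides the delta Python module and uses the same path as the exercise; after steps 4 and 6, version 0 should be the overwrite and version 1 the append.

# A sketch of two Delta Lake benefits: an auditable transaction log and
# time travel. Each write in the exercise (overwrite, then append) is
# recorded as a table version in _delta_log.
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/FileStore/data/2022/03/14")
display(deltaTable.history())  # one row per write operation

# Read the data as it existed before the append (version 0)
firstVersion = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/FileStore/data/2022/03/14")
print(firstVersion.count())  # should match the first file's row count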

To transpose this exercise onto a DLZ or DQL architecture flow, consider the CSV files and their associated Spark tables as raw, or bronze. Consider the two actions that converted the data into delta format and appended the multiple files as the pipeline work necessary to move the data from bronze into a more enriched state like silver. The final code snippet is an example of the preliminary analytics that help confirm the data is in a decent state to begin its final stage into a gold, business-ready state. Once the process is finalized and runs consistently, it could and should be automated using a scheduled job. Remember that you can configure a job to run the contents of a notebook, so once the data is flowing into the defined location, create the job and you have a pipeline performing ingestion and transformation from bronze to silver; a sketch of creating such a job follows. Two parts of the overall process are still missing. The first is automated ingestion; remember that you manually placed the data onto the platform. The second is the transformation to gold and the serving of the data to consumers. That final process is what this book is working toward; you are still in the ingestion chapter, so there is much more to come.
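
As a concrete illustration of that scheduling step, here is a minimal sketch that registers the notebook as a scheduled job through the Databricks Jobs API 2.1. The workspace URL, personal access token, notebook path, and cluster ID are placeholders you must replace with your own values; the job name and cron expression (daily at 06:00 UTC) are likewise illustrative.

# A sketch of scheduling the notebook as a recurring job via the
# Databricks Jobs API 2.1 (POST /api/2.1/jobs/create)
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

payload = {
    "name": "bronze-to-silver-brainwaves",
    "tasks": [{
        "task_key": "convert-and-append",
        "notebook_task": {"notebook_path": "/Users/<you>/Ch03Ex15"},  # placeholder path
        "existing_cluster_id": "<cluster-id>",                        # placeholder
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # daily at 06:00
        "timezone_id": "UTC",
    },
}

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())  # returns the new job_id on success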
