ACID Transactions – Data Sources and Ingestion
Anyone who has worked with computers, even for a short amount of time, knows that unexpected events happen. You can trace this all the way down to the 3.3 V of electricity required to flip a transistor to true. Every layer above that has fault tolerance built in to manage these unexpected events as gracefully as possible. Exceptions happen all the time, but because of the layers of technology that exist between the transistors and you, those exceptions are concealed. Most of the time you do not even know they happened, because they are handled and self-healed. One such technology that abstracts and isolates exceptions from users and processes is atomicity, consistency, isolation, and durability (ACID) transaction enforcement.
Apache Spark, the engine Azure Databricks runs on its compute nodes, is not ACID-compliant on its own. That capability comes from Delta Lake; therefore, if you use Delta Lake on Azure Databricks, you can achieve ACID compliance. Consider the following scenarios in which ACID compliance makes a difference:
- Failed appends or overwrites
- Concurrent reading and writing of data
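Both scenarios come down to whether a write is recorded in a transaction log. The following is a minimal, hypothetical sketch of the difference; it assumes a running SparkSession (available as spark on Azure Databricks), Delta Lake support (enabled by default on Azure Databricks), and made-up /mnt/datalake paths:

    # Hypothetical sample data; any DataFrame would do.
    df = spark.range(0, 1000).withColumnRenamed("id", "sale_id")

    # Plain Parquet append: there is no transaction log, so a failure
    # partway through can leave partial files behind.
    df.write.mode("append").parquet("/mnt/datalake/sales_parquet")

    # Delta Lake append: the write is committed to a transaction log,
    # so it is all-or-nothing; readers never see a half-written append.
    df.write.format("delta").mode("append").save("/mnt/datalake/sales_delta")

The two writes look nearly identical in code; the difference is that the Delta write becomes visible only when its commit lands in the table's transaction log, which is what the guarantees discussed next rely on.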
Atomicity means that a transaction either completes in its entirety or is rolled back. Consider the execution of the code snippet df.write.mode("append") and assume an exception happened before all the data was appended to the file. Since Apache Spark does not support atomic transactions, you can end up with a partially written file and lost data. When you execute the code df.write.mode("overwrite"), the existing file is deleted and a new one is created. This can fail the consistency test because if the method fails partway through, data can be lost. The rule of consistency states that data must always be in a valid state, and with this process there is a window during which there is no data at all. This can also violate durability, which states that once data is committed, it is never lost. Delta Lake has built-in capabilities that provide the ACID guarantees Apache Spark alone is missing.
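To illustrate, the following sketch (same hypothetical path and DataFrame as above) overwrites the Delta table and then reads the prior version back. The versionAsOf option and the DeltaTable.history() method are part of Delta Lake; the version number 0 is just an assumption for this example:

    from delta.tables import DeltaTable

    # A Delta overwrite is a logical operation recorded in the
    # transaction log; if it fails, the new version is never committed
    # and readers continue to see the previous snapshot.
    df.write.format("delta").mode("overwrite").save("/mnt/datalake/sales_delta")

    # Durability in practice: earlier committed versions remain
    # queryable via time travel.
    previous = (spark.read.format("delta")
                .option("versionAsOf", 0)
                .load("/mnt/datalake/sales_delta"))

    # The commit history that backs these guarantees.
    DeltaTable.forPath(spark, "/mnt/datalake/sales_delta").history().show()

Because nothing is physically deleted at commit time, there is no window during which the table holds no data, which is exactly the consistency and durability gap described above.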
In the majority of enterprise production scenarios, data sources are written to simultaneously by more than a single individual or program. When the volume of writes to a data source is high, its contents change frequently. That means a program that retrieves and processes data from the source could receive a different result than another program running the same query a few seconds later. This is where isolation comes into play. To be compliant, an operation must be isolated from other concurrent operations so that they do not impact one another. Just imagine what happens if someone executes df.write.mode("overwrite"), which takes a minute to complete, and another program attempts to read that file 10 seconds after the overwrite started. The reader can observe a partially overwritten, invalid state, which violates the rule of isolation. Many DBMS products let you choose an isolation level for data operations, including read uncommitted, read committed, repeatable read, and serializable. Since Apache Spark does not provide this, you can use Delta Lake to make your Azure Databricks data analytics solution ACID-compliant.
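As a rough sketch of that isolation guarantee, again using the hypothetical sales_delta path:

    # Reader: each query against a Delta table resolves to a
    # consistent, committed snapshot of the transaction log.
    readers_view = spark.read.format("delta").load("/mnt/datalake/sales_delta")

    # Writer (running concurrently in another process):
    # new_df.write.format("delta").mode("overwrite").save("/mnt/datalake/sales_delta")

    # This count reflects a committed snapshot; it never observes a
    # table that is mid-overwrite, which is the isolation guarantee
    # Apache Spark alone cannot make for plain files.
    readers_view.count()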