Upserts and Deletes – Data Sources and Ingestion

The example of combining multiple data files into a single table for analysis in Exercise 3.15 simply appended one file onto the next. This required you to remove and re‐create the table. Instead of dropping and re‐creating the table, you could also use UPDATE, MERGE, or DELETE capabilities to modify the data on a delta table. Performing an update is the same as with other SQL‐based datastores. The following syntax illustrates how to achieve this and is followed by the output:

UPDATE BRAINWAVES SET SCENARIO = ‘Flipboard’ WHERE SCENARIO = ‘FlipChart’;
+——————-+
| num_affected_rows |
+——————-+
| 23975             |
+——————-+

The MERGE SQL command looks something like the following. This command is often referred to as an upsert activity. An upsert is useful for retrieving rows from a view, a source table, or a DataFrame, and then placing them into a target delta table.

MERGE INTO BRAINWAVESUSING BRAINWAVESUPDATESON BRAINWAVES.ID = BRAINWAVESUPDATES.IDWHEN MATCHED THENUPDATE SET ID = BRAINWAVESUPDATES.ID,SESSION_DATETIME = BRAINWAVESUPDATES.SESSION_DATETIME,READING_DATETIME = BRAINWAVESUPDATES.READING_DATETIME,SCENARIO = BRAINWAVESUPDATES.SCENARIO,ELECTRODE = BRAINWAVESUPDATES.ELECTRODE,FREQUENCY = BRAINWAVESUPDATES.FREQUENCY,VALUE = BRAINWAVESUPDATES.VALUEWHEN NOT MATCHED THEN INSERT (ID, SESSION_DATETIME, READING_DATETIME,SCENARIO,ELECTRODE,FREQUENCY,VALUE)VALUES( BRAINWAVESUPDATES.ID,BRAINWAVESUPDATES.SESSION_DATETIME, BRAINWAVESUPDATES.READING_DATETIME, BRAINWAVESUPDATES.SCENARIO, BRAINWAVESUPDATES.ELECTRODE,BRAINWAVESUPDATES.FREQUENCY,BRAINWAVESUPDATES.VALUE)

The target delta table, BRAINWAVES, follows the INTO syntax, with the source table following USING. The criteria used to check for a match follows the ON syntax. The matching behavior resembles what you might find in the relational database context, where a primary key exists. As there is no referential integrity built into these tables, it is possible to make a match and perform an update that is not expected. For example, what if the IDs on the update table get reset? You would end up unintentionally overwriting existing data. Thank goodness for the history logs if this ever happens. Next, there is some code logic that uses WHEN and THEN to determine whether the row already exists. If there are matching IDs, then the data is updated; if there are no matching IDs for the given row, then the data is inserted.

You can remove rows from the delta table as follows, which is the same as with many other datastores:

DELETE FROM BRAINWAVES WHERE READING_DATETIME < ‘2022-01-01 00:00:00’

The DELETE SQL command will remove all the data that matches the WHERE clause. Again, this being a delta table, you have the ability to recover or query the table after this delete being performed.

Event Hubs and IoT Hub

Both Event Hubs and IoT Hub were introduced in Chapter 1 but are discussed in greater detail here, as they are both key contributors for data ingestion. Refer to Figure 1.22 to see how their deployment into the ingestion stage of your Big Data solution might look. IoT Hub is built on top of Event Hubs, so all capabilities found on Event Hubs are also available with IoT Hub, but not vice versa. Table 3.19 compares the two products. You can use this to determine which product you need for your ingestion scenario.

TABLE 3.19 Event Hubs vs. IoT Hub

FeatureEvent HubsIoT Hub
Send messages to the cloud from devices.YesYes
HTTPS, AMQP over WebSockets, AMQP.YesYes
MQTT over WebSockets, MQTT.NoYes
Unique identity per device.NoYes
Send messages from devices to the cloud.NoYes
Support for more than 5,000 concurrent connections.NoYes

IoT Hub supports a very useful option: bidirectional communications between the device and the cloud. This outbound message capability is focused more on an IoT solution than on a Big Data solution. The data capturing from devices in the cloud is more aligned with data analytics, although this feature allows tight integration with both, if your requirements necessitate it. IoT Hub also gives you the option to uniquely authenticate each IoT device with its own security credential, whereas Event Hubs authenticates using a shared access signature (SAS) token. A SAS token is a shared key used for authenticating a connection and resembles the following:

yCLIBBB76NKSLlQAOe64g8O2JyKMgqkZL91NsxKBLEI=

There are some security concerns around using a shared token for authentication. For example, it needs to be used from each device that needs access to the hub. It is also necessary to ensure that the credential does not become compromised. Using Azure Key Vault, for example, can help with this. IoT Hub supports managed identity, while Event Hubs does not currently support it. The last point to know is that IoT hubs costs much more than event hubs. Prior to provisioning an IoT hub, check the prices, as they can cost thousands of US dollars per month, even while idle.

Perform Exercise 3.16 to provision an Azure Event Hub.

Write a Comment

Your email address will not be published. Required fields are marked *