Clusters – Data Sources and Ingestion

In Exercise 3.14 you created an all‐purpose cluster. The attributes and configuration details were covered in the discussion following the exercise. In most cases an all‐purpose cluster and a job cluster are configured the same way; what differs is how you use them. Azure Databricks has two types of clusters: interactive (all‐purpose) and automated (job). You use an all‐purpose cluster to analyze data via notebooks, and that analysis can be collaborative and interactive with a team. A job cluster is one allocated to the execution of automated and scheduled jobs, which need to run fast and robustly. When your ingestion and analysis are complete, you move the final version of the notebook to be executed on a job cluster. Each cluster type provides the CPU and memory required for a performant platform.

Pools

The concept of a pool appears in many scenarios and always has the same purpose. A pool is a number of objects waiting to serve their intended purpose. Consider connection pools or thread pools, for example. It takes milliseconds to seconds to instantiate a database connection or a thread to perform the expected activity. A connection in a connection pool is already open and waiting to be used for the manipulation or retrieval of data, and a thread in a thread pool is on standby, waiting to be given code to execute. The same applies to a pool of nodes. If you want to improve the performance and startup times of the workloads running on Azure Databricks, you can create a pool of nodes. These nodes are provisioned and on standby, ready to execute the code they are given. Without a pool, each node must be provisioned and configured before it can run any code, and that can take minutes, in some cases, before the node is online and ready to contribute to the compute needs. To avoid this delay, you create a pool of nodes that are already online and waiting for work allocations. Keep in mind, however, that you pay for these nodes while they wait, so you need to take actions to manage them as optimally as possible.
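As a rough illustration of the settings that govern a pool of standby nodes, the following sketch builds a request body of the kind accepted by the Databricks Instance Pools REST API. The field names follow that API as best understood here; the pool name, VM size, and numeric values are hypothetical and chosen only for the example, and nothing is sent over the network.

```python
import json


def build_pool_spec(name, node_type_id, min_idle, max_capacity, idle_timeout_minutes):
    """Build a request body describing a pool of standby nodes.

    min_idle is the number of nodes kept warm at all times, max_capacity caps
    the total number of nodes (idle plus in use), and idle_timeout_minutes
    controls how long surplus idle nodes are kept before being released.
    """
    if min_idle > max_capacity:
        raise ValueError("min_idle cannot exceed max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        "max_capacity": max_capacity,
        "idle_instance_autotermination_minutes": idle_timeout_minutes,
    }


# Hypothetical values: two warm nodes, at most ten in total,
# surplus idle nodes released after 30 minutes.
pool_spec = build_pool_spec("etl-warm-pool", "Standard_DS3_v2", 2, 10, 30)
print(json.dumps(pool_spec, indent=2))
```

The minimum idle count and the idle timeout are the two values to tune when balancing startup time against the cost of nodes that sit waiting.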

Cluster Policies

Controlling cost, security, and permissions is important. Azure includes a feature called Azure Policy that gives the subscription owner the means to control how products are configured from a security and cost perspective. The cluster policies feature provides a similar capability for Azure Databricks clusters: it lets you control the size and the allowed configurations of a cluster, meaning you can control which components must or must not exist on the cluster and/or the maximum supported worker size.
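To make the idea of an allowed configuration concrete, here is a minimal sketch of a policy definition together with a simplified local check. The rule types (`allowlist`, `range`, `fixed`) and the `maxValue` key mirror the Databricks cluster policy definition format as best understood here; the VM sizes and limits are hypothetical, and `find_violations` is an illustrative helper, not how the service itself evaluates a policy.

```python
import json

# Hypothetical policy: restrict VM sizes, cap the worker count,
# and force clusters to shut down after 30 idle minutes.
policy_definition = {
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"]},
    "num_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}


def find_violations(cluster_spec, definition):
    """Return human-readable violations of the policy (simplified local check)."""
    problems = []
    for attribute, rule in definition.items():
        value = cluster_spec.get(attribute)
        if rule["type"] == "allowlist":
            if value not in rule["values"]:
                problems.append(f"{attribute}={value!r} is not in the allowlist")
        elif rule["type"] == "range":
            if value is None or value > rule["maxValue"]:
                problems.append(f"{attribute}={value!r} exceeds maxValue {rule['maxValue']}")
        elif rule["type"] == "fixed":
            if value != rule["value"]:
                problems.append(f"{attribute}={value!r} must equal {rule['value']!r}")
    return problems


compliant = {"node_type_id": "Standard_DS3_v2", "num_workers": 4, "autotermination_minutes": 30}
oversized = {"node_type_id": "Standard_DS5_v2", "num_workers": 20, "autotermination_minutes": 30}

print(find_violations(compliant, policy_definition))
print(find_violations(oversized, policy_definition))

# When a policy is submitted to the service, the definition is sent as a JSON string.
policy_request = {"name": "small-clusters-only", "definition": json.dumps(policy_definition)}
```

In this simplified check a missing attribute counts as a violation; the real service may instead apply a fixed value on the user's behalf.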

Jobs

The Jobs section provides an interface for viewing and managing the existing jobs in the workspace. The page renders details such as the name of the job, the ID of the job, who created the job, the task, and the cluster that is bound to the job. You can also delete a job or execute a job manually, as shown in Figure 3.74.

When you click the job name, the details of the selected job appear. Notice that the allocated cluster is all‐purpose. If you click the Swap button, you are given the opportunity to switch it to a job cluster. You can also view the execution history of the job by selecting the Job Runs tab. The content on the Job Runs tab lets you drill down into each run to view its logs, metrics, and output.
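The effect of swapping a job from an all‐purpose cluster to a job cluster can be sketched in terms of a task definition. This assumes the shape used by the Databricks Jobs REST API, where a task references a running all‐purpose cluster through `existing_cluster_id` or describes a job cluster through `new_cluster`. The task key, notebook path, cluster ID, and cluster sizing below are hypothetical, and `swap_to_job_cluster` is an illustrative helper, not part of any Databricks library.

```python
import copy

# A task bound to an existing all-purpose cluster (hypothetical ID and path).
task_on_all_purpose = {
    "task_key": "ingest",
    "notebook_task": {"notebook_path": "/Shared/ingest"},
    "existing_cluster_id": "0000-000000-example",
}

# A job cluster is described inline and created for the run (example sizing).
job_cluster_spec = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
}


def swap_to_job_cluster(task, new_cluster):
    """Return a copy of the task that runs on a job cluster instead."""
    swapped = copy.deepcopy(task)
    # A task targets one cluster or the other, so drop the all-purpose reference.
    swapped.pop("existing_cluster_id", None)
    swapped["new_cluster"] = copy.deepcopy(new_cluster)
    return swapped


task_on_job_cluster = swap_to_job_cluster(task_on_all_purpose, job_cluster_spec)
print(sorted(task_on_job_cluster))
```

The original task dictionary is left unchanged, so the all‐purpose binding can still be used for interactive testing.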
