
L18-Explore Azure Databricks

  • Writer: Harini Mallawaarachchi
  • Dec 22, 2023
  • 1 min read

Azure Databricks is a Microsoft Azure-based version of the popular open-source Databricks platform.

Similarly to Azure Synapse Analytics, an Azure Databricks workspace provides a central point for managing Databricks clusters, data, and resources on Azure.



Before you start

You'll need an Azure subscription in which you have administrative-level access.

Review the What is Azure Databricks? article in the Azure Databricks documentation.


Create a cluster

Azure Databricks is a distributed processing platform that uses Apache Spark clusters to process data in parallel on multiple nodes. Each cluster consists of a driver node to coordinate the work, and worker nodes to perform processing tasks.

In this exercise, you'll create a single-node cluster to minimize the compute resources used in the lab environment (in which resources may be constrained). In a production environment, you'd typically create a cluster with multiple worker nodes.
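
For example, when you later run code in a notebook attached to the cluster, the driver splits the work into tasks that run on the available cores. The following is a minimal sketch you could run in a notebook cell to see this (the spark object is predefined in Databricks notebooks):

# A small sketch to illustrate driver/worker behaviour (run in a notebook cell attached to the cluster).
# On a single-node cluster, the driver also executes the worker tasks.
rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)  # split the data into 4 partitions

print("Default parallelism:", spark.sparkContext.defaultParallelism)
print("Number of partitions:", rdd.getNumPartitions())
print("Sum computed across partition tasks:", rdd.sum())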

Tip: If you already have a cluster with a 13.3 LTS runtime version in your Azure Databricks workspace, you can use it to complete this exercise and skip this procedure.
  1. In the Azure portal, browse to the dp203-xxxxxxx resource group that was created by the setup script (or the resource group containing your existing Azure Databricks workspace).

  2. Select your Azure Databricks Service resource (named databricksxxxxxxx if you used the setup script to create it).

  3. In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab, signing in if prompted.


Tip: As you use the Databricks Workspace portal, various tips and notifications may be displayed. Dismiss these and follow the instructions provided to complete the tasks in this exercise.

  4. View the Azure Databricks workspace portal and note that the sidebar on the left side contains links for the various types of tasks you can perform.

  5. Select the (+) New link in the sidebar, and then select Cluster.


  6. In the New Cluster page, create a new cluster with the following settings (a rough code-based equivalent of this configuration is sketched after the note that follows this procedure):

  • Cluster name: User Name's cluster (the default cluster name)

  • Cluster mode: Single Node

  • Access mode: Single user (with your user account selected)

  • Databricks runtime version: 13.3 LTS (Spark 3.4.1, Scala 2.12)

  • Use Photon Acceleration: Selected

  • Node type: Standard_DS3_v2

  • Terminate after 30 minutes of inactivity

  7. Wait for the cluster to be created. It may take a minute or two.

Note: If your cluster fails to start, your subscription may have insufficient quota in the region where your Azure Databricks workspace is provisioned. See CPU core limit prevents cluster creation for details. If this happens, you can try deleting your workspace and creating a new one in a different region. You can specify a region as a parameter for the setup script like this: ./setup.ps1 eastus
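
For reference, the cluster settings above can also be expressed as a request to the Databricks Clusters REST API. The following is a rough sketch only; the workspace URL and token are hypothetical placeholders, and field names can vary between API versions:

import requests

# Rough sketch: create a single-node cluster roughly matching the settings above.
# WORKSPACE_URL and TOKEN are hypothetical placeholders; replace them with your workspace URL
# and a personal access token if you try something like this.
WORKSPACE_URL = "https://adb-0000000000000000.0.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "My single-node cluster",
    "spark_version": "13.3.x-scala2.12",      # 13.3 LTS runtime
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,                          # single node: the driver does all the work
    "autotermination_minutes": 30,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())  # a successful call returns the new cluster_id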



Use Spark to analyze a data file

As in many Spark environments, Databricks supports the use of notebooks to combine notes and interactive code cells that you can use to explore data.

  1. In the sidebar, use the (+) New link to create a Notebook.

  2. Change the default notebook name (Untitled Notebook [date]) to Explore products and in the Connect drop-down list, select your cluster if it is not already selected. If the cluster is not running, it may take a minute or so to start.

  3. Download the products.csv file to your local computer, saving it as products.csv. Then, in the Explore products notebook, on the File menu, select Upload data to DBFS.

  4. In the Upload Data dialog box, note the DBFS Target Directory to which the file will be uploaded. Then select the Files area, and upload the products.csv file you downloaded to your computer. When the file has been uploaded, select Next.

  5. In the Access files from notebooks pane, select the sample PySpark code and copy it to the clipboard. You will use it to load the data from the file into a DataFrame. Then select Done.

  6. In the Explore products notebook, in the empty code cell, paste the code you copied, which should look similar to this:

df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/user@outlook.com/products.csv")
  7. Use the ▸ Run Cell menu option at the top-right of the cell to run it, starting and attaching the cluster if prompted.

  8. Wait for the Spark job run by the code to complete. The code has created a dataframe object named df1 from the data in the file you uploaded.
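
At this point you can optionally run a quick check on the dataframe. Here is a minimal sketch, assuming df1 was created by the cell above:

# Optional checks on the loaded data (a sketch, assuming df1 was created by the previous cell).
df1.printSchema()            # column names and types (all strings, since no schema was supplied)
print(df1.count(), "rows")   # number of product records loaded
df1.show(5)                  # preview the first five rows as text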

  9. Under the existing code cell, use the + icon to add a new code cell. Then in the new cell, enter the following code:

display(df1)
  10. Use the ▸ Run Cell menu option at the top-right of the new cell to run it. This code displays the contents of the dataframe as a table of results.


  11. Above the table of results, select + and then select Visualization to view the visualization editor.


Then apply the following options:

  • Visualization type: Bar

  • X Column: Category

  • Y Column: Add a new column and select ProductID. Apply the Count aggregation.

Save the visualization and observe that it is displayed in the notebook.
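
The chart summarizes the number of products in each category. If you prefer to compute the same summary in code, a minimal PySpark equivalent (assuming the df1 dataframe from the earlier cell) looks like this:

from pyspark.sql.functions import count

# Count the products in each category - the same aggregation the bar chart visualizes.
category_counts = df1.groupBy("Category").agg(count("ProductID").alias("ProductCount"))
display(category_counts.orderBy("Category"))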



Create and query a table

While many data analysts are comfortable using languages like Python or Scala to work with data in files, a lot of data analytics solutions are built on relational databases, in which data is stored in tables and manipulated using SQL.

  1. In the Explore products notebook, under the chart output from the previously run code cell, use the + icon to add a new cell.

  2. Enter and run the following code in the new cell:

df1.write.saveAsTable("products")
  3. When the cell has completed, add a new cell under it with the following code:

%sql

SELECT ProductName, ListPrice
FROM products
WHERE Category = 'Touring Bikes';
  4. Run the new cell, which contains SQL code to return the name and price of products in the Touring Bikes category.

  5. In the sidebar, select the Catalog link, and verify that the products table has been created in the default database schema (which is unsurprisingly named default). It's possible to use Spark code to create custom database schemas containing relational tables that data analysts can use to explore data and generate analytical reports.
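
For example, the same query can be issued from Python with spark.sql, and you can create a custom schema to hold curated tables. The following is a minimal sketch; the adventureworks schema name is just an illustration:

# Run the same query from Python (a sketch; the schema name below is only an illustration).
touring_bikes = spark.sql(
    "SELECT ProductName, ListPrice FROM products WHERE Category = 'Touring Bikes'"
)
display(touring_bikes)

# Create a custom schema (database) and save a curated copy of the results into it.
spark.sql("CREATE SCHEMA IF NOT EXISTS adventureworks")
touring_bikes.write.saveAsTable("adventureworks.touring_bikes")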



