L19-Use Spark in Azure Databricks

Harini Mallawaarachchi
Dec 22, 2023
1 min read

Azure Databricks is a Microsoft Azure-based version of the popular open-source Databricks platform. Azure Databricks is built on Apache Spark, and offers a highly scalable solution for data engineering and analysis tasks that involve working with data in files. One of the benefits of Spark is support for a wide range of programming languages, including Java, Scala, Python, and SQL; making Spark a very flexible solution for data processing workloads including data cleansing and manipulation, statistical analysis and machine learning, and data analytics and visualization.

Before you start

You'll need an Azure subscription in which you have administrative-level access.

Review the Exploratory data analysis on Azure Databricks article in the Azure Synapse Analytics documentation.

Create a cluster

Azure Databricks is a distributed processing platform that uses Apache Spark clusters to process data in parallel on multiple nodes. Each cluster consists of a driver node to coordinate the work, and worker nodes to perform processing tasks.

Tip: If you already have a cluster with a 13.3 LTS runtime version in your Azure Databricks workspace, you can use it to complete this exercise and skip this procedure.

In the Azure portal, browse to the dp203-xxxxxxx resource group that was created by the script (or the resource group containing your existing Azure Databricks workspace)
Select your Azure Databricks Service resource (named databricksxxxxxxx if you used the setup script to create it).
In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab; signing in if prompted.
View the Azure Databricks workspace portal and note that the sidebar on the left side contains icons for the various tasks you can perform.
Select the (+) New task, and then select Cluster.
In the New Cluster page, create a new cluster with the following settings:

Cluster name: User Name's cluster (the default cluster name)
Cluster mode: Single Node
Access mode: Single user (with your user account selected)
Databricks runtime version: 13.3 LTS (Spark 3.4.1, Scala 2.12)
Use Photon Acceleration: Selected
Node type: Standard_DS3_v2
Terminate after 30 minutes of inactivity

Wait for the cluster to be created. It may take a minute or two.

Note: If your cluster fails to start, your subscription may have insufficient quota in the region where your Azure Databricks workspace is provisioned. See CPU core limit prevents cluster creation for details. If this happens, you can try deleting your workspace and creating a new one in a different region. You can specify a region as a parameter for the setup script like this: ./setup.ps1 eastus

Explore data using a notebook

As in many Spark environments, Databricks supports the use of notebooks to combine notes and interactive code cells that you can use to explore data.

In the Azure Databricks workspace portal for your workspace, in the sidebar on the left, select Workspace. Then select the ⌂ Home folder.
At the top of the page, in the ⋮ menu next to your user name, select Import. Then in the Import dialog box, select URL and import the notebook from https://github.com/MicrosoftLearning/dp-203-azure-data-engineer/raw/master/Allfiles/labs/24/Databricks-Spark.ipynb

Connect the notebook to your cluster, and follow the instructions it contains; running the cells it contains to explore data in files.

Ingest data

%sh
rm -r /dbfs/data
mkdir /dbfs/data

wget -O /dbfs/data/2019.csv https://raw.githubusercontent.com/MicrosoftLearning/dp-203-azure-data-engineer/master/Allfiles/labs/24/data/2019.csv

wget -O /dbfs/data/2020.csv https://raw.githubusercontent.com/MicrosoftLearning/dp-203-azure-data-engineer/master/Allfiles/labs/24/data/2020.csv

wget -O /dbfs/data/2021.csv https://raw.githubusercontent.com/MicrosoftLearning/dp-203-azure-data-engineer/master/Allfiles/labs/24/data/2021.csv

Query data in files

The previous cell downloaded three comma-separated values (CSV) files to the data folder in the DBFS storage for your workspace.

Run the following cell to load the data from the file and view the first 100 rows.

df = spark.read.load('data/*.csv', format='csv')display(df.limit(100))

The data in the file relates to sales orders, but doesn't include the column headers or information about the data types. To make more sense of the data, you can define a schema for the dataframe.

from pyspark.sql.types import *from pyspark.sql.functions import *orderSchema = StructType([     StructField("SalesOrderNumber", StringType()),     StructField("SalesOrderLineNumber", IntegerType()),     StructField("OrderDate", DateType()),     StructField("CustomerName", StringType()),     StructField("Email", StringType()),     StructField("Item", StringType()),     StructField("Quantity", IntegerType()),     StructField("UnitPrice", FloatType()),     StructField("Tax", FloatType()) ])df = spark.read.load('/data/*.csv', format='csv', schema=orderSchema)display(df.limit(100))

This time the data includes the column headers.

To verify that the appropriate data types have been defined, you can view the schema of the dataframe.

df.printSchema()

Analyze data in a dataframe

The dataframe object in Spark is similar to a Pandas dataframe in Python, and includes a wide range of functions that you can use to manipulate, filter, group, and otherwise analyze the data it contains.

Filter a dataframe

Run the following cell to:

Filter the columns of the sales orders dataframe to include only the customer name and email address.
Count the total number of order records
Count the number of distinct customers
Display the customers

customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

Observe the following details:

When you perform an operation on a dataframe, the result is a new dataframe (in this case, a new customers dataframe is created by selecting a specific subset of columns from the df dataframe)
Dataframes provide functions such as count and distinct that can be used to summarize and filter the data they contain.
The dataframe['Field1', 'Field2', ...] syntax is a shorthand way of defining a subset of column. You can also use select method, so the first line of the code above could be written as customers = df.select("CustomerName", "Email")

Now let's apply a filter to include only the customers who have placed an order for a specific product:

customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

Note that you can “chain” multiple functions together so that the output of one function becomes the input for the next - in this case, the dataframe created by the select method is the source dataframe for the where method that is used to apply filtering criteria.

Aggregate and group data in a dataframe

Run the following cell to aggregate and group the order data.

productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)

Note that the results show the sum of order quantities grouped by product. The groupBy method groups the rows by Item, and the subsequent sum aggregate function is applied to all of the remaining numeric columns (in this case, Quantity)

Let's try another aggregation.

yearlySales = df.select(year("OrderDate").alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)

This time the results show the number of sales orders per year. Note that the select method includes a SQL year function to extract the year component of the OrderDate field, and then an alias method is used to assign a columm name to the extracted year value. The data is then grouped by the derived Year column and the count of rows in each group is calculated before finally the orderBy method is used to sort the resulting dataframe.

Query data using Spark SQL

As you’ve seen, the native methods of the dataframe object enable you to query and analyze data quite effectively. However, many data analysts are more comfortable working with SQL syntax. Spark SQL is a SQL language API in Spark that you can use to run SQL statements, or even persist data in relational tables.

Use Spark SQL in PySpark code

The default language in Azure Synapse Studio notebooks is PySpark, which is a Spark-based Python runtime. Within this runtime, you can use the spark.sql library to embed Spark SQL syntax within your Python code, and work with SQL constructs such as tables and views.

df.createOrReplaceTempView("salesorders")

spark_df = spark.sql("SELECT * FROM salesorders")
display(spark_df)

Observe that:

The code persists the data in the df dataframe as a temporary view named salesorders. Spark SQL supports the use of temporary views or persisted tables as sources for SQL queries.
The spark.sql method is then used to run a SQL query against the salesorders view.
The results of the query are stored in a dataframe.

Run SQL code in a cell

While it’s useful to be able to embed SQL statements into a cell containing PySpark code, data analysts often just want to work directly in SQL.

%sql
SELECT YEAR(OrderDate) AS OrderYear,
    SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;

Observe that:

The ``%sql` line at the beginning of the cell (called a magic) indicates that the Spark SQL language runtime should be used to run the code in this cell instead of PySpark.
The SQL code references the salesorder view that you created previously using PySpark.
The output from the SQL query is automatically displayed as the result under the cell.

Note: For more information about Spark SQL and dataframes, see the Spark SQL documentation.

Visualize data with Spark

A picture is proverbially worth a thousand words, and a chart is often better than a thousand rows of data. While notebooks in Azure Databricks include support for visualizing data from a dataframe or Spark SQL query, it is not designed for comprehensive charting. However, you can use Python graphics libraries like matplotlib and seaborn to create charts from data in dataframes.

View results as a visualization

Run the following cell to query the salesorders table

%sql
SELECT * FROM salesorders

Above the table of results, select + and then select Visualization to view the visualization editor, and then apply the following options: - Visualization type: Bar - X Column: Item - Y Column: Add a new column and select Quantity. Apply the Sum aggregation.

Save the visualization and then re-run the code cell to view the resulting chart in the notebook.

Get started with matplotlib

You can get more control over data visualizations by using graphics libraries.

Run the following cell to retrieve some sales order data into a dataframe.

sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
             SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
         FROM salesorders \
         GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
         ORDER BY OrderYear"df_spark = spark.sql(sqlQuery)
df_spark.show()

To visualize the data as a chart, we’ll start by using the matplotlib Python library. This library is the core plotting library on which many others are based, and provides a great deal of flexibility in creating charts.

from matplotlib import pyplot as plt

# matplotlib requires a Pandas dataframe, not a Spark one
df_sales = df_spark.toPandas()

# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])

# Display the plot
plt.show()

Review the results, which consist of a column chart with the total gross revenue for each year. Note the following features of the code used to produce this chart:

The matplotlib library requires a Pandas dataframe, so you need to convert the Spark dataframe returned by the Spark SQL query to this format.
At the core of the matplotlib library is the pyplot object. This is the foundation for most plotting functionality.
The default settings result in a usable chart, but there’s considerable scope to customize it, as you'll see by running the following cell.

# Clear the plot area
plt.clf()

# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')

# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)

# Show the figure
plt.show()

A plot is technically contained with a Figure. In the previous examples, the figure was created implicitly for you; but you can create it explicitly.

# Clear the plot area
plt.clf()

# Create a Figure
fig = plt.figure(figsize=(8,3))

# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')

# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)

# Show the figure
plt.show()

A figure can contain multiple subplots, each on its own axis.

# Clear the plot area
plt.clf()

# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of revenue by year on the first axis
ax[0].bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
ax[0].set_title('Revenue by Year')

# Create a pie chart of yearly order counts on the second axis
yearly_counts = df_sales['OrderYear'].value_counts()
ax[1].pie(yearly_counts)
ax[1].set_title('Orders per Year')
ax[1].legend(yearly_counts.keys().tolist())

# Add a title to the Figure
fig.suptitle('Sales Data')

# Show the figure
plt.show()

Note: To learn more about plotting with matplotlib, see the matplotlib documentation.

Use the seaborn library

While matplotlib enables you to create complex charts of multiple types, it can require some complex code to achieve the best results. For this reason, over the years, many new libraries have been built on the base of matplotlib to abstract its complexity and enhance its capabilities. One such library is seaborn.

import seaborn as sns

# Clear the plot area
plt.clf()

# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

The seaborn library makes it simpler to create complex plots of statistical data, and enables you to control the visual theme for consistent data visualizations.

# Clear the plot area
plt.clf()

# Set the visual theme for seaborn
sns.set_theme(style="whitegrid")

# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

Both matplotlib and seaborn support multiple charts types. For example, run the following cell to view the yearly sales totals as a line chart.

# Clear the plot area
plt.clf()

# Create a bar chart
ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()

Note: To learn more about plotting with seaborn, see the seaborn documentation.

In this notebook, you've explored some basic techniques for using Spark to explore data in files. To learn more about working with Dataframes in Azure Databricks using PySpark, see Introduction to DataFrames - Python in the Azure Databricks documentation.