L22-Automate an Azure Databricks Notebook with Azure Data Factory
- Harini Mallawaarachchi
- Dec 22, 2023
- 1 min read
You can use notebooks in Azure Databricks to perform data engineering tasks, such as processing data files and loading data into tables. When you need to orchestrate these tasks as part of a data engineering pipeline, you can use Azure Data Factory.
Before you start
You'll need an Azure subscription in which you have administrative-level access.
Review the What is Azure Data Factory? article in the Azure Data Factory documentation.
Import a notebook
You can create notebooks in your Azure Databricks workspace to run code written in a range of programming languages. In this exercise, you'll import an existing notebook that contains some Python code.
In the Azure portal, browse to the dp203-xxxxxxx resource group that was created by the script (or the resource group containing your existing Azure Databricks workspace)
Select your Azure Databricks Service resource (named databricksxxxxxxx if you used the setup script to create it).
In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab, signing in if prompted.
View the Azure Databricks workspace portal and note that the sidebar on the left side contains icons for the various tasks you can perform.
In the sidebar on the left, select Workspace. Then select the ⌂ Home folder.
At the top of the page, in the ⋮ menu next to your user name, select Import. Then in the Import dialog box, select URL and import the notebook from https://github.com/MicrosoftLearning/dp-203-azure-data-engineer/raw/master/Allfiles/labs/27/Process-Data.ipynb
Review the contents of the notebook, which include some Python code cells to do the following (a rough sketch of these steps follows the list):
Retrieve a parameter named folder if it has been passed (otherwise use a default value of data).
Download data from GitHub and save it in the specified folder in the Databricks File System (DBFS).
Exit the notebook, returning the path where the data was saved as an output.
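The cells might look something like this minimal sketch (the download URL and file name are hypothetical placeholders; the actual Process-Data.ipynb may differ in detail):

    # Cell 1: read the 'folder' parameter if it was passed, otherwise default to 'data'
    dbutils.widgets.text("folder", "data")
    folder = dbutils.widgets.get("folder")

    # Cell 2: download a data file from GitHub and save it to DBFS in the specified folder
    import urllib.request
    url = "https://raw.githubusercontent.com/.../products.csv"  # hypothetical source URL
    local_path = "/tmp/products.csv"
    urllib.request.urlretrieve(url, local_path)
    dbutils.fs.mkdirs(f"dbfs:/{folder}")
    dbutils.fs.cp(f"file:{local_path}", f"dbfs:/{folder}/products.csv")

    # Cell 3: exit the notebook, returning the DBFS path as its output
    dbutils.notebook.exit(f"dbfs:/{folder}/products.csv")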
Enable Azure Databricks integration with Azure Data Factory
To use Azure Databricks from an Azure Data Factory pipeline, you need to create a linked service in Azure Data Factory that enables access to your Azure Databricks workspace.
Generate an access token
In the Azure Databricks portal, in the top right menu bar, select your username and then select User Settings from the drop-down.
In the User Settings page, select Developer. Then next to Access tokens select Manage.
Select Generate new token and generate a new token with the comment Data Factory and a blank lifetime (so the token doesn't expire). Be careful to copy the token when it is displayed before selecting Done.
Paste the copied token to a text file so you have it handy for later in this exercise.
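If you prefer to script this step, a personal access token can also be created with the Databricks token REST API. A minimal sketch in Python (it assumes you already have some working authentication to bootstrap from, and the workspace URL is a placeholder):

    import requests

    workspace_url = "https://adb-0000000000000000.0.azuredatabricks.net"  # your workspace URL
    existing_token = "<existing-personal-access-token>"  # any credential you already have

    response = requests.post(
        f"{workspace_url}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {existing_token}"},
        json={"comment": "Data Factory"},  # omit lifetime_seconds so the token doesn't expire
    )
    response.raise_for_status()
    print(response.json()["token_value"])  # copy this value for the linked service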
Create a linked service in Azure Data Factory
Return to the Azure portal, and in the dp203-xxxxxxx resource group, select the adfxxxxxxx Azure Data Factory resource.
On the Overview page, select Launch studio to open Azure Data Factory Studio. Sign in if prompted.
In Azure Data Factory Studio, use the >> icon to expand the navigation pane on the left. Then select the Manage page.
On the Manage page, in the Linked services tab, select + New to add a new linked service.
In the New linked service pane, select the Compute tab at the top. Then select Azure Databricks.
Continue, and create the linked service with the following settings (a scripted equivalent is sketched after the list):
Name: AzureDatabricks
Description: Azure Databricks workspace
Connect via integration runtime: AutoResolveIntegrationRuntime
Account selection method: From Azure subscription
Azure subscription: Select your subscription
Databricks workspace: Select your databricksxxxxxxx workspace
Select cluster: New job cluster
Databricks Workspace URL: Automatically set to your Databricks workspace URL
Authentication type: Access token
Access token: Paste your access token
Cluster version: 13.3 LTS (Spark 3.4.1, Scala 2.12)
Cluster node type: Standard_DS3_v2
Python version: 3
Worker options: Fixed
Workers: 1
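For reference, the same linked service could be created with the azure-mgmt-datafactory Python SDK rather than the Studio UI. This is only a minimal sketch; the resource names and workspace URL are hypothetical placeholders, and exact model names can vary slightly between SDK versions:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureDatabricksLinkedService,
        LinkedServiceResource,
        SecureString,
    )

    subscription_id = "<subscription-id>"
    resource_group = "dp203-xxxxxxx"
    factory_name = "adfxxxxxxx"

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    linked_service = LinkedServiceResource(
        properties=AzureDatabricksLinkedService(
            description="Azure Databricks workspace",
            domain="https://adb-0000000000000000.0.azuredatabricks.net",  # your workspace URL
            access_token=SecureString(value="<databricks-access-token>"),
            # New job cluster settings matching the values chosen in the UI
            new_cluster_version="13.3.x-scala2.12",
            new_cluster_node_type="Standard_DS3_v2",
            new_cluster_num_of_worker="1",
        )
    )

    client.linked_services.create_or_update(
        resource_group, factory_name, "AzureDatabricks", linked_service
    )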
Use a pipeline to run the Azure Databricks notebook
Now that you have created a linked service, you can use it in a pipeline to run the notebook you viewed previously.
Create a pipeline
In Azure Data Factory Studio, in the navigation pane, select Author.
On the Author page, in the Factory Resources pane, use the + icon to add a Pipeline.
In the Properties pane for the new pipeline, change its name to Process Data with Databricks. Then use the Properties button (which looks similar to 🗏*) on the right end of the toolbar to hide the Properties pane.
In the Activities pane, expand Databricks and drag a Notebook activity to the pipeline designer surface.
With the new Notebook1 activity selected, set the following properties in the bottom pane:
General:
Name: Process Data
Azure Databricks:
Databricks linked service: Select the AzureDatabricks linked service you created previously
Settings:
Notebook path: Browse to the Users/your_user_name folder and select the Process-Data notebook
Base parameters: Add a new parameter named folder with the value product_data
Use the Validate button above the pipeline designer surface to validate the pipeline. Then use the Publish all button to publish (save) it.
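As with the linked service, the pipeline and its Notebook activity could also be defined programmatically. A minimal sketch with the azure-mgmt-datafactory Python SDK, using hypothetical placeholder names (adjust the notebook path to your own user folder):

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DatabricksNotebookActivity,
        LinkedServiceReference,
        PipelineResource,
    )

    subscription_id = "<subscription-id>"
    resource_group = "dp203-xxxxxxx"
    factory_name = "adfxxxxxxx"

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # A Notebook activity that runs Process-Data through the AzureDatabricks linked service
    notebook_activity = DatabricksNotebookActivity(
        name="Process Data",
        notebook_path="/Users/<your_user_name>/Process-Data",
        base_parameters={"folder": "product_data"},
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AzureDatabricks"
        ),
    )

    pipeline = PipelineResource(activities=[notebook_activity])

    client.pipelines.create_or_update(
        resource_group, factory_name, "Process Data with Databricks", pipeline
    )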
Run the pipeline
Above the pipeline designer surface, select Add trigger, and then select Trigger now.
In the Pipeline run pane, select OK to run the pipeline.
In the navigation pane on the left, select Monitor and observe the Process Data with Databricks pipeline on the Pipeline runs tab. It may take a while to run as it dynamically creates a Spark cluster and runs the notebook. You can use the ↻ Refresh button on the Pipeline runs page to refresh the status.
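The trigger-and-monitor flow can also be scripted. A minimal sketch with the same SDK and the same hypothetical placeholder names:

    import time

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    subscription_id = "<subscription-id>"
    resource_group = "dp203-xxxxxxx"
    factory_name = "adfxxxxxxx"

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Equivalent of "Trigger now"
    run = client.pipelines.create_run(
        resource_group, factory_name, "Process Data with Databricks"
    )

    # Poll until the run reaches a terminal state
    while True:
        status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
        print(status)
        if status not in ("Queued", "InProgress", "Canceling"):
            break
        time.sleep(30)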
When the run succeeds, select its name to view the run details. Then, on the Process Data with Databricks page, in the Activity Runs section, select the Process Data activity and use its output icon to view the output JSON from the activity, which should resemble this:
    {
        "runPageUrl": "https://adb-..../run/...",
        "runOutput": "dbfs:/product_data/products.csv",
        "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (East US)",
        "executionDuration": 61,
        "durationInQueue": {
            "integrationRuntimeQueue": 0
        },
        "billingReference": {
            "activityType": "ExternalActivity",
            "billableDuration": [
                {
                    "meterType": "AzureIR",
                    "duration": 0.03333333333333333,
                    "unit": "Hours"
                }
            ]
        }
    }
Note the runOutput value, which is the path variable to which the notebook saved the data.
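A downstream step could pick up that path and read the data it points to, for example from a Databricks notebook (a sketch, assuming the path shown in the output above):

    # Read the CSV file the notebook saved, using the path returned in runOutput
    df = spark.read.option("header", "true").csv("dbfs:/product_data/products.csv")
    display(df)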