
L20-Use Delta Lake in Azure Databricks

Harini Mallawaarachchi

Delta Lake is an open-source project that provides a transactional data storage layer for Spark on top of a data lake. It adds relational semantics to both batch and streaming data operations, and enables a Lakehouse architecture in which Apache Spark processes and queries data in tables that are based on underlying files in the data lake.
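To make the idea of "tables based on underlying files" concrete, here is a minimal sketch in PySpark. It assumes it runs in a Databricks notebook, where a SparkSession named spark is already available and the Delta format is built in. The function name, the sample rows, and the storage path are hypothetical and used only for illustration.

```python
def write_and_read_delta(spark, path):
    """Save a small DataFrame in Delta format, then load it back.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    `path` is a folder in the data lake; the value used is up to the caller.
    """
    # Hypothetical sample rows, used only to have something to store.
    rows = [(1, "Widget", 2.50), (2, "Gadget", 9.99)]
    df = spark.createDataFrame(rows, ["ProductID", "ProductName", "Price"])

    # Writing with format("delta") stores data files plus a transaction log
    # in the target folder.
    df.write.format("delta").mode("overwrite").save(path)

    # Reading with format("delta") returns the current version of the table.
    return spark.read.format("delta").load(path)
```

In a notebook cell this might be called as write_and_read_delta(spark, "/delta/products-demo"), where the path is again only an example.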



Before you start

You'll need an Azure subscription in which you have administrative-level access.

Review the Introduction to Delta Technologies article in the Azure Synapse Analytics documentation.



Create a cluster

Azure Databricks is a distributed processing platform that uses Apache Spark clusters to process data in parallel on multiple nodes. Each cluster consists of a driver node to coordinate the work, and worker nodes to perform processing tasks.

Tip: If you already have a cluster with a 13.3 LTS runtime version in your Azure Databricks workspace, you can use it to complete this exercise and skip this procedure.
  1. In the Azure portal, browse to the dp203-xxxxxxx resource group that was created by the setup script (or the resource group containing your existing Azure Databricks workspace).

  2. Select your Azure Databricks Service resource (named databricksxxxxxxx if you used the setup script to create it).

  3. In the Overview page for your workspace, use the Launch Workspace button to open your Azure Databricks workspace in a new browser tab, signing in if prompted.

  4. View the Azure Databricks workspace portal and note that the sidebar on the left side contains icons for the various tasks you can perform.

  5. Select the (+) New task, and then select Cluster.

  6. In the New Cluster page, create a new cluster with the following settings:

  • Cluster name: User Name's cluster (the default cluster name)

  • Cluster mode: Single Node

  • Access mode: Single user (with your user account selected)

  • Databricks runtime version: 13.3 LTS (Spark 3.4.1, Scala 2.12)

  • Use Photon Acceleration: Selected

  • Node type: Standard_DS3_v2

  • Terminate after 30 minutes of inactivity

  7. Wait for the cluster to be created. It may take a minute or two.

Note: If your cluster fails to start, your subscription may have insufficient quota in the region where your Azure Databricks workspace is provisioned. See CPU core limit prevents cluster creation for details. If this happens, you can try deleting your workspace and creating a new one in a different region. You can specify a region as a parameter for the setup script like this: ./setup.ps1 eastus
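For readers who prefer to script this step, the cluster settings listed above can be sketched as a JSON-style definition. This is an illustration only, not part of the lab: the field names follow the Databricks Clusters REST API as I understand it, the runtime key "13.3.x-scala2.12" is an assumed identifier for 13.3 LTS, and the helper name and example email address are hypothetical. Check the values against your own workspace before relying on them.

```python
import json


def build_cluster_spec(user_email, cluster_name="User Name's cluster"):
    """Return a cluster definition mirroring the New Cluster settings above."""
    return {
        "cluster_name": cluster_name,
        # Assumed runtime key for 13.3 LTS (Spark 3.4.1, Scala 2.12).
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        # Single Node mode: no workers, Spark runs locally on the driver.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
        # Access mode: Single user, tied to the given account.
        "data_security_mode": "SINGLE_USER",
        "single_user_name": user_email,
        # Use Photon Acceleration: Selected.
        "runtime_engine": "PHOTON",
        # Terminate after 30 minutes of inactivity.
        "autotermination_minutes": 30,
    }


if __name__ == "__main__":
    # Hypothetical account name, for illustration only.
    print(json.dumps(build_cluster_spec("user@example.com"), indent=2))
```

Keeping the definition as plain data makes it easy to compare, setting by setting, with what the New Cluster page shows.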

Explore Delta Lake using a notebook



