top of page

Transform files using a serverless SQL pool

Writer's picture: Harini MallawaarachchiHarini Mallawaarachchi

Data analysts often use SQL to query data for analysis and reporting. Data engineers can also make use of SQL to manipulate and transform data; often as part of a data ingestion pipeline or extract, transform, and load (ETL) process.

In this exercise, you'll use a serverless SQL pool in Azure Synapse Analytics to transform data in files.



Before you start

You'll need an Azure subscription in which you have administrative-level access.


Query data in files

The script provisions an Azure Synapse Analytics workspace and an Azure Storage account to host the data lake, then uploads some data files to the data lake.


View files in the data lake

  1. After the script has completed, in the Azure portal, go to the dp203-xxxxxxx resource group that it created, and select your Synapse workspace.

  2. In the Overview page for your Synapse workspace, in the Open Synapse Studio card, select Open to open Synapse Studio in a new browser tab; signing in if prompted.

  3. On the left side of Synapse Studio, use the ›› icon to expand the menu - this reveals the different pages within Synapse Studio that you'll use to manage resources and perform data analytics tasks.

  4. On the Data page, view the Linked tab and verify that your workspace includes a link to your Azure Data Lake Storage Gen2 storage account, which should have a name similar to synapsexxxxxxx (Primary - datalakexxxxxxx).

  5. Expand your storage account and verify that it contains a file system container named files.

  6. Select the files container, and note that it contains a folder named sales. This folder contains the data files you are going to query.

  7. Open the sales folder and the csv folder it contains, and observe that this folder contains .csv files for three years of sales data.

  8. Right-click any of the files and select Preview to see the data it contains. Note that the files contain a header row.

  9. Close the preview, and then use the button to navigate back to the sales folder.


Use SQL to query CSV files



Transform data using CREATE EXTERAL TABLE AS SELECT (CETAS) statements

A simple way to use SQL to transform data in a file and persist the results in another file is to use a CREATE EXTERNAL TABLE AS SELECT (CETAS) statement. This statement creates a table based on the requests of a query, but the data for the table is stored as files in a data lake. The transformed data can then be queried through the external table, or accessed directly in the file system (for example, for inclusion in a downstream process to load the transformed data into a data warehouse).


Create an external data source and file format

By defining an external data source in a database, you can use it to reference the data lake location where you want to store files for external tables. An external file format enables you to define the format for those files - for example, Parquet or CSV. To use these objects to work with external tables, you need to create them in a database other than the default master database.

  1. In Synapse Studio, on the Develop page, in the + menu, select SQL script.

  2. In the new script pane, add the following code (replacing datalakexxxxxxx with the name of your data lake storage account) to create a new database and add an external data source to it.


sql-- Database for sales data
CREATE DATABASE Sales
  COLLATE Latin1_General_100_BIN2_UTF8;
GO;

Use Sales;
GO;

-- External data is in the Files container in the data lake
CREATE EXTERNAL DATA SOURCE sales_data WITH (
    LOCATION = 'https://datalakexxxxxxx.dfs.core.windows.net/files/'
);
GO;

-- Format for table files
CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
            FORMAT_TYPE = PARQUET,
            DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
        );
GO;


  1. Modify the script properties to change its name to Create Sales DB, and publish it.

  2. Ensure that the script is connected to the Built-in SQL pool and the master database, and then run it.

  3. Switch back to the Data page and use the ↻ button at the top right of Synapse Studio to refresh the page. Then view the Workspace tab in the Data pane, where a SQL database list is now displayed. Expand this list to verify that the Sales database has been created.

  4. Expand the Sales database, its External Resources folder, and the External data sources folder under that to see the sales_data external data source you created.


Create an External table

Encapsulate data transformation in a stored procedure




3 views0 comments

Recent Posts

See All

L20-Use Delta Lake in Azure Databricks

DP-203-Labs-20 Delta Lake is an open source project to build a transactional data storage layer for Spark on top of a data lake. Delta...

Σχόλια


bottom of page