top of page

Explore a relational data warehouse

  • Dec 18, 2023
  • 1 min read

Azure Synapse Analytics is built on a scalable set capabilities to support enterprise data warehousing; including file-based data analytics in a data lake as well as large-scale relational data warehouses and the data transfer and transformation pipelines used to load them. In this lab, you’ll explore how to use a dedicated SQL pool in Azure Synapse Analytics to store and query data in a relational data warehouse.


Before you start

You'll need an Azure subscription in which you have administrative-level access.

Review the What is dedicated SQL pool in Azure Synapse Analytics? article in the Azure Synapse Analytics documentation.



Explore the data warehouse schema

In this lab, the data warehouse is hosted in a dedicated SQL pool in Azure Synapse Analytics.


Start the dedicated SQL pool

  1. After the script has completed, in the Azure portal, go to the dp500-xxxxxxx resource group that it created, and select your Synapse workspace.

  2. In the Overview page for your Synapse workspace, in the Open Synapse Studio card, select Open to open Synapse Studio in a new browser tab; signing in if prompted.

  3. On the left side of Synapse Studio, use the ›› icon to expand the menu - this reveals the different pages within Synapse Studio that are used to manage resources and perform data analytics tasks.

  4. On the Manage page, ensure the SQL pools tab is selected and then select the sqlxxxxxxx dedicated SQL pool and use its icon to start it; confirming that you want to resume it when prompted.


  1. Wait for the SQL pool to resume. This can take a few minutes. Use the ↻ Refresh button to check its status periodically. The status will show as Online when it is ready.View the tables in the database


Note: The serverless SQL pool, Built-in, is immediately available for your workspace. Dedicated SQL pools can be configured to adapt to team or organizational requirements and constraints. Learn more
The size of a dedicated SQL pool (formerly SQL DW) is determined by Data Warehousing Units (DWU).

View the tables in the database

  1. In Synapse Studio, select the Data page and ensure that the Workspace tab is selected and contains a SQL database category.

  2. Expand SQL database, the sqlxxxxxxx pool, and its Tables folder to see the tables in the database.

  3. Expand the dbo.FactInternetSales table and its Columns folder to see the columns in this table. Note that many of the columns are keys that reference rows in the dimension tables. Others are numeric values (measures) for analysis.


  1. View the columns for the dbo.DimPromotion table, and note that it has a unique PromotionKey that uniquely identifies each row in the table. It also has an AlternateKey.


  1. View the columns for the dbo.DimProduct, and note that it contains a ProductSubcategoryKey column, which references the dbo.DimProductSubcategory table, which in turn contains a ProductCategoryKey column that references the dbo.DimProductCategory table.


  1. View the columns for the dbo.DimDate table, and note that it contains multiple columns that reflect different temporal attributes of a date - including the day of week, day of month, month, year, day name, month name, and so on.


Query the data warehouse tables

Now that you have explored some of the more important aspects of the data warehouse schema, you’re ready to to query the tables and retrieve some data.


Query fact and dimension tables

Numeric values in a relational data warehouse are stored in fact tables with related dimension tables that you can use to aggregate the data across multiple attributes. This design means that most queries in a relational data warehouse involve aggregating and grouping data (using aggregate functions and GROUP BY clauses) across related tables (using JOIN clauses).

  1. On the Data page, select the sqlxxxxxxx SQL pool and in its menu, select New SQL script > Empty script.

  2. When a new SQL Script 1 tab opens, in its Properties pane, change the name of the script to Analyze Internet Sales and change the Result settings per query to return all rows. Then use the Publish button on the toolbar to save the script, and use the Properties button (which looks similar to 🗏.) on the right end of the toolbar to close the Properties pane so you can see the script pane.

In the empty script, add the following code:

SELECT  d.CalendarYear AS Year, 
		SUM(i.SalesAmount) AS InternetSalesAmount 
FROM FactInternetSales AS i 
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear 
ORDER BY Year;
  1. Use the ▷ Run button to run the script, and review the results, which should show the Internet sales totals for each year. This query joins the fact table for Internet sales to a time dimension table based on the order date, and aggregates the sales amount measure in the fact table by the calendar month attribute of the dimension table.



Modify the query as follows to add the month attribute from the time dimension, and then run the modified query.

SELECT   d.CalendarYear AS Year,
		d.MonthNumberOfYear AS Month,
		SUM(i.SalesAmount) AS InternetSalesAmount 
FROM FactInternetSales AS i
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear 
ORDER BY Year, Month;

Note that the attributes in the time dimension enable you to aggregate the measures in the fact table at multiple hierarchical levels - in this case, year and month. This is a common pattern in data warehouses.



Modify the query as follows to remove the month and add a second dimension to the aggregation, and then run it to view the results (which show yearly Internet sales totals for each region):

SELECT   d.CalendarYear AS Year,
		g.EnglishCountryRegionName AS Region, 
		SUM(i.SalesAmount) AS InternetSalesAmount 
FROM FactInternetSales AS i
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey
JOIN DimCustomer AS c ON i.CustomerKey = c.CustomerKey
JOIN DimGeography AS g ON c.GeographyKey = g.GeographyKey
GROUP BY d.CalendarYear, g.EnglishCountryRegionName
ORDER BY Year, Region;

Note that geography is a snowflake dimension that is related to the Internet sales fact table through the customer dimension. You therefore need two joins in the query to aggregate Internet sales by geography.


Modify and re-run the query to add another snowflake dimension and aggregate the yearly regional sales by product category:


SELECT  d.CalendarYear AS Year, 
        pc.EnglishProductCategoryName AS ProductCategory,
        g.EnglishCountryRegionName AS Region,
        SUM(i.SalesAmount) AS InternetSalesAmount
FROM FactInternetSales AS i
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey
JOIN DimCustomer AS c ON i.CustomerKey = c.CustomerKey
JOIN DimGeography AS g ON c.GeographyKey = g.GeographyKey
JOIN DimProduct AS p ON i.ProductKey = p.ProductKey
JOIN DimProductSubcategory AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
JOIN DimProductCategory AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
GROUP BY d.CalendarYear, pc.EnglishProductCategoryName, g.EnglishCountryRegionName
ORDER BY Year, ProductCategory, Region;

This time, the snowflake dimension for product category requires three joins to reflect the hierarchical relationship between products, subcategories, and categories.


Publish the script to save it.


Use ranking functions

Another common requirement when analyzing large volumes of data is to group the data by partitions and determine the rank of each entity in the partition based on a specific metric.

Under the existing query, add the following SQL to retrieve sales values for 2022 over partitions based on country/region name:


SELECT   g.EnglishCountryRegionName AS Region, 
		ROW_NUMBER() 
		OVER(PARTITION BY g.EnglishCountryRegionName 
ORDER BY i.SalesAmount ASC) AS RowNumber,
i.SalesOrderNumber AS OrderNo,
i.SalesOrderLineNumber AS LineItem,
i.SalesAmount AS SalesAmount,
SUM(i.SalesAmount) OVER(PARTITION BY g.EnglishCountryRegionName) AS RegionTotal, 
AVG(i.SalesAmount) OVER(PARTITION BY g.EnglishCountryRegionName) AS RegionAverage 
FROM FactInternetSales AS i 
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey 
JOIN DimCustomer AS c ON i.CustomerKey = c.CustomerKey 
JOIN DimGeography AS g ON c.GeographyKey = g.GeographyKey 
WHERE d.CalendarYear = 2022
ORDER BY Region;

Select only the new query code, and use the ▷ Run button to run it. Then review the results, which should look similar to the following table:


  1. Observe the following facts about these results:

  • There’s a row for each sales order line item.

  • The rows are organized in partitions based on the geography where the sale was made.

  • The rows within each geographical partition are numbered in order of sales amount (from smallest to highest).

  • For each row, the line item sales amount as well as the regional total and average sales amounts are included.

 SELECT  g.EnglishCountryRegionName AS Region,
         g.City,
         SUM(i.SalesAmount) AS CityTotal,
         SUM(SUM(i.SalesAmount)) OVER(PARTITION BY g.EnglishCountryRegionName) AS RegionTotal,
         RANK() OVER(PARTITION BY g.EnglishCountryRegionName
                     ORDER BY SUM(i.SalesAmount) DESC) AS RegionalRank
 FROM FactInternetSales AS i
 JOIN DimDate AS d ON i.OrderDateKey = d.DateKey
 JOIN DimCustomer AS c ON i.CustomerKey = c.CustomerKey
 JOIN DimGeography AS g ON c.GeographyKey = g.GeographyKey
 GROUP BY g.EnglishCountryRegionName, g.City
 ORDER BY Region;

Select only the new query code, and use the ▷ Run button to run it. Then review the results, and observe the following:

  • The results include a row for each city, grouped by region.

  • The total sales (sum of individual sales amounts) is calculated for each city

  • The regional sales total (the sum of the sum of sales amounts for each city in the region) is calculated based on the regional partition.

  • The rank for each city within its regional partition is calculated by ordering the total sales amount per city in descending order.


Publish the updated script to save the changes.


Tip: ROW_NUMBER and RANK are examples of ranking functions available in Transact-SQL. For more details, see the Ranking Functions reference in the Transact-SQL language documentation.

Retrieve an approximate count

When exploring very large volumes of data, queries can take significant time and resources to run. Often, data analysis doesn’t require absolutely precise values - a comparison of approximate values may be sufficient.


Under the existing queries, add the following code to retrieve the number of sales orders for each calendar year:

SELECT 	d.CalendarYear AS CalendarYear,
 		COUNT(DISTINCT i.SalesOrderNumber) AS Orders 
FROM FactInternetSales AS i
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey 
GROUP BY d.CalendarYear ORDER BY CalendarYear;

Select only the new query code, and use the ▷ Run button to run it. Then review the output that is returned:

  • On the Results tab under the query, view the order counts for each year.


  • On the Messages tab, view the total execution time for the query.

Modify the query as follows, to return an approximate count for each year. Then re-run the query.

SELECT  d.CalendarYear AS CalendarYear, 
		APPROX_COUNT_DISTINCT(i.SalesOrderNumber) AS Orders 
FROM FactInternetSales AS i 
JOIN DimDate AS d ON i.OrderDateKey = d.DateKey 
GROUP BY d.CalendarYear 
ORDER BY CalendarYear;
  1. Review the output that is returned:

  • On the Results tab under the query, view the order counts for each year. These should be within 2% of the actual counts retrieved by the previous query.

  • On the Messages tab, view the total execution time for the query. This should be shorter than for the previous query.


Publish the script to save the changes.

Tip: See the APPROX_COUNT_DISTINCT function documentation for more details.

Challenge - Analyze reseller sales

  1. Create a new empty script for the sqlxxxxxxx SQL pool, and save it with the name Analyze Reseller Sales.

  2. Create SQL queries in the script to find the following information based on the FactResellerSales fact table and the dimension tables to which it is related:

The total quantity of items sold per fiscal year and quarter.

SELECT
    d.FiscalYear,
    d.FiscalQuarter,
    SUM(r.OrderQuantity) AS TotalQuantity
FROM
    FactResellerSales AS r
JOIN
    DimDate AS d ON r.OrderDateKey = d.DateKey
GROUP BY
    d.FiscalYear, d.FiscalQuarter;


  • The total quantity of items sold per fiscal year, quarter, and sales territory region associated with the employee who made the sale.


SELECT
    d.FiscalYear,
    d.FiscalQuarter,
    st.SalesTerritoryRegion,
    SUM(r.OrderQuantity) AS TotalQuantity
FROM
    FactResellerSales AS r
JOIN
    DimDate AS d ON r.OrderDateKey = d.DateKey
JOIN
    DimSalesTerritory AS st ON r.SalesTerritoryKey = st.SalesTerritoryKey
GROUP BY
    d.FiscalYear, d.FiscalQuarter, st.SalesTerritoryRegion;


  • The total quantity of items sold per fiscal year, quarter, and sales territory region by product category.


SELECT
    d.FiscalYear,
    d.FiscalQuarter,
    st.SalesTerritoryRegion,
    pc.EnglishProductCategoryName AS ProductCategory,
    SUM(r.OrderQuantity) AS TotalQuantity
FROM
    FactResellerSales AS r
JOIN
    DimDate AS d ON r.OrderDateKey = d.DateKey
JOIN
    DimSalesTerritory AS st ON r.SalesTerritoryKey = st.SalesTerritoryKey
JOIN
    DimProduct AS p ON r.ProductKey = p.ProductKey
JOIN
    DimProductSubcategory AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
JOIN
    DimProductCategory AS pc ON ps.ProductCategoryKey = pc.ProductCategoryKey
GROUP BY
    d.FiscalYear, d.FiscalQuarter, st.SalesTerritoryRegion, pc.EnglishProductCategoryName;

  • The rank of each sales territory per fiscal year based on total sales amount for the year.


WITH SalesTerritoryRanking AS (
    SELECT
        st.SalesTerritoryRegion,
        d.FiscalYear,
        RANK() OVER (PARTITION BY d.FiscalYear ORDER BY SUM(r.SalesAmount) DESC) AS TerritoryRank
    FROM
        FactResellerSales AS r
    JOIN
        DimDate AS d ON r.OrderDateKey = d.DateKey
    JOIN
        DimSalesTerritory AS st ON r.SalesTerritoryKey = st.SalesTerritoryKey
    GROUP BY
        st.SalesTerritoryRegion, d.FiscalYear
)
SELECT
    SalesTerritoryRegion,
    FiscalYear,
    TerritoryRank
FROM
    SalesTerritoryRanking;


  • The approximate number of sales order per year in each sales territory.


SELECT
    st.SalesTerritoryRegion,
    d.FiscalYear,
    COUNT(DISTINCT r.SalesOrderNumber) AS NumberOfOrders
FROM
    FactResellerSales AS r
JOIN
    DimDate AS d ON r.OrderDateKey = d.DateKey
JOIN
    DimSalesTerritory AS st ON r.SalesTerritoryKey = st.SalesTerritoryKey
GROUP BY
    st.SalesTerritoryRegion, d.FiscalYear;

Experiment with queries to explore the rest of the tables in the data warehouse schema as your leisure.


When you’re done, on the Manage page, pause the sqlxxxxxxx dedicated SQL pool.




Recent Posts

See All

Comments


bottom of page