  • Harini Mallawaarachchi

Analyze data with Apache Spark


Apache Spark is an open-source engine for distributed data processing that is widely used to explore, process, and analyze large volumes of data in data lake storage. Spark is available as a processing option in many data platform products, including Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Microsoft Fabric. One benefit of Spark is its support for a wide range of programming languages, including Java, Scala, Python, and SQL, which makes it a flexible choice for workloads such as data cleansing and manipulation, statistical analysis and machine learning, and data analytics and visualization.


This lab will take approximately 45 minutes to complete.



Note: You need a Microsoft school or work account to complete this exercise. If you don’t have one, you can sign up for a trial of Microsoft Office 365 E3 or higher.


Create a workspace

Create a lakehouse and upload files

Create a notebook

Load data into a dataframe

Explore data in a dataframe

Use Spark to transform data files

Work with tables and SQL

Visualize data with Spark

Save the notebook and end the Spark session

Clean up resources



