Analyze data in a data lake with Spark

Harini Mallawaarachchi

Updated: Dec 18, 2023


Apache Spark is an open-source engine for distributed data processing, widely used to explore, process, and analyze huge volumes of data in data lake storage. Spark is available as a processing option in many data platform products, including Azure HDInsight, Azure Databricks, and Azure Synapse Analytics on the Microsoft Azure cloud platform. One benefit of Spark is its support for a wide range of programming languages, including Java, Scala, Python, and SQL. This makes Spark a flexible solution for data processing workloads such as data cleansing and manipulation, statistical analysis and machine learning, and data analytics and visualization.


Before you start

You'll need an Azure subscription in which you have administrative-level access.


Review the Apache Spark in Azure Synapse Analytics article in the Azure Synapse Analytics documentation.


Query data in files

Analyze data in a dataframe

Query data using Spark SQL

Visualize data with Spark


