
Installing PySpark on Windows: A Step-by-Step Guide

  • Writer: Harini Mallawaarachchi
  • Dec 30, 2023
  • 1 min read

Apache Spark is a powerful open-source distributed computing system that provides a fast, general-purpose cluster-computing framework for big data processing. PySpark, the Python API for Apache Spark, lets developers use Spark's capabilities from the Python programming language. PySpark is most commonly used in a Linux environment, and installing it on a Windows machine can be a bit challenging. In this blog post, we will walk through the steps to install PySpark on a Windows system.


Step 1 - Install JDK

PySpark requires Java to run. Spark runs on Java 8, 11, or 17, so download and install one of those JDK versions from the official Oracle website rather than simply picking the newest release.

Java Development Kit (JDK): JDK 17


Install it in the path below:

C:\java\jdk

To verify the installed Java version, run the line below in cmd. On Java 8, use java -version instead, because the double-dash form was only added in Java 9.

java --version
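The same check can be scripted. The following is a minimal sketch, not part of the original setup: it assumes java is on the PATH, and the helper names parse_java_major and installed_java_major are made up for illustration. It uses the single-dash java -version form because that works on every Java version listed above.

```python
import re
import subprocess

# Java versions this guide names as supported by Spark.
SUPPORTED_MAJORS = {8, 11, 17}


def parse_java_major(version_text):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_381").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and parse the result; None if Java is not on PATH."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    # The version banner usually goes to stderr, so look at both streams.
    return parse_java_major(result.stderr + result.stdout)
```

Calling installed_java_major() and checking that the result is in SUPPORTED_MAJORS gives a quick yes/no answer before moving on to the Spark download.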



Step 2 - Install Spark

Go to the downloads page on the Apache Spark website and select the latest version.



Download the archive and use WinRAR or 7-Zip to extract its contents.

Extract it into the path below:

C:\Users\harin\PySpark-Learn
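The name of the extracted folder depends on the Spark version you downloaded, so it can be handy to locate it from Python instead of typing it by hand. The sketch below is an assumption on our part and not a required step: find_spark_home and export_spark_home are hypothetical helpers, and setting SPARK_HOME is a common convention (tools such as findspark read it) that this guide does not otherwise configure.

```python
import os
from pathlib import Path


def find_spark_home(base_dir):
    """Return the first folder under base_dir that looks like a Spark distribution."""
    for candidate in sorted(Path(base_dir).glob("spark-*")):
        bin_dir = candidate / "bin"
        # A Spark distribution ships launcher scripts such as spark-submit in bin.
        if bin_dir.is_dir() and any(bin_dir.glob("spark-submit*")):
            return candidate
    return None


def export_spark_home(base_dir, environ=os.environ):
    """Set SPARK_HOME for the current process only; return the path or None."""
    spark_home = find_spark_home(base_dir)
    if spark_home is not None:
        environ["SPARK_HOME"] = str(spark_home)
    return spark_home
```

For example, export_spark_home(r"C:\Users\harin\PySpark-Learn") returns the extracted folder if one is found and None otherwise. The variable is set only for the running Python process, not system-wide.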


Step 3 - Install Python

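Install Python. Spark only runs on Python 3.7+, and recent Spark releases may require a newer Python, so check the documentation for the version you downloaded.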



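After installation, create a virtual environment in your project folder and activate it with the commands below. Once it is active, the environment name appears at the beginning of the prompt.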

python -m venv .pyspark-env
.pyspark-env\Scripts\activate
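To confirm from inside Python that the interpreter is new enough and that the virtual environment is really active, a small check like the one below can help. It is a sketch only: the function names are made up for illustration, and the 3.7 minimum is simply the one stated in this guide.

```python
import sys

# Minimum Python version stated in this guide.
MINIMUM_PYTHON = (3, 7)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def in_virtual_env(prefix=None, base_prefix=None):
    """True when running inside a venv, where sys.prefix differs from sys.base_prefix."""
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        base_prefix = sys.base_prefix
    return prefix != base_prefix
```

Run python in the activated prompt and call meets_minimum() and in_virtual_env(). Both should return True if the previous commands worked.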


Step 4 - Install pyspark, JupyterLab

Now install pyspark, findspark, and JupyterLab, the popular notebook interface for Python.


Navigate into the relevant folder path in cmd.

cd C:\Users\harin\PySpark-Learn

Run the commands below one by one in cmd, with the virtual environment activated so that the packages are installed into it.

pip install pyspark

pip install findspark

pip install jupyterlab
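Once the three installs finish, it is worth checking that each package can actually be found by the active interpreter. The snippet below is a minimal sketch using only the standard library. The missing_modules helper is hypothetical, and it only checks that the modules can be located, not that they work.

```python
import importlib.util

# Import names of the three packages installed in this step.
REQUIRED_MODULES = ("pyspark", "findspark", "jupyterlab")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from missing_modules() means all three packages are visible. Any name that comes back usually means pip ran outside the virtual environment.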


Step 5 - Launch JupyterLab and use PySpark

After the installation, launch JupyterLab from cmd:

jupyter-lab

This launches JupyterLab in your default web browser.





Step 6 - Test PySpark Installation

Create a new notebook.










Then run notebook 1 from the Git repo.
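If the repository notebook is not at hand, a cell like the following is one way to check that PySpark starts and can process data. This is our own minimal sketch, not the contents of notebook 1: the app name, the sample rows, and the run_smoke_test function are all made up for illustration, and it assumes Java and pyspark are installed as described above.

```python
# Tiny in-memory dataset used only for this check.
SAMPLE_ROWS = [(1, "a"), (2, "b"), (3, "c")]


def run_smoke_test():
    """Start a local Spark session, build a small DataFrame, and return its row count."""
    # Imported here so the cell can be defined even before pyspark is installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")        # run Spark locally, using all available cores
        .appName("install-check")  # hypothetical name shown in the Spark UI
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(SAMPLE_ROWS, ["id", "label"])
        return df.count()
    finally:
        # Always release the session, even if the check fails.
        spark.stop()
```

Calling run_smoke_test() in a notebook cell should return the number of sample rows. An error at this point usually points back to the Java or Python setup in the earlier steps.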


Note: Reopening JupyterLab

You have now installed all the software required for PySpark.

Whenever you want to run JupyterLab again, run the lines below in cmd to open the notebook interface. The virtual environment already exists, so it only needs to be activated, not created again.

cd C:\Users\harin\PySpark-Learn
.pyspark-env\Scripts\activate
jupyter-lab




