PySpark Installation


Do PySpark, they said. It will be fun, they said. I had a pretty frustrating week just trying to install it and get it working: primarily because I am not very technical, and secondarily because everything is a huge clusterfuck.


Where do we start? First, to use PySpark you need Spark, and to use Spark, ideally it should be installed on a Hadoop platform.


To install Hadoop you can download CDH from the Cloudera website, as my friends did.


Steps I followed:


  1. Install a virtual machine on your Windows system
    1. Everyone preferred Oracle's VirtualBox because it is free and trusted
      1. Done. Easy peasy.
  2. Download the CDH ISO file from Cloudera
    • This is where things started getting bonkers
      • Cloudera did away with CDH downloads in favor of CDP.
      • What is CDP, you ask?
        • It is like a combination of CDH and HDP.
          • You don't care about it? Fair enough, neither did I.
      • The problem is that for CDH they gave an ISO file; for CDP they give commands which you run on a Linux machine to download it, and that works for 60 days only.
  3. Now I realize I have to install a Linux OS on my virtual machine to execute the commands
    1. So after a YouTube video on how to do it, I installed Ubuntu
      1. I was prematurely happy with how smoothly it was going.
      2. Ubuntu installed alright, but I could not execute those commands to download CDP because my (also latest) version of Ubuntu is not compatible with CDP.
    2. So for another try I downloaded Oracle Linux, checking for a compatible version this time
      1. Great, this is also done.
        1. But it doesn't have wget installed, so I cannot run the CDP commands.
        2. I searched and searched to no avail; I tried to get wget using curl or yum or whatever, and nothing worked (see the note after this list).
    3. Screw Oracle, I'll go back to Ubuntu and install a compatible version
      1. Woohoo, the first and second commands worked like a charm, or so I thought.
      2. I got an error on the third command about some missing files.
      3. After several hours of hunting I found several solutions, none of which worked.
      4. In the end it gave an error about a missing library, and the logs mentioned no space left in the cache directory.
      5. After a few more tries, I surrendered.
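
A note on that wget dead end: on a RHEL-family distribution like Oracle Linux, the standard way to get wget is from the base repo with sudo yum install wget, assuming the default repositories are configured and reachable. I can't promise that would have rescued this particular VM, but it's the first thing worth trying before giving up.
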
At this point I thought, very well, no Hadoop; I'll do PySpark on Windows directly.

  • Downloaded Java (JDK 15)
  • Downloaded Spark
  • Downloaded Anaconda
  • Set the paths as mentioned in a YouTube video (see the note after this list for the variables involved)
    • Finally got PySpark working in the shell
    • Still could not manage to get it running in a Jupyter notebook
    • Till I researched a bit more and created a file with three lines to be executed from the Anaconda Prompt
      • set PYSPARK_DRIVER_PYTHON=jupyter
      • set PYSPARK_DRIVER_PYTHON_OPTS=notebook
      • pyspark
  • After 100 hours sunk in, one fight with my wife, one wasted workday, 4 weekdays, and hundreds of forums, I could finally run a sample Spark program (a minimal version of which is sketched below)
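
A side note on the "Set the paths" step, since the videos gloss over it: it usually boils down to a handful of environment variables. The names below are the conventional ones; the paths are placeholders for wherever you installed things, so adjust accordingly:

  • JAVA_HOME → your JDK folder (e.g. C:\Program Files\Java\jdk-15)
  • SPARK_HOME → the folder where you unpacked Spark
  • PATH → add %JAVA_HOME%\bin and %SPARK_HOME%\bin to it
  • Many guides also point HADOOP_HOME at a folder containing winutils.exe, which Spark on Windows tends to want even without a real Hadoop install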

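For completeness, here is the kind of sample program that counts as victory. This is a minimal sketch of a PySpark smoke test, not the exact code I ran; nothing in it is specific to my setup beyond wanting a local session:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session using all available cores
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("hello-pyspark") \
        .getOrCreate()

    # Build a tiny DataFrame and poke at it to prove Spark is alive
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
        ["name", "age"],
    )
    df.show()
    print("Rows:", df.count())

    spark.stop()

If df.show() prints a neat three-row table, the whole ordeal above was worth it.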