PySpark Installation
Do PySpark, fun it will be, they said. I had a pretty frustrating week just trying to install it and get it working: primarily because I am not very technical, and secondarily because everything is a huge clusterfuck.
Where do we start? First, to use PySpark you need Spark, and to use Spark it should ideally be installed on a Hadoop platform.
To install Hadoop you can download CDH from the Cloudera website, as my friends did.
Steps I followed:
- Install a virtual machine on your Windows system
- Everyone preferred Oracle's VirtualBox because it is free and trusted
- Done. Easy peasy
- Download the CDH ISO file from Cloudera
- This is where things started getting bonkers
- Cloudera did away with downloads of CDH in favor of CDP.
- What is CDP, you ask?
- It is like a combination of CDH and HDP.
- You don't care about it? Fair enough, neither did I
- The problem is that for CDH they gave you an ISO file; for CDP they give you commands to run on a Linux machine to download it, and that only for a 60-day trial
- Now I realized I had to install a Linux OS on my virtual machine to execute the commands
- So, after a YouTube video on how to do it, I installed Ubuntu
- I was prematurely happy about how smoothly it was going.
- Ubuntu got installed all right, but I could not execute those commands to download CDP because my (also latest) version of Ubuntu is not compatible with CDP
- So, for another try, I downloaded Oracle Linux and checked for a compatible version
- Great, this is also done.
- But it doesn't have wget installed, so I could not run the CDP commands.
- I searched and searched to no avail; I tried to get wget using curl or yum or something, but nothing worked
- Screw Oracle, I'll go back to Ubuntu and install a compatible version
- Woohoo, the first command and the second command worked like a charm, or so I thought
- I got an error on the third command: some files were missing
- After several hours of hunting I found several solutions, none of which worked
- In the end it kept giving an error about a missing library, and the logs mentioned no space left in the cache directory.
- After a few more tries I surrendered
 
At this point I thought: very well, no Hadoop, I'll run PySpark on Windows directly
- Downloaded Java (JDK 15)
- Downloaded Spark
- Downloaded Anaconda
- Set the paths as mentioned in a YouTube video (see the sketch after this list)
- Finally got PySpark working in the shell
- Still could not manage to get it running in Jupyter Notebook
- Till I researched a bit more and created a file with three lines to be executed from the Anaconda shell:
- set PYSPARK_DRIVER_PYTHON=jupyter
- set PYSPARK_DRIVER_PYTHON_OPTS=notebook
- pyspark
 
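For reference, none of that works unless the environment variables point at your installs. What follows is only a sketch of the typical setup the tutorials describe, and the paths are hypothetical; substitute wherever you actually put the JDK and Spark (many guides also have you add a winutils.exe for Hadoop on Windows):

rem Hypothetical install locations - adjust to your machine
set JAVA_HOME=C:\Java\jdk-15
set SPARK_HOME=C:\Spark\spark-3.0.1-bin-hadoop2.7
set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin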
- After 100 hours sunk in, one fight with my wife, one wasted workday, four weekdays, and hundreds of forums, I could finally run a sample Spark program (a minimal version is sketched below)
 
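For the record, the sample program was nothing fancy. A minimal sketch along these lines (the DataFrame contents are made up purely for illustration) is enough to confirm the whole setup works:

from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

# A tiny made-up dataset, just to prove Spark runs end to end
df = spark.createDataFrame(
    [("hadoop", 1), ("spark", 2), ("pyspark", 3)],
    ["name", "rank"],
)

df.show()          # prints the rows as a small table
print(df.count())  # 3

spark.stop()

If df.show() prints its three rows without a stack trace, the installation is finally, mercifully, done.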