PySpark Installation
Do PySpark, fun it will be, they said. I had a pretty frustrating week just trying to install it and get it working: primarily because I am not very technical, and secondarily because everything is a huge clusterfuck.
Where do we start? First, to use PySpark you need Spark, and to use Spark ideally it should be installed on a Hadoop platform.
To install Hadoop you can download CDH from Cloudera's website, as my friends did.
Steps I followed:
- Install a virtual machine on your Windows system
- Everyone preferred Oracle's VirtualBox because it is free and trusted
- Done. Easy peasy.
- Download the CDH ISO file from Cloudera
- This is where things started going bonkers
- Cloudera did away with CDH downloads in favor of CDP.
- What is CDP, you ask?
- It is like a combination of CDH and HDP.
- You don't care? Fair enough, neither did I
- The problem is that for CDH they gave you an ISO file; for CDP they give you commands to run on a Linux machine to download it, and that for 60 days only
- Now I realize I have to install a Linux OS on my virtual machine to execute the commands
- So, after a YouTube video on how to do it, I install Ubuntu
- I was prematurely happy about how smoothly it was going.
- Ubuntu got installed alright, but I could not execute those commands to download CDP because my (also latest) version of Ubuntu is not compatible with CDP
- So, another try: I download Oracle Linux and check for a compatible version
- Great, this is also done.
- But it doesn't have wget installed, so I cannot run the CDP commands.
- I search and search to no avail to resolve the issue; I try downloading wget using curl or yum or something, and nothing works
- Screw Oracle, I'll go back to Ubuntu and install a compatible version
- Woohoo, the first and second commands worked like a charm, or so I thought
- I get an error on the third command about some missing files
- After several hours of hunting I find several solutions, none of which works
- In the end it kept giving an error about a missing library, and the logs mentioned no space left in the cache directory.
- After a few more tries I surrendered
At this point I thought: very well, no Hadoop, I'll run PySpark on Windows directly
- Downloaded Java (JDK 15)
- Downloaded Spark
- Downloaded Anaconda
- Set the path variables as mentioned in a YouTube video (see the sketch after this list)
- Finally got PySpark working in the shell
- Still could not manage to get it running in a Jupyter notebook
- Till I researched a bit more and created a file with three lines to be executed from the Anaconda prompt:
- set PYSPARK_DRIVER_PYTHON=jupyter
- set PYSPARK_DRIVER_PYTHON_OPTS=notebook
- pyspark
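For reference, the path setup from the YouTube video amounts to something like the lines below. This is only a sketch: the install directories and version numbers are my assumptions, so substitute wherever you actually unpacked the JDK and Spark (Spark on Windows also typically wants a winutils.exe with HADOOP_HOME pointing at its folder):

    rem Illustrative paths only; adjust to your actual install locations
    set JAVA_HOME=C:\Java\jdk-15
    set SPARK_HOME=C:\spark\spark-3.0.1-bin-hadoop2.7
    set PATH=%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin

In practice you would set these permanently through Environment Variables in the Windows system settings rather than per session.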
- After 100 hours sunk in, one fight with my wife, one wasted workday, four weekdays, and hundreds of forums, I could finally run a sample Spark program
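In case it saves someone else a weekend, a sample program along these lines is all it takes to verify the setup (the code below is my illustration, not the exact program from that day): start a SparkSession, build a tiny DataFrame, and print it. If this runs in the notebook, the install is healthy.

    # Minimal PySpark sanity check: start a session, build a tiny DataFrame, show it
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("install-check").getOrCreate()

    df = spark.createDataFrame(
        [(1, "spark"), (2, "finally"), (3, "works")],
        ["id", "word"],
    )
    df.show()  # prints a small table if everything is wired up

    spark.stop()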