PySpark Installation


Do PySpark, they said. It will be fun, they said. I had a pretty frustrating week just trying to install it and get it working: primarily because I am not very technical, and secondarily because everything is a huge clusterfuck.


Where do we start? First, to use PySpark you need Spark, and to use Spark, ideally it should be installed on a Hadoop platform.


To install Hadoop you can download CDH from the Cloudera website, as my friends did.


Steps I followed:


  1. Install a virtual machine on your Windows system
    1. Everyone preferred Oracle's VirtualBox because it is free and trusted
      1. Done. Easy peasy.
  2. Download the CDH ISO file from Cloudera
    • This is where things started getting bonkers
      • Cloudera did away with CDH downloads in favor of CDP.
      • What is CDP, you ask?
        • It is like a combination of CDH and HDP.
          • You don't care about it? Fair enough, neither did I.
      • The problem is that for CDH they gave an ISO file; for CDP they give commands which you run on a Linux machine to download it, and that works for 60 days only.
  3. Now I realize I have to install a Linux OS on my virtual machine to execute the commands
    1. So after a YouTube video on how to do it, I installed Ubuntu
      1. I was prematurely happy with how smoothly it was going.
      2. Ubuntu installed alright, but I could not execute those commands to download CDP because my (also latest) version of Ubuntu is not compatible with CDP.
    2. So for another try I downloaded Oracle Linux, checking for a compatible version this time
      1. Great, this is also done.
        1. But it doesn't have wget installed, so I cannot run the CDP commands.
        2. I searched and searched to no avail; I tried to get wget using curl or yum or whatever, and nothing worked (see the note after this list).
    3. Screw Oracle, I'll go back to Ubuntu and install a compatible version
      1. Woohoo, the first and second commands worked like a charm, or so I thought.
      2. I got an error on the third command about some missing files.
      3. After several hours of hunting I found several solutions, none of which worked.
      4. In the end it gave an error about a missing library, and the logs mentioned no space left in the cache directory.
      5. After a few more tries, I surrendered.
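
A note on that wget dead end: on a RHEL-family distribution like Oracle Linux, the standard way to get wget is from the base repo with sudo yum install wget, assuming the default repositories are configured and reachable. I can't promise that would have rescued this particular VM, but it's the first thing worth trying before giving up.
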
At this point I thought, very well, no Hadoop; I'll do PySpark on Windows directly.

  • Downloaded Java (JDK 15)
  • Downloaded Spark
  • Downloaded Anaconda
  • Set the paths as mentioned in a YouTube video (see the note after this list for the variables involved)
    • Finally got PySpark working in the shell
    • Still could not manage to get it running in a Jupyter notebook
    • Till I researched a bit more and created a file with three lines to be executed from the Anaconda Prompt
      • set PYSPARK_DRIVER_PYTHON=jupyter
      • set PYSPARK_DRIVER_PYTHON_OPTS=notebook
      • pyspark
  • After 100 hours sunk in, one fight with my wife, one wasted workday, 4 weekdays, and hundreds of forums, I could finally run a sample Spark program (a minimal version of which is sketched below)
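
A side note on the "Set the paths" step, since the videos gloss over it: it usually boils down to a handful of environment variables. The names below are the conventional ones; the paths are placeholders for wherever you installed things, so adjust accordingly:

  • JAVA_HOME → your JDK folder (e.g. C:\Program Files\Java\jdk-15)
  • SPARK_HOME → the folder where you unpacked Spark
  • PATH → add %JAVA_HOME%\bin and %SPARK_HOME%\bin to it
  • Many guides also point HADOOP_HOME at a folder containing winutils.exe, which Spark on Windows tends to want even without a real Hadoop install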

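For completeness, here is the kind of sample program that counts as victory. This is a minimal sketch of a PySpark smoke test, not the exact code I ran; nothing in it is specific to my setup beyond wanting a local session:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session using all available cores
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("hello-pyspark") \
        .getOrCreate()

    # Build a tiny DataFrame and poke at it to prove Spark is alive
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
        ["name", "age"],
    )
    df.show()
    print("Rows:", df.count())

    spark.stop()

If df.show() prints a neat three-row table, the whole ordeal above was worth it.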