Wednesday, 23 November 2016

PySpark Computing Pilot: The Installation

To start off: apologies for the late submission of this blog post. I got stuck with WordPress and finally decided to go with good old Google Blogspot.
Apologies aside, let's start.

First, I use Ubuntu Xenial Xerus (16.04) on a Lenovo laptop with an i7 processor.
You may ask: why not Windows?

I am unaware of any reason to concern yourself with the OS choice. It's really a language choice, and Python is the most obvious recommendation: it is popular, growing, runs on virtually every OS, has a myriad of data analysis libraries, and is easy to begin working with.

I use Linux because it is a more developer-friendly operating system: it gives me better flexibility when I want to create virtual environments and lets me install libraries quickly. Moreover, companies tend to prefer Linux since it is open source and ideal for day-to-day operations.

Simply put, I love free community learning. (Go open source!)


Second question: why Spark?

One-word answer: speed. Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

N.B.: As opposed to Hadoop MapReduce, Spark keeps its working data in RAM instead of writing it to disk between steps, which is where the speedup comes from.

Now let's start.

Open a terminal (Ctrl+Alt+T will start it) and check which version of Linux you have; then type python to see which Python version is installed.
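
For instance (the output will reflect your own setup):

lsb_release -a      # prints the Ubuntu release, e.g. 16.04 (Xenial Xerus)
python --version    # prints the default Python version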


Step 1: Downloading Anaconda 4.2.0 inside Ubuntu:
Inside Ubuntu, open a web browser and go to https://www.continuum.io/downloads, then download Anaconda 4.2.0 for Linux. Once the download is finished, open the terminal, type the following command after the $ symbol and press Enter (adjust the path if your browser saved the file elsewhere):

bash ~/Downloads/Anaconda3-4.2.0-Linux-x86_64.sh


Now the installation takes place. Once it is done you get a message on the command line that Anaconda 4.2.0 was successfully installed. Close the terminal and open it again. Then, to check whether Anaconda 4.2.0 got installed properly, type python at the command prompt and press Enter.
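
A couple of quick sanity checks (the exact output and paths will vary by machine):

conda --version    # prints the conda version that ships with Anaconda
which python       # should point inside your anaconda3 directory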

Please note: when the install script asks if Anaconda should be placed in the system path, say 'yes'.

Download and configure Spark

The latest version of Spark can be downloaded from the Apache Spark downloads page (http://spark.apache.org/downloads.html).
Select the latest Spark release.
Select the package type as 'Pre-built for Hadoop 2.7 and later'.
Select 'Direct Download' as the download type and finally click the 'Download Spark' link to begin your download.

After the download completes, extract the .tgz file and move the extracted directory to an easily accessible location. In my case it was /home/sompal/Spark (here Spark is the new name I gave the directory after moving it).
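
For example, assuming the archive you downloaded is named spark-2.0.2-bin-hadoop2.7.tgz (substitute the actual filename of your release):

tar -xzf spark-2.0.2-bin-hadoop2.7.tgz    # extract the archive
mv spark-2.0.2-bin-hadoop2.7 ~/Spark      # move it to your home directory and rename it to Spark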

Now open a terminal and run

gedit ~/.profile

(No sudo is needed here, since .profile lives in your own home directory.)

This will open the .profile file in the default text editor. Add the following lines at the end of the file (replace username in SPARK_HOME with your actual user name):
export SPARK_HOME=/home/username/Spark
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=127.0.0.1
To get the correct name of the py4j-n.n-src.zip file, look inside $SPARK_HOME/python/lib and use the actual filename you see there.
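
For example (the py4j version below is only an illustration; yours may differ):

ls $SPARK_HOME/python/lib    # e.g. py4j-0.10.1-src.zip  pyspark.zip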

Save the file, exit the editor, and then reload the edited profile by running

source ~/.profile

Now the system is configured for running Spark.

To run Spark, open a terminal, type pyspark and press Enter.
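
As a quick smoke test, try this at the pyspark prompt (sc is the SparkContext that pyspark creates for you automatically):

sc.parallelize(range(10)).sum()    # distributes the numbers 0-9 and adds them up; should return 45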

Let us write the famous Word Count program to begin with. For this program to work you will need a text file.

Let's use one of my favorite poems, Invictus by William Ernest Henley, saved as a plain text file.
Then enter each of these lines as commands at the pyspark prompt:

from operator import add

# Load the text file into an RDD of lines (replace "datafile" with the path to your file)
text = sc.textFile("datafile")
print(text)

# Split each line into individual words
def tokenize(line):
    return line.split()

words = text.flatMap(tokenize)
print(words)

# Pair each word with a count of 1
wc = words.map(lambda x: (x, 1))
print(wc.toDebugString())

# Sum the counts for each word and write the result out
counts = wc.reduceByKey(add)
counts.saveAsTextFile("output-dir")
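
Note that saveAsTextFile writes a directory of part files rather than a single output file. To see the word counts, print those part files from a regular terminal:

cat output-dir/part-*    # each line is a (word, count) pair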

So far so good...
Stay tuned for some Machine Learning using Spark.