Monday, December 29, 2014

Why did I use Big Data Technology (Spark) for Machine Learning (NLP)?


I am here to do sentiment analysis of a Twitter dataset, and I am trying to make it a generalized platform, irrespective of whether the source is Twitter or anything else, so the dimension and size of the training data is already large and will keep growing.

As a machine learning developer, I thought of trying Python as the technology, and for the trial I had around 1,600,000 records from Sentiment140.

Why not R?

Well, first I found that R is not the traditional way of coding, so comfort was a question, but building things in R is extremely easy. So that can't be the only reason to try Python.

Python is a nice language, and the NLTK library makes NLP very easy. There are loads of examples, and NLP is at least very powerful with NLTK alongside the scikit-learn library (machine learning algorithms).

It's very easy to code in Python compared to R, in my view.

There is brilliant support for Python in Hadoop Streaming, so I will vote up for Python again.

So here I got an opportunity to try something new, and that's always exciting :-)

(I am not saying Python is better than R, as I use both of them; in this particular scenario, for me, Python won out.)

So how to do that?

Well, sentiment analysis is not just one thing where I get the text, run an algorithm, and here we go. It's never like that. Once you get the data you need to decide its polarity, I mean:

First find whether each statement is positive or negative
Then create a feature set
Then find the stop words
Then blah blah blah ...
Then finally train it
Then test accuracy
Then finally predict
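
The first few steps above can be sketched in plain Python. This is only a minimal illustration, not the code from my repository: the stop-word list is a tiny made-up one, the two labelled statements are hypothetical, and the `contains(word)` feature naming is just a common bag-of-words convention.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "i", "to", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    """Build a bag-of-words feature dict, skipping stop words."""
    return {"contains(%s)" % word: True
            for word in tokenize(text) if word not in STOP_WORDS}


# Hypothetical labelled statements: (text, polarity).
labelled = [
    ("I love this phone", "positive"),
    ("This is the worst service", "negative"),
]

# Pair each feature dict with its polarity label, ready for training.
feature_sets = [(extract_features(text), label) for text, label in labelled]
```

A list of `(feature_dict, label)` pairs like `feature_sets` is the shape that NLTK's `NaiveBayesClassifier.train` expects.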

I mean it has loads of steps before reaching the goal, so it is in no way an ordinary task, but once you are clear on the concept, it's not tough either.

Check Text Processing.

What's more?

So I first tried a small dataset of around 400 records, with 100 test records, and ran Naive Bayes from NLTK:

 from nltk.classify import NaiveBayesClassifier
 classifier = NaiveBayesClassifier.train(X_train)  # X_train: list of (feature_dict, label) pairs

It ran in a very short time, around 10 seconds, and then the accuracy and precision tests took 10-20 seconds more, and that was it.
The file size was approx. 500 KB, that's all, and the whole thing took 20 to 30 seconds.
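
For anyone unsure what the accuracy and precision tests measure, here is a small hand-rolled sketch. It is not the NLTK implementation (NLTK ships `nltk.classify.accuracy` for the first one); the label lists below are made up purely for illustration.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def precision(gold, predicted, label):
    """Of everything predicted as `label`, the fraction that truly is `label`."""
    gold_for_predicted = [g for g, p in zip(gold, predicted) if p == label]
    if not gold_for_predicted:
        return 0.0  # nothing was predicted as this label
    hits = sum(1 for g in gold_for_predicted if g == label)
    return hits / len(gold_for_predicted)


# Made-up gold labels and classifier output for five test records.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]

acc = accuracy(gold, predicted)          # 3 of 5 predictions are right
prec = precision(gold, predicted, "pos")  # 2 of the 3 "pos" predictions are right
```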

Cool, that's brilliant: 30 seconds and all set.
But is it enough? Are we set?

Then why do I need a Big Data framework?

On exploring more I found a brilliant sentiment dataset available from Sentiment140. The file size was about 250 MB, or more precisely 1,600,000 records, which is a pretty decent size for getting a proper sentiment analysis result.

So I was all set to run the same algorithm on this new dataset. The file contains positive and negative statements collected from Twitter, so I changed the file name and ran the Python script.
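
As a rough sketch of how such a file can be turned into labelled statements: the column layout below (polarity, id, date, query, user, text, with polarity 0 for negative and 4 for positive) is my assumption about the Sentiment140 CSV, and the two rows are made up to stand in for the real file. The reader is a generator, so rows are handed over one at a time rather than all at once.

```python
import csv
import io

# Assumed Sentiment140 polarity codes: "0" = negative, "4" = positive.
POLARITY_LABELS = {"0": "negative", "4": "positive"}


def read_labelled_tweets(lines):
    """Yield (text, label) pairs one row at a time from CSV lines."""
    for row in csv.reader(lines):
        label = POLARITY_LABELS.get(row[0])
        if label is not None:  # skip rows with any other polarity code
            yield row[5], label


# Two made-up rows in the assumed layout: polarity, id, date, query, user, text.
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","my flight got cancelled"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","what a lovely morning"\n'
)
pairs = list(read_labelled_tweets(sample))
```

With the real file, an open file handle would be passed in place of `sample`.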

Damn!!! My system has quite a good config, and still a memory exception. What... can't I run this? Well, no, I can't. I tried several times, and I tried the same on a high-config Windows machine as well as a Mac, and got the same thing: MemoryError.

So I was left with only one choice: try the same on a Big Data platform, where it must at least run. I ran the same program in a Hadoop YARN environment & Spark, and finally it ran, but at what cost???
It took more than 14 minutes to finish the job.

So 14 minutes for 1,600,000 records, but I need to deal with 20 times this size. How the hell am I going to do that?
Simple mathematics says 20 * 14 minutes, and even that won't hold, because resource utilisation and memory pressure will push the time up more and more.
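
Spelled out, the simple mathematics looks like this. It assumes the best case, where run time grows linearly with the number of records, so the real figure would be higher.

```python
minutes_per_run = 14  # measured above for 1,600,000 records (it was "more than" this)
scale_factor = 20     # target data size relative to that run

# Best case: time grows linearly with the number of records.
total_minutes = minutes_per_run * scale_factor
hours, minutes = divmod(total_minutes, 60)

print("%d minutes = %d h %d min" % (total_minutes, hours, minutes))
```

That is 280 minutes, or 4 hours 40 minutes, as a lower bound for a single training run.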

So is it really a feasible option?

Now Spark fits here.

Spark has a brilliant library, MLlib, which contains almost all the popular machine learning APIs (Spark Library).
As I already said, I am using Naive Bayes, and Spark already has an API for it.

So I needed to change the code, because Spark works on the concept of the RDD, i.e. resilient distributed dataset (RDD Concept).

I am surely not going to explain the changes, but I must admit they were pretty important: almost everything changed in order to parallelise the process, which is the basic concept behind any Big Data framework, including Spark.

Now I was set. I ran Spark on my local system with YARN (no cluster), and surprisingly it took around 2.4 minutes. I then added more extensive operations to the code, and at most it went up to 3.8 minutes.
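
To give a flavour of what the RDD style looks like, here is a minimal sketch using PySpark's `pyspark.mllib` package. It is not the code from my repository: the hashed term-frequency features, the feature-space size, the app name and the four training statements are all assumptions chosen for illustration, and it needs a local Spark installation to run.

```python
import re

NUM_FEATURES = 1 << 16  # size of the hashed feature space (an arbitrary choice)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(sc, labelled_tweets):
    """Train MLlib Naive Bayes on (text, label) pairs; 0.0 = negative, 1.0 = positive."""
    # Imported here so tokenize() stays usable on machines without Spark.
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    hashing_tf = HashingTF(NUM_FEATURES)
    # parallelize() turns the local list into an RDD; map() then runs per partition.
    rdd = sc.parallelize(labelled_tweets)
    points = rdd.map(
        lambda pair: LabeledPoint(pair[1], hashing_tf.transform(tokenize(pair[0]))))
    model = NaiveBayes.train(points, 1.0)  # 1.0 is the smoothing parameter
    return model, hashing_tf


def main():
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "sentiment-sketch")  # hypothetical app name
    # Made-up training statements.
    data = [("love this phone", 1.0), ("what a great day", 1.0),
            ("worst service ever", 0.0), ("i hate waiting", 0.0)]
    model, hashing_tf = train_naive_bayes(sc, data)
    # Predict on an RDD so hashing happens on the workers, as it did in training.
    unseen = sc.parallelize(["love this day"]).map(
        lambda text: hashing_tf.transform(tokenize(text)))
    print(model.predict(unseen).collect())
    sc.stop()
```

Calling `main()` runs the whole thing in local mode. The point is that every step is expressed as a transformation on an RDD, which is what lets Spark spread the work across partitions.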


These days, when data sources are practically unlimited, it is very hard to restrict yourself if there are technologies available to handle the size and dimension of the data, and Spark is an excellent fit for Big Data. Machine learning is not only about running the algorithm, but setting that aside, a Big Data platform or technology stack is actually a brilliant resource.

So moving on, I am going to use Spark for text processing. I have one more great option, GraphLab, so once I try that I will compare Spark and GraphLab. My initial research suggests GraphLab is a little bit faster than Spark.

All my code is already on GitHub:

Proper Sentiment Analysis in Spark
Python NLTK Naive Bayes Example
Analysis in YARN Long Running