Monday, December 29, 2014

Why I used Big Data technology (Spark) for Machine Learning (NLP)


I am here to do sentiment analysis of a Twitter dataset while trying to make it a generalized platform, independent of Twitter or any other source, so the dimension and size of the training data are actually large and will keep growing.

As a machine learning developer, I thought of trying Python as the technology, and for the trial I had around 1,600,000 records.

Why not R?

Well, first, I found that R is not the traditional way of coding, so comfort was a question, though making things in R is extremely easy. So that can't be the only reason to try Python.

Python is a nice language, and the NLTK library makes NLP very easy, with loads of examples out there; NLP is very powerful with the NLTK library, along with the scikit-learn library (machine learning algorithms).

It's very easy to code in Python compared to R, in my view.

There is brilliant support for Python in Hadoop Streaming, so I will vote for Python again.

So here I got an opportunity to try something new, and that's always exciting :-)

(I am not saying Python is better than R; I use both of them. In this particular scenario, for me, Python stood out.)

So how to do that ?

Well, sentiment analysis is not just one thing where I get the text, run an algorithm, and there we go. It's never like that. Once you get the data, you need to decide its polarity; I mean, first you need to find whether

any statement is positive or negative
Then create a feature set
Then find stop words
Then bla bla bla ...
Then finally train it
Then test accuracy
Then finally predict it

I mean, it has loads of steps before reaching the goal, so it is in no way an ordinary task, but once you are clear on the concept, it isn't that tough either.

check Text Processing
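To get a concrete feel for those steps, here is a minimal end-to-end NLTK sketch. The four-line corpus, the trimmed stop list, and the `features` helper are my stand-ins for the real Twitter data and preprocessing, not the actual code from the post:

```python
from nltk.classify import NaiveBayesClassifier

# tiny hand-made stand-in for the real labelled Twitter corpus
train_data = [
    ("i love this phone", "positive"),
    ("what a great day", "positive"),
    ("this movie is terrible", "negative"),
    ("i hate waiting", "negative"),
]

STOP_WORDS = {"i", "a", "this", "is", "what"}  # trimmed stop list for the demo

def features(text):
    # bag-of-words presence features, stop words removed
    return {w: True for w in text.lower().split() if w not in STOP_WORDS}

train_set = [(features(text), label) for text, label in train_data]
classifier = NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("love this day")))  # positive
```

NLTK's classifier takes (feature-dict, label) pairs, which is why the polarity and feature-set steps above have to come before training.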

Whats more ?

So I first tried with a small dataset of around 400 records, with 100 test records, and ran NaiveBayes from NLTK:

 classifier = NaiveBayesClassifier.train(X_train)  

and it ran in a very small time, around 10 seconds; then some accuracy and precision tests took another 10-20 seconds, and that was done.
The file size was approx. 500 KB, that's all, and it took 20 to 30 seconds in total.

Cool, that's brilliant: 30 seconds and all set.
But is it enough ? Are we set ?

Then why do I need Bigdata framework ?

On exploring more, I found there is a brilliant sentiment dataset available at sentiment140; the file size is about 250 MB, or precisely 1,600,000 records. That's a pretty decent size for getting a proper sentiment analysis result.

So I was all set and ran the same algorithm with this new dataset. The file contains positive and negative statements collected from Twitter, so I changed the file name and ran the Python script.

Damn!!! My system has quite a good config, and even so: a memory exception. What... can't I run this? Well, no, I can't. I tried several times, and I tried the same on a high-config Windows machine as well as a Mac, and got the same thing: MemoryError.

So I was left with only one choice: try the same thing on a Big Data platform, where at least it must run. I ran the same program in a Hadoop YARN environment with Spark, and it finally ran, but at what cost???
It took more than 14 minutes to finish the job.

So, 14 minutes for 1,600,000 records, but I need to deal with 20 times this size; how on earth am I going to do that?
20 * 14 minutes by simple mathematics, and even that won't happen, because resource utilisation and memory factors will anyway push the time higher and higher.

So is it really a feasible option?

Now Spark fits here .

Spark has a brilliant library, MLlib, which contains almost all the known popular machine learning APIs: Spark Library
As I already said, I am using NaiveBayes, and Spark already has an API for it.

So I needed to change the code, because Spark works on the concept of the RDD, i.e. Resilient Distributed Dataset: RDD Concept

I am surely not going to explain all the changes, but I must admit they were pretty important ones: almost everything had to change to parallelise the process, which is the basic concept behind any Big Data framework, including Spark.

Now I was set. I ran Spark on my local system with YARN (no cluster), and surprisingly it took around 2.4 minutes; I added more extensive operations to the code, and at maximum it went up to 3.8 minutes. Find below -


These days, when data sources are unlimited, it is very hard to restrict yourself when there are technologies available to handle the size and dimension of the data, and Spark is an excellent fit for Big Data. In machine learning it's not only about running the algorithm, but even setting that aside, the Big Data platform and technology stack are a genuinely brilliant resource.

So, moving on, I am going to use Spark for text processing, but I found one more great option, GraphLab; once I try it, I will compare Spark and GraphLab. Initial research showed me GraphLab is a little bit faster than Spark.

All my code is already on GitHub:

Proper Sentiment Analysis in Spark
Python NLTK Naive Bayes Example
Analysis in YARN Long Running

Tuesday, November 25, 2014

Use Python for Recommendation Engine

Python For Recommendation Engine

well, here (Page1) we got the data and fetched the associated details for each record in the dataset. Alright, now let's roll: we'll just write the algorithm and it's done...

Good morning... this is the real world, and nothing happens so easily here.

Why did I say so ?

   Var1 Freq
    N/A  405

Because I have 405 N/A values in imdbRating, it means I have no records available for 405 of the 1173 total movies. That's sick, because then almost 35% of the dataset is invalid.

So what should I do? Should I simply dump the data? And why did it happen?

I can't dump that much data, and it happened because there are movies which have no details in IMDB, maybe because some movies are known only in a particular region, not globally, or for some other reason.

What Should I Do ?

I will cleanse and shape my data.

Let's consider a movie named KNHN; now I will try to find how many FB friends like the movie, let's say 10. I can compare that number to the most-viewed movie in my list, i.e. 3 idiots, with 30.

So now I know that 30% of the highest count liked KNHN as well, so the weightage of this movie is a bit high. Naturally it sits above the average rating, i.e. 5.0, so I added 30% of the highest rating, 8.4, to 5.0, and finally I got close to some value around 7.5 (example).
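As a toy version of that arithmetic (the function name and defaults are mine; 5.0, 8.4, and the 10-of-30 likes come from the example above; the post rounds the like-share to 30%, giving roughly 7.5, while the exact share 10/30 gives 7.8):

```python
def impute_rating(likes, max_likes, avg_rating=5.0, max_rating=8.4):
    """Heuristic imputation: start from the average rating and add the
    movie's like-share (relative to the most-liked movie) times the
    highest rating observed."""
    share = likes / float(max_likes)
    return avg_rating + share * max_rating

# KNHN: 10 likes vs the top movie's 30 -> 5.0 + (10/30) * 8.4 = 7.8
print(round(impute_rating(10, 30), 2))
```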

So are we finished here? Not possible; there are a lot more fields to address, like imdbUserVotes and rottenUserVotes. So what should I do here?

Simple: I will use linear regression, or precisely machine learning, to do this. Bingo!!!! I used the statsmodels library of Python to run a polynomial regression-

  result = sm.ols(
      formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
      data=df2).fit()
  pretomatoRating = result.predict(
      pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating],
                    'genre1': [0], 'genre2': [0], 'genre3': [0]}))
  pretomatoRating = int(round(pretomatoRating))
  if pretomatoRating < 0:
      pretomatoRating = 10000
  self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating

So now my dataset is quite pretty:

imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews
1 6.20 89960 6.20 3504356
2 6.40 5108 tt2678948 6.30 152
3 6.80 35 tt0375174 6.30 125
4 7.20 99763 tt0087538 6.90 314496
5 6.90 231169 tt1068680 5.30 316060
6 7.70 291519 tt0289879 4.80 621210
7 6.90 174047 tt1401152 5.80 74879
8 7.20 1038 tt0452580 6.30 253
9 4.90 10381 6.30 6006
10 7.20 28 tt0297197 6.30 6006
11 4.90 10381 tt1278160 6.30 6006
12 4.90 10381 6.30 6006
13 4.90 10381 6.30 6006
14 3.20 690 tt2186731 6.30 862
15 6.20 5594 tt1252596 6.30 2436

So now all of my movies have ratings and user reviews (almost).

So now I can run my recommendation.

Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to conclude which is the next best movie.

For ratings, as the data is consistent, I used Euclidean distance, but the votes are very sparse, so for them I used cosine distance.

   def calculatedistance(self, movie1, movie2):
     FEATURES = [
       'imdbRating', 'imdbVotes', 'tomatoRating', 'tomatoUserReviews']
     movie1DF = df2[df2.MOVIE == movie1]
     movie2DF = df2[df2.MOVIE == movie2]
     rating1 = (float(movie1DF.iloc[0]['imdbRating']),
                float(movie1DF.iloc[0]['tomatoRating']))
     rating2 = (float(movie2DF.iloc[0]['imdbRating']),
                float(movie2DF.iloc[0]['tomatoRating']))
     review1 = (long(movie1DF.iloc[0]['imdbVotes']),
                long(movie1DF.iloc[0]['tomatoUserReviews']))
     review2 = (long(movie2DF.iloc[0]['imdbVotes']),
                long(movie2DF.iloc[0]['tomatoUserReviews']))
     distances = []
     # ratings are on a consistent scale, so Euclidean distance works
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # votes are sparse, so cosine is preferred over Euclidean here
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances

Python code has -

User Recommendation - recommendMovie
Distance Comparison between 2 movies - calculatedistance
Distance comparison between multiple movies - compareOneMovieWithMultiple
Distance Comparision Among Movies - findSimilarityBetweenMovies

and few more.

One sample I have uploaded to the cloud - Distance Calculated between Movies

Entire Source Code - git link

A few more analytics in R will be uploaded.

What More Can be done ?

A lot more. For lots of movies I couldn't extract the genre, so I can get that using text extraction from Twitter and get a better result.

Lots of basic ratings can also be derived from sentiment analysis of the movie name across different crawlers.

The recommendation algorithm can be made more versatile.

Recommendation Engine using Python

What is Recommendation System

Well, if you are reading this you probably already know what a recommendation is, but basically it's giving a suggestion to the user based on certain parameters: for example, on buying books Amazon gives you suggestions, and on checking any movie IMDB gives you recommended movies, etc.

Now lets think how is that possible?

Answer - its Science or I must say Mathematics.

Let's play with some normal logic here-

Lets consider Movie 'The Godfather'

Now if somebody likes it, can we recommend that user any superhit comedy movie? Aah, naah... not possible. So naturally we would like to recommend the next best movie under the genre Crime, Drama, or Thriller.
So that's a recommendation.

Hold on!... As a human I can search and find, but how is that possible using computing languages?

So now we've got the point, and I'd like to push you all back into the past, to your school days.
'Euclidean Distance or Manhattan Distance' ..

Do those words seem familiar to you? Well, at first sight it may be like "What the hell is this guy talking about?" or "Why should I know mathematics here?"

Because its mathematics only :-)

The distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance might be computed by applying the Pythagorean theorem.
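In code, the two distances are a one-liner each. A minimal sketch (the points (1, 2) and (4, 6) are arbitrary):

```python
from math import sqrt

def manhattan(p, q):
    # grid-path distance: sum of per-axis absolute differences
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # "as the crow flies" distance, via the Pythagorean theorem
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((1, 2), (4, 6)))  # 3 + 4 = 7
print(euclidean((1, 2), (4, 6)))  # sqrt(3**2 + 4**2) = 5.0
```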

For more details , Google and know the actual concept behind them. That's important.

What Exactly I did with R and Python

I used R to do quick analytics on my data and used Python to write the main algorithm behind it.
I must say everything I did in Python could have been done in R as well, but I used Python because I am a little more comfortable in a programming language rather than a scientific one, i.e. R. My personal opinion, so please don't just take my word for it :)

Concept behind the entire approach

If I have a collection of movies, then I can easily extract details about each of the movies, like user ratings, reviews, genre, and all the other things, and if that is possible then I can use any distance algorithm, like Euclidean, to find the similarity between any 2 movies.

Now I will pick the movies in each of my friends' lists from Facebook, compare them with the list of unwatched movies, and sort them by distance; now I have a list of movie recommendations starting with the best and going down to the end.
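That whole approach can be compressed into a few lines. The movie names and the two-feature (imdbRating, tomatoRating) vectors below are made up for illustration; the real code uses more features:

```python
from math import sqrt

# toy (imdbRating, tomatoRating) vectors; values invented for the demo
movies = {
    "The Godfather": (9.2, 9.0),
    "Goodfellas":    (8.7, 8.9),
    "Scarface":      (8.3, 7.4),
    "Yes Man":       (6.9, 5.3),
}

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def recommend(liked, unwatched):
    # nearest unwatched movie first
    return sorted(unwatched, key=lambda m: euclidean(movies[liked], movies[m]))

print(recommend("The Godfather", ["Scarface", "Yes Man", "Goodfellas"]))
# ['Goodfellas', 'Scarface', 'Yes Man']
```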

Is it actually so easy ????

Explanation based on Programming

First I connected to Facebook using OAuth and REST so that I could get access to my friends and then to the movies liked by my friends.

 graph = facebook.GraphAPI(access_token)  
 profile = graph.get_object("me")  
 friends = graph.get_connections("me", 'friends')['data']  

Then, for each friend, I fetched the movies they like:

 allLikes = graph.get_connections(friend['id'], "movies")['data']  

Now the next job is to get the details about each of the movies, like ratings and everything else, and finally I got the following content-

MOVIE Year Released Genre Director Poster imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews BoxOffice
2 Jilla 2014 10/01/14 Action, Drama, Thriller R.T. Neason 6.4 5108 tt2678948 N/A 152 N/A
3 Pursuit of Happyness 2005 16/07/05 Documentary Patrick McGuinn 6.8 35 tt0375174 N/A 125 N/A
4 The Karate Kid 1984 22/06/84 Action, Drama, Family John G. Avildsen 7.2 99763 tt0087538 6.9 314496 N/A
5 Yes Man 2008 19/12/08 Comedy, Romance Peyton Reed 6.9 231169 tt1068680 5.3 316060 $97.6M
6 The Butterfly Effect 2004 23/01/04 Sci-Fi, Thriller Eric Bress, J. Mackye Gruber 7.7 291519 tt0289879 4.8 621210 $57.7M
7 Unknown 2011 18/02/11 Action, Mystery, Thriller Jaume Collet-Serra 6.9 174047 tt1401152 5.8 74879 $63.7M
8 A Year Ago in Winter 2008 06/01/10 Drama Caroline Link 7.2 1038 tt0452580 N/A 253 N/A
10 James Bond 007 1983 N/A Adventure, Animation, Action N/A N/A 7.2 28 tt0297197 N/A N/A N/A
14 Department 2012 18/05/12 Action Ram Gopal Varma N/A 3.2 690 tt2186731 N/A 862 N/A
15 Ajab Prem Ki Ghazab Kahani 2009 06/11/09 Comedy, Romance Rajkumar Santoshi 6.2 5594 tt1252596 N/A 2436 N/A

Entire code to get the data can be found facebook_movies_dataset

Since I now have the data, why not run some quick analytics on it and see whether I really did a good job -

So from the graph above, we can easily see that 3 idiots is the most-viewed movie in my FB movies list, and Gaurav Shr.. has liked the largest number of movies, so naturally he doesn't have any work :)

Source Movies_Analytics

So now I have a dataset which gives me some values, and now I can work on my recommendation engine... if you have caught your breath, let's move on to the next important thing: what to code and how to shape the data...

'This is my first-ever program in Python, so I can't claim it is very great code'


Wednesday, November 5, 2014

Quick analysis on new Mobile available in Market and Sentiments using Twitter api

Phone Analysis and Sentiment

I heard a lot about the iPhone 6 and now the Nexus 6, so I finally decided to try my first-ever Twitter analytics test using R, and the objective is to find out how people are talking about these phones.

Well, to honor Samsung, I added it too.

A few things to know before starting Twitter analytics using R (not from an expert's point of view) -

1. It's very easy, and R will make your life ultra easy. Python is great for this as well, but I will post about that some time later.

2. For quick analytics there is no point in running after Big Data; simply use fewer records and do the job.

3. The number of records does change the fate of your analytics, so it's always good to run the same thing on a large data volume; do that step once your Big Data setup is done and you already have experience working with Cassandra, Spark, etc.

4. To learn this you just need to know some technology, and you must love technology; then learning R and doing #rstats won't be trouble.

5. Use some basic libraries like 'twitteR', 'plyr', 'stringr', and 'ggplot2' for Twitter and statistics; do some basic hands-on work with them and then roll.

6. If you are good at mathematics, that's wonderful, but you need not be the best at it; just basic school mathematics concepts are enough for a starter... at least that's what I felt.

Once I ran the stats-

Great: the iPhone still leads on the positive side, while the Nexus 6, based on the data, is actually not getting that positive a vibe; well, then why did I hear otherwise...

Again, all these stats can change a lot if I try the same with more data.

The source code with comments is already uploaded to my GitHub, RsourceFile; just explore..

Happy Coding..

Monday, October 27, 2014

Some interesting things about Regression


Once we run a linear regression model, we get residuals in the summary, so to understand residuals better, here are some interesting points I accidentally found; they helped me, so I am drafting them...

  • Residuals have mean zero, which means the residuals are balanced among the data points: no pattern, just scatter, with almost equal numbers of positive and negative values.

so if I run the linear regression in R-

fit <- lm(relation ~ person, data = people)

then, to verify the theory, just take the simple mean of the residuals-

mean(fit$residuals) - must give a value very close to zero.

  • There is no correlation between residuals and predictors.

cov(fit$residuals, people$person)


While googling I found a new equation -

  • var(data) = var(estimate) + var(residuals)
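All three properties are easy to verify numerically. A quick check in Python with numpy (the blog's own analysis here is in R; the simulated data below is arbitrary, and the variance identity uses population variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)

# ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

print(np.mean(resid))          # ~ 0: residuals balance out
print(np.cov(resid, x)[0, 1])  # ~ 0: residuals uncorrelated with the predictor
print(np.var(y) - (np.var(fitted) + np.var(resid)))  # ~ 0: variance identity
```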

Least Square

The regression line is the line through the data which has the minimum (least) squared 'error': the vertical distance between the actual value and the prediction made by the line.
Squaring the distances ensures the data points above and below the line are treated the same.

The method of choosing the 'best regression line' (or fitting a line to the data) is known as ordinary least squares.

Thursday, October 16, 2014

Why did I try Residual Plot over my dataset

Why Residual Plot

Well, I previously had sick data on India's population; it was correct, but I altered it and made it worse. Now the population increases linearly by year and then suddenly touches a pinnacle, so ideally this data was never meant for linear regression, but bound by my habit, I ran linear regression on it and found this -

OK, above is the linear line I got, and it is terrible; believe me, because I ran the predictor and got brilliantly bad results :(

For year 1800,2030,2040 I got 
     1         2         3 

-11839.78 824736.40 861109.28 

So it would mean there was no India on the map :O. What? That's not possible; I messed it up...
Well, I did try to make the data work properly, but nothing helped.

So now I knew I needed to transform my data into some other form, so I searched the internet and found a keyword: Residual Plot.

What, again a new concept? Why should I learn this....

A residual is the error between an actual value of the dependent variable and the predicted value. So, leaving all these mind-blowing keywords aside, I finally worked out that it's a way to find out whether a model is a 'good fit' or not.

There are 2 very basic and easy things to remember about residual plots-
 1. The residuals for a 'good' regression model are normally distributed, and random.

 2. The residuals for a 'bad' regression model are non-normal, and have a distinct, non-random pattern.

So from the above, we can see a sure-shot case of bad data and a bad model, and I know for sure this model is bad, as it definitely shows a pattern: a superb pattern of growth....

More? By chance, I need more-

If I've come this far, I must draw a conclusion by working through an example where the data fits the model well; let's see how the residuals look-

Following Sample data -

x <- runif(100,-3,3)

y <- x+ sin(x) + rnorm(100,sd =.2)

and I got -

A good one, isn't it? But let's not be in a hurry; let's look at the residual plot-

Now I can see a pattern, a sine wave, ahh... so while the scatter plot says the model looks good, it is not. So don't just go by the scatter plot or the model; there may be trouble inside, and there is no harm in running the residual plot.

Monday, September 22, 2014

Linear Regression, what is it and when should I use it - Machine Learning

Linear Regression or Regression with Multiple Covariates

Believe me these are extremely easy to understand and R-programming has already these algorithms implemented , you just need to know how to use them :)

Let's consider we have values X and Y. In simple words, linear regression is a way to model a relationship between X and Y; that's all :-). Now, if we have X1...Xn and Y, then the relationship between them is multiple linear regression.

Linear regression is a very widely used machine learning algorithm, because models which depend linearly on their unknown parameters are easier to fit.

Uses of Linear Regression ~

  • Prediction-analysis kinds of applications can be done using linear regression; precisely, after developing a linear regression model, for any new value of X we can predict the value of Y (based on the model developed with a previous set of data).

  • For a given Y, if we are provided with multiple X, like X1...Xn, then this technique can be used to find the relationship of each X with Y, so we can find the weakest relationship with Y as well as the best one.

The reason I did all the theory above is so that I could remember the basics; the rest is all easy :).


So now I'd like to do an example in R, and the best resource I could find was population.
Talking of population, how could I miss India? So somehow I managed to get a dataset-

Above is just a snapshot of the data; I had data from 1700 till 2014, and yes, some missing data in between.

In R, the caret package already has an implementation of regression, so load it; for plotting I am using ggplot.

The bottom line after getting the data is to do some exploratory analysis; well, I have 2 fields and no time :), so just a quick plot-

Looking great: it's growing... growing... and... so it's real data.
So first things first, split the data into 2 parts, training and testing

 allTrainData <- createDataPartition(y=data$population,p=0.7,list=FALSE)  
 training <- data[allTrainData,]  
 testing <- data[-allTrainData,]  

So now I have X and Y; simply put, I want to find the population based on the year, or vice versa.

Don't worry: R's caret package already brings an implementation of the linear regression algorithm.
For the formula behind it, please check my other blog about the details of linear modelling, but here -

 model <- train(population~.,method="lm",data=training)  
 finalModel <- model$finalModel  

1 line, that's all: method="lm". Isn't it extraordinary :)
So the summary here-

 lm(formula = .outcome ~ ., data = dat)  
  
 Residuals:  
      Min       1Q   Median       3Q      Max  
  -186364  -164118   -83667   106876   811176  
  
 Coefficients:  
              Estimate Std. Error t value Pr(>|t|)  
 (Intercept) -6516888     668533  -9.748 4.69e-16 ***  
 year            3616        346  10.451  < 2e-16 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
  
 Residual standard error: 216300 on 97 degrees of freedom  
 Multiple R-squared: 0.5296,    Adjusted R-squared: 0.5248  
 F-statistic: 109.2 on 1 and 97 DF, p-value: < 2.2e-16  

Details about summary linear Regression Model Summary

So now lets plot the fitted vs residual graph and see how well the model worked.

Somewhat weird :), but at least the line goes through almost all the data.

Now, how well did the model work?

Well, my data is weird anyway, so seriously, it worked pretty well; believe me :)

So now, with the model built, we should try the testing dataset, and that's straightforward as well -

 pred <- predict(model,testing)  

disclaimer** - I am not a PhD holder or a data scientist; what I do is out of self-interest and learning... so it may contain some serious mistakes :)

Monday, September 15, 2014

When should you consider Data Mining or Web Scraping in the Big Data ecosystem

Data Mining / Web Scraping on a Big Data Platform

What is Web Scraping?

The web has lots and lots of data, and we sometimes need to research some specific set of data, but we can't assume all the data makes sense.
For example-
 <div class="island summary">   
  <ul class="iconed-list">   
  <li class="biz-hours iconed-list-item">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-clock-open-biz_details"></i>   
   <div class="iconed-list-story">   
   <span> Today <span class="hour-range"><span class="nowrap">10:00 am</span> - <span class="nowrap">10:00 pm</span></span> </span>   
   <span class="nowrap extra open">Open now</span>   
   </div> </li>   
  <li class="iconed-list-item claim-business">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-suitcase-red-star-biz_details"></i>   
   <div class="iconed-list-story">   
   <a href=""> <b>Work here?</b> Claim this business </a>   
   </div> </li>   

Above is just some markup, and it has loads of things in it; if you give me this set, I'll be like, what the hell is this?
But if you concentrate on the data, you can derive a few important pieces of info: there is a time written, 10:00 am, and some 'Open now' text, so we can assume some object opens at 10:00 am.

So you read a URL and may get lots of data, useful or useless, but there may be some data which does make sense, and extracting that data is what web scraping is.

Now data mining is to dig, dig, and keep digging through that data to extract information, which is sometimes like a puzzle.
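A sketch of pulling those two facts out of the markup above (the regex is simplistic and tuned to this snippet only; a real scraper would use a proper HTML parser):

```python
import re

html = '''<span> Today <span class="hour-range"><span class="nowrap">10:00 am</span> - <span class="nowrap">10:00 pm</span></span> </span>
<span class="nowrap extra open">Open now</span>'''

# grab every hh:mm am/pm wrapped in a "nowrap" span
hours = re.findall(r'class="nowrap">(\d{1,2}:\d{2} [ap]m)</span>', html)
is_open = 'Open now' in html

print(hours)    # ['10:00 am', '10:00 pm']
print(is_open)  # True
```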

Technologies Involved

Since I use Java, I tried Apache Nutch, which is a scalable web crawler: just extract data with Nutch and dump it into Apache Solr for fast indexing.

Apache Nutch -
Apache Solr -

Both do the job quite well, but still, why Big Data????

Why Big Data in Web Scraping

Now consider a scenario where I need to match the GPS coordinates of a particular destination across all the YellowPage websites.
Using a simple DFS crawl, we can assume we can collect some 10,000+ websites that may hold the record.
Now assume we have some extremely untidy data and we are trying to find a GPS coordinate in it.

Logically, since we need to extract a pattern from all these websites to get the GPS coordinates (and then match them), we may need to run some very complex regexes, and that will surely drain the entire memory.

So now, just thinking logically, what if we run the entire process in parallel...

Bingo!!!!!!!!!!!!!!!!!  Big Data comes into the picture now.

I will first write a script to pull all the data from all the URLs and dump it into HDFS, and then I will run the regex over Spark to get a quick result.
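For illustration, the per-page extraction could be a regex like the one below (a naive decimal-degrees pattern I am assuming for the demo); in the plan above, this function would run inside Spark's map over the crawled pages:

```python
import re

# naive "lat, lon" decimal-degrees pattern; real pages need sturdier parsing
GPS = re.compile(r'(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)')

def extract_gps(page_text):
    # return the first coordinate pair found on the page, if any
    m = GPS.search(page_text)
    return m.groups() if m else None

page = "Contact us | GPS: 51.5238, -0.1586 | Open daily"
print(extract_gps(page))  # ('51.5238', '-0.1586')
```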

So, to conclude: if we have an incremental process and the scraping depth is high, then it's really helpful to use a Big Data setup to do the job.