Tuesday, November 25, 2014

Use Python for Recommendation Engine



Python For Recommendation Engine

Well, in Page1 we got the data and fetched the associated details for each movie in the dataset. Alright, now let's roll: just write the algorithm and it's done...

Good morning... this is the real world, and nothing happens so easily here.

Why did I say so ?

   Var1 Freq
    N/A  405

Because I have 405 N/A values in imdbRating. That means no rating is available for 405 of the 1173 movies in the list. That's sick, because it makes about 35% of the dataset unusable.

So what should I do? Should I simply dump those records? And why did it happen?

I can't dump that much data. It happened because some movies have no details on IMDb, maybe because they are known only in a particular region and not globally, or for some other reason.


What Should I Do ?

I will cleanse and shape my data.

Let's consider a movie named KNHN. First I find how many of my FB friends like the movie, let's say 10. Then I compare that number with the movie that has the most viewers in my list, i.e. 3 Idiots, which has 30.

So KNHN got about a third (10/30 ≈ 33%) of the top movie's count, which means its weightage is a bit high and it naturally sits above the average rating, i.e. 5.0. So I add that share of the highest rating, 8.4, to 5.0 (5.0 + 0.33 × 8.4 ≈ 7.8) and end up with a value somewhere between 7 and 8 (just an example).
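The KNHN arithmetic above can be written as a tiny function. This is only a sketch of the idea: the name `estimate_rating`, the default values (taken from the example) and the cap at the top rating are my assumptions, not part of the original code.

```python
def estimate_rating(likes, max_likes, base_rating=5.0, top_rating=8.4):
    """Estimate a missing rating from the movie's share of likes.

    base_rating and top_rating default to the example values from the post.
    """
    if max_likes <= 0:
        raise ValueError("max_likes must be positive")
    share = likes / max_likes
    # Assumption: cap the result so it never exceeds the best known rating.
    return min(base_rating + share * top_rating, top_rating)


# KNHN: 10 friends like it, against 30 for the top movie.
knhn = estimate_rating(10, 30)
print(round(knhn, 1))  # 7.8
```

Without the cap, a movie as popular as the top one would get 5.0 + 8.4 = 13.4, which is off the rating scale, so some kind of upper bound is needed.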

So are we finished here? Not possible, there are a lot more fields to address, like imdbVotes and tomatoUserReviews. So what should I do here?

Simple: I will use linear regression, or more precisely machine learning (ML), to do this. Bingo!!!! I used the statsmodels library in Python to run a polynomial regression -

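  import pandas as pd
  import statsmodels.formula.api as sm

  # Fit tomatoUserReviews against the ratings and the genre terms
  result = sm.ols(
      formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
      data=df2).fit()
  # Predict the missing value for one row; index and newRating are set by the surrounding code
  predicted = result.predict(
      pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating],
                    'genre1': [0], 'genre2': [0], 'genre3': [0]}))
  pretomatoRating = int(round(predicted.iloc[0]))
  if pretomatoRating < 0:
      pretomatoRating = 10000  # a negative review count makes no sense, so use a fixed fallback
  self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating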


So now my dataset is quite pretty -




    imdbRating  imdbVotes  imdbID     tomatoRating  tomatoUserReviews
1   6.20        89960                 6.20          3504356
2   6.40        5108       tt2678948  6.30          152
3   6.80        35         tt0375174  6.30          125
4   7.20        99763      tt0087538  6.90          314496
5   6.90        231169     tt1068680  5.30          316060
6   7.70        291519     tt0289879  4.80          621210
7   6.90        174047     tt1401152  5.80          74879
8   7.20        1038       tt0452580  6.30          253
9   4.90        10381                 6.30          6006
10  7.20        28         tt0297197  6.30          6006
11  4.90        10381      tt1278160  6.30          6006
12  4.90        10381                 6.30          6006
13  4.90        10381                 6.30          6006
14  3.20        690        tt2186731  6.30          862
15  6.20        5594       tt1252596  6.30          2436

So now all of my movies have ratings and user reviews (almost).

So now I can run my recommendation.




Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to work out which is the next best movie.

For ratings, since the data is consistent, I used Euclidean distance; but votes are very sparse, so I used cosine distance for them.

   from scipy.spatial import distance

   def calculatedistance(self, movie1, movie2):
     # Columns that feed the two distance measures
     FEATURES = [
       'imdbRating', 'imdbVotes', 'tomatoRating', 'tomatoUserReviews']
     movie1DF = df2[df2.MOVIE == movie1]
     movie2DF = df2[df2.MOVIE == movie2]
     # float() and int() raise ValueError if a field still holds 'N/A'
     rating1 = (float(movie1DF.iloc[0]['imdbRating']), float(
       movie1DF.iloc[0]['tomatoRating']))
     rating2 = (float(movie2DF.iloc[0]['imdbRating']), float(
       movie2DF.iloc[0]['tomatoRating']))
     # Python 3 has no long type; int handles arbitrarily large counts
     review1 = (int(movie1DF.iloc[0]['imdbVotes']), int(
       movie1DF.iloc[0]['tomatoUserReviews']))
     review2 = (int(movie2DF.iloc[0]['imdbVotes']), int(
       movie2DF.iloc[0]['tomatoUserReviews']))
     distances = []
     # Ratings are consistent, so Euclidean distance is used for them
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # Since votes have sparse data, I preferred cosine over Euclidean:
     # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances
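
To see how differently the two measures treat vote counts, here is a small sketch. The helpers `euclidean` and `cosine_distance` are plain-Python stand-ins (hypothetical names) that compute the same quantities as `distance.euclidean` and `distance.cosine` from SciPy, and the vote counts are made up for illustration.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# (imdbVotes, tomatoUserReviews) for a niche movie and a blockbuster (made-up numbers)
small = (100, 300)
big = (100000, 300000)

print(euclidean(small, big))        # very large: dominated by the raw counts
print(cosine_distance(small, big))  # close to 0: same proportion of votes to reviews
```

Cosine distance only looks at the direction of the vectors, so two movies with the same ratio of IMDb votes to Rotten Tomatoes reviews count as close even when one has a thousand times more votes. Euclidean distance on the raw counts would push them far apart.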


The Python code has -

User recommendation - recommendMovie
Distance comparison between 2 movies - calculatedistance
Distance comparison between multiple movies - compareOneMovieWithMultiple
Distance comparison among movies - findSimilarityBetweenMovies

and a few more.

I have uploaded one sample to the cloud - Distance Calculated between Movies

Entire Source Code - git link


A few more analytics in R will be uploaded.



What More Can be done ?

A lot more. For lots of movies I couldn't extract the genre, so I could get that via text extraction from Twitter and improve the result.

Lots of basic ratings could also be derived from sentiment analysis of the movie name across different crawled sources.

The recommendation algorithm can be made more versatile.












Recommendation Engine using Python

What is a Recommendation System?


Well, if you are reading this you already know what a recommendation is, but basically it is a suggestion given to the user based on certain parameters. On Amazon, buying a book gets you suggestions; on IMDb, checking any movie gets you recommended movies, and so on.



Now let's think: how is that possible?

Answer - it's science, or I should say mathematics.

Let's play with some normal logic here -

Let's consider the movie 'The Godfather'.

Now if somebody likes it, can we recommend that user any superhit comedy movie? Aah, naah... not possible. So naturally we would like to recommend the next best movie under the genres Crime, Drama or Thriller.
So that's a recommendation.

Hold on!... As a human I can search and find, but how is that possible using a programming language?

So now we have a point, and I'd like to push you all back to the past, to your school days.
'Euclidean Distance or Manhattan Distance' ..

Do those words seem familiar to you? Well, at first sight your reaction may be "What d hell is this guy talking about?" or "Why should I know mathematics here?"

Because it's mathematics only :-)

Manhattan distance is the distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal (Euclidean) distance is computed by applying the Pythagorean theorem.

For more details, Google them and learn the actual concept behind them. That's important.
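
A quick worked example of the two distances, as a minimal sketch in plain Python (the function names and the two points are just for illustration):

```python
import math


def manhattan(p, q):
    """Sum of the absolute differences along each axis (grid-line path)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance via the Pythagorean theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


a, b = (1, 2), (4, 6)
print(manhattan(a, b))  # 7   (3 across + 4 up)
print(euclidean(a, b))  # 5.0 (the diagonal of a 3-by-4 rectangle)
```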


What Exactly I did with R and Python

I used R to do quick analytics on my data and used Python to write the main algorithm behind it.
I must say that everything I did in Python could have been done in R as well, but I used Python because I am a little more comfortable in a general programming language than in a scientific one, i.e. R. That's my personal opinion, so please don't take my word for it :)


Concept behind the entire approach

If I have a collection of movies, I can easily extract details about each of them, like user ratings, reviews, genre and so on. Once I have that, I can use any distance algorithm, like Euclidean, to find the similarity between any 2 movies.

Now I pick the movies in each of my friends' lists from Facebook, compare them with the list of unwatched movies, and sort those by distance. That gives me a list of movie recommendations, starting with the best and running to the end.
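
The compare-and-sort step could look roughly like this. It is a sketch under my own assumptions: the titles and ratings are placeholders, `recommend` is a hypothetical name, and scoring each unwatched movie by its distance to the closest liked movie is just one possible choice.

```python
import math

# Hypothetical (imdbRating, tomatoRating) pairs; titles and numbers are placeholders.
liked = {"Liked A": (8.0, 7.5), "Liked B": (7.6, 7.0)}
unwatched = {
    "Candidate X": (7.9, 7.4),
    "Candidate Y": (5.0, 4.2),
    "Candidate Z": (7.0, 6.5),
}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def recommend(liked, unwatched):
    """Return unwatched titles sorted by distance to the closest liked movie."""
    scored = []
    for title, features in unwatched.items():
        nearest = min(euclidean(features, f) for f in liked.values())
        scored.append((nearest, title))
    # Smallest distance first, so the best recommendation heads the list.
    return [title for _, title in sorted(scored)]


ranking = recommend(liked, unwatched)
print(ranking)  # ['Candidate X', 'Candidate Z', 'Candidate Y']
```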


Is it actually so easy ????


Explanation based on Programming

First I connected to Facebook using OAuth and REST so that I could get access to my friends and then to the movies liked by my friends.


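 import facebook  # from the facebook-sdk package

 # access_token is an OAuth token obtained beforehand
 graph = facebook.GraphAPI(access_token)
 profile = graph.get_object("me")
 friends = graph.get_connections("me", 'friends')['data']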

Then I fetched the movies liked by each friend -



 allLikes = graph.get_connections(friend['id'], "movies")['data']  


The next job is to get the details for each of the movies, like rating and all the other things, and finally I got the following contents -




# | MOVIE | Year | Released | Genre | Director | Poster | imdbRating | imdbVotes | imdbID | tomatoRating | tomatoUserReviews | BoxOffice
2 | Jilla | 2014 | 10/01/14 | Action, Drama, Thriller | R.T. Neason | http://ia.media-imdb.com/images/M/MV5BOTUxNzExOTA0NF5BMl5BanBnXkFtZTgwMTUzNTAxMjE@._V1_SX300.jpg | 6.4 | 5108 | tt2678948 | N/A | 152 | N/A
3 | Pursuit of Happyness | 2005 | 16/07/05 | Documentary | Patrick McGuinn | http://ia.media-imdb.com/images/M/MV5BMTk4NjQ2NzI5Nl5BMl5BanBnXkFtZTcwOTIzNTM0MQ@@._V1_SX300.jpg | 6.8 | 35 | tt0375174 | N/A | 125 | N/A
4 | The Karate Kid | 1984 | 22/06/84 | Action, Drama, Family | John G. Avildsen | http://ia.media-imdb.com/images/M/MV5BMTkyNjE3MjM2MV5BMl5BanBnXkFtZTYwMzY5ODk4._V1_SX300.jpg | 7.2 | 99763 | tt0087538 | 6.9 | 314496 | N/A
5 | Yes Man | 2008 | 19/12/08 | Comedy, Romance | Peyton Reed | http://ia.media-imdb.com/images/M/MV5BNjYyOTkyMzg2OV5BMl5BanBnXkFtZTcwODAxNjk3MQ@@._V1_SX300.jpg | 6.9 | 231169 | tt1068680 | 5.3 | 316060 | $97.6M
6 | The Butterfly Effect | 2004 | 23/01/04 | Sci-Fi, Thriller | Eric Bress, J. Mackye Gruber | http://ia.media-imdb.com/images/M/MV5BMTI1ODkxNzg2N15BMl5BanBnXkFtZTYwMzQ2MTg2._V1_SX300.jpg | 7.7 | 291519 | tt0289879 | 4.8 | 621210 | $57.7M
7 | Unknown | 2011 | 18/02/11 | Action, Mystery, Thriller | Jaume Collet-Serra | http://ia.media-imdb.com/images/M/MV5BODA4NTk3MTQwN15BMl5BanBnXkFtZTcwNjUwMTMxNA@@._V1_SX300.jpg | 6.9 | 174047 | tt1401152 | 5.8 | 74879 | $63.7M
8 | A Year Ago in Winter | 2008 | 06/01/10 | Drama | Caroline Link | http://ia.media-imdb.com/images/M/MV5BMTQ4MTUzNTIwM15BMl5BanBnXkFtZTcwMTEzMjA0Mg@@._V1_SX300.jpg | 7.2 | 1038 | tt0452580 | N/A | 253 | N/A
10 | James Bond 007 | 1983 | N/A | Adventure, Animation, Action | N/A | N/A | 7.2 | 28 | tt0297197 | N/A | N/A | N/A
14 | Department | 2012 | 18/05/12 | Action | Ram Gopal Varma | N/A | 3.2 | 690 | tt2186731 | N/A | 862 | N/A
15 | Ajab Prem Ki Ghazab Kahani | 2009 | 06/11/09 | Comedy, Romance | Rajkumar Santoshi | http://ia.media-imdb.com/images/M/MV5BMjA0NjAwNzYxOV5BMl5BanBnXkFtZTcwNzA4NTk5Mw@@._V1_SX300.jpg | 6.2 | 5594 | tt1252596 | N/A | 2436 | N/A



The entire code to get the data can be found at facebook_movies_dataset

Now that I have the data, why not run some quick analytics on it and see whether I really did a good job -




So from the graph above, we can easily see that 3 Idiots is the most viewed movie in my FB movies list and Gaurav Shr.. has liked the largest number of movies, so naturally he doesn't have any work :)


Source Movies_Analytics


So now I have a dataset which gives me some values, and I can work on my recommendation engine... if you have caught your breath, let's move on to the next important thing: what to code and how to shape the data...

'This is my first-ever program in Python, so I can't claim it is very great code'

Part2





Wednesday, November 5, 2014

Quick analysis of new mobiles available in the market and their sentiment using the Twitter API


Phone Analysis and Sentiment

I heard lots about the iPhone 6 and now the Nexus 6, so I finally decided to try my first-ever Twitter analytics test using R. The objective is to find out how people are talking about these phones.

Well, to honor Samsung, I added it too.

A few things to know before starting Twitter analytics using R (not from an expert's point of view) -

1. It's very easy, and R will make your life ultra easy. Python is great for this as well, but I will post about that sometime later.

2. For quick analytics there is no point in running after Big Data; simply use fewer records and do the job.

3. The number of records does change the fate of your analytics, so it's always good to run the same analysis on a large data volume. Do this step once your Big Data setup is done and you already have experience working with Cassandra, Spark etc.

4. To learn this you just need to know some technology and love technology; then learning R and doing Rstats won't be any trouble.

5. Get some basic hands-on practice with a few basic libraries, like 'twitteR', 'plyr', 'stringr' and 'ggplot2' for Twitter and statistics, and then roll.

6. If you are good at mathematics, that's wonderful, but you need not be the best at it; basic school mathematics concepts are enough for a starter.. at least that's what I felt.

Once I ran the stats-



Great, the iPhone still leads on the +ve side, and the Nexus 6, based on the data, is actually not getting that positive vibe. Well then, why did I hear so much about it...

Again, all these stats can change a lot if I try the same with more data.


The source code with comments is already uploaded to my GitHub, RsourceFile; just explore..

Happy Coding..