Data Engineer working with multiple Big Data technologies and Machine Learning: November 2014

Tuesday, November 25, 2014

Use Python for Recommendation Engine

Python For Recommendation Engine

Well, in Page1 we got the data and fetched the associated details for each movie. Alright, now let's roll: just write the algorithm and it's done...

Good morning... it's the real world, and nothing happens that easily here.

Why did I say so?

Var1 Freq

N/A 405

Because I have 405 N/A values in imdbRating. That means no rating is available for 405 of the 1173 movies in my list. That's sick, because it makes about 35% of the dataset unusable.
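A quick way to check the share of missing ratings is sketched below. This is only an illustration: the helper name `missing_share` and the sample list are made up, and only the 405 and 1173 figures come from the post.

```python
from collections import Counter

def missing_share(values, missing_marker="N/A"):
    """Return (number of missing entries, fraction of missing entries)."""
    counts = Counter(values)
    missing = counts[missing_marker]
    return missing, missing / len(values)

# Hypothetical sample, not the real dataset
sample_ratings = ["6.4", "N/A", "7.2", "N/A", "6.9"]
missing, share = missing_share(sample_ratings)
print(missing, round(share, 2))

# The post's own figures: 405 missing ratings out of 1173 movies
print(round(405 / 1173, 3))
```

With the real figures, the share comes out at a little over a third of the dataset.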

So what should I do? Should I simply dump the data? And why did it happen?

I can't dump that much data. It happened because some movies have no details in IMDB, maybe because they are known only in a particular region and not globally, or for some other reason.

What Should I Do ?

I will cleanse and shape my data.

Let's consider a movie named KNHN. I find how many of my FB friends like the movie, let's say 10. Then I compare that number with the most viewed movie in my list, i.e. 3 Idiots, which has 30.

So KNHN reaches about 30% of the top movie's count (10 out of 30 is a third, rounded down), which means its weightage is fairly high and it should naturally sit above the average rating, i.e. 5.0. So I add 30% of the highest rating, 8.4, to 5.0 and end up with roughly 7.5 (just an example).
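The weighting idea can be sketched in a few lines. This is not the code from my repository, just an illustration: the function name `estimate_rating` and the cap at 10.0 are assumptions, while 5.0 and 8.4 are the average and highest ratings from the example.

```python
def estimate_rating(likes, max_likes, base_rating=5.0, top_rating=8.4):
    """Estimate a missing rating from how popular the movie is among friends.

    The share of likes relative to the most popular movie is applied to the
    highest known rating and added on top of the average rating.
    """
    popularity = likes / max_likes
    # Assumption: cap at 10.0 so a very popular movie cannot exceed the scale
    return min(10.0, base_rating + popularity * top_rating)

# KNHN: 10 likes against 30 for the most popular movie
print(round(estimate_rating(10, 30), 2))  # 7.8
```

Using the exact share (a third instead of a rounded 30%) gives a slightly higher estimate, but the idea is the same.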

So are we finished here? Not possible. There are a lot more fields to address, like imdbVotes and tomatoUserReviews. So what should I do here?

Simple: I will use linear regression, or more precisely machine learning (ML), to do this. Bingo! I used the statsmodels library in Python (the sm.ols call below) to run a polynomial regression:

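  # Fragment from inside a method: `index` and `newRating` come from the enclosing loop.
  # Assumed imports: import pandas as pd; import statsmodels.formula.api as sm
  result = sm.ols(
      formula="tomatoUserReviews ~ imdbRating + tomatoRating"
              " + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
      data=df2).fit()
  predicted = result.predict(pd.DataFrame({
      'imdbRating': [newRating], 'tomatoRating': [newRating],
      'genre1': [0], 'genre2': [0], 'genre3': [0]}))
  # predict() returns one value per input row; take the single prediction
  pretomatoRating = int(round(float(predicted[0])))
  # A negative review count is meaningless, so fall back to a fixed 10000
  if pretomatoRating < 0:
      pretomatoRating = 10000
  self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating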

So now my dataset is quite pretty:

	imdbRating	imdbVotes	imdbID	tomatoRating	tomatoUserReviews
1	6.20	89960		6.20	3504356
2	6.40	5108	tt2678948	6.30	152
3	6.80	35	tt0375174	6.30	125
4	7.20	99763	tt0087538	6.90	314496
5	6.90	231169	tt1068680	5.30	316060
6	7.70	291519	tt0289879	4.80	621210
7	6.90	174047	tt1401152	5.80	74879
8	7.20	1038	tt0452580	6.30	253
9	4.90	10381		6.30	6006
10	7.20	28	tt0297197	6.30	6006
11	4.90	10381	tt1278160	6.30	6006
12	4.90	10381		6.30	6006
13	4.90	10381		6.30	6006
14	3.20	690	tt2186731	6.30	862
15	6.20	5594	tt1252596	6.30	2436

So now (almost) all of my movies have ratings and user reviews.

So now I can run my recommendation.

Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to decide which is the next best movie.

For ratings the data is consistent, so I used Euclidean distance; votes are very sparse, so I used cosine distance for those.

   # Assumed import: from scipy.spatial import distance
   def calculatedistance(self, movie1, movie2):
     # First matching row for each title; raises IndexError if a title is missing
     row1 = df2[df2.MOVIE == movie1].iloc[0]
     row2 = df2[df2.MOVIE == movie2].iloc[0]
     rating1 = (float(row1['imdbRating']), float(row1['tomatoRating']))
     rating2 = (float(row2['imdbRating']), float(row2['tomatoRating']))
     # int() replaces Python 2's long(); a non-numeric value such as 'N/A' raises ValueError
     review1 = (int(row1['imdbVotes']), int(row1['tomatoUserReviews']))
     review2 = (int(row2['imdbVotes']), int(row2['tomatoUserReviews']))
     distances = []
     # Ratings share a consistent scale, so Euclidean distance works for them
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # Votes are sparse, so I preferred cosine over Euclidean:
     # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances
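To see why cosine distance suits the vote counts, here is a small standard-library sketch. It is not part of my repository: `cosine_distance` is a hand-written helper using the usual definition (1 minus the cosine of the angle between the vectors), the two vote pairs are rows 4 and 5 of the cleaned table, and `scaled` is a made-up movie with the same vote profile at a hundredth of the volume.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# (imdbVotes, tomatoUserReviews) from rows 4 and 5 of the cleaned table
movie_a = (99763, 314496)
movie_b = (231169, 316060)
# Hypothetical movie: same proportions as movie_a, 1/100 of the volume
scaled = tuple(x / 100 for x in movie_a)

# Euclidean distance is dominated by the raw volume of votes
print(round(math.dist(movie_a, scaled), 2))
# Cosine distance only looks at the proportions, so this is close to zero
print(abs(cosine_distance(movie_a, scaled)) < 1e-9)
print(round(cosine_distance(movie_a, movie_b), 2))
```

So a little-known movie and a blockbuster with the same split of votes come out as similar under cosine distance, while Euclidean distance would push them far apart.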

The Python code has:

User Recommendation - recommendMovie
Distance Comparison between 2 movies - calculatedistance
Distance comparison between multiple movies - compareOneMovieWithMultiple
Distance Comparison Among Movies - findSimilarityBetweenMovies

and a few more.

One sample I have uploaded to the cloud - Distance Calculated between Movies

Entire Source Code - git link

A few more analytics in R will be uploaded.

What More Can be done ?

A lot more. For lots of movies I couldn't extract the genre, so I could get that through text extraction from Twitter and get a better result.

Lots of basic ratings could also be derived from sentiment analysis of the movie name across different crawlers.

The recommendation algorithm can be made more versatile.

Recommendation Engine using Python

What is Recommendation System

Well, if you are reading this you already know what a recommendation is, but basically it is giving a suggestion to the user based on certain parameters. On Amazon, buying a book gets you suggestions for other books; on IMDB, checking any movie gets you recommended movies, and so on.

Now let's think: how is that possible?

Answer - it's science, or I should say mathematics.

Let's play with some normal logic here:

Let's consider the movie 'The Godfather'.

Now if somebody likes it, can we recommend that user any superhit comedy movie? Aah, naah... not possible. Naturally we would like to recommend the next best movie under the genre Crime, Drama or Thriller.

So that's a recommendation.

Hold on!... As a human I can search and find, but how is that possible using a programming language?

So now we've got a point, and I'd like to push you all back into the past, to your school days.

'Euclidean Distance or Manhattan Distance' ..

Do those words seem familiar to you? At first sight it may be like "What d hell is this guy talking about" or "why should I know mathematics here".

Because it's mathematics only :-)

Manhattan distance is the distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. It is the simple sum of the horizontal and vertical components, whereas the diagonal (Euclidean) distance is computed by applying the Pythagorean theorem.

For more details, Google them and learn the actual concept behind them. That's important.
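To make the two distances concrete, here is a tiny sketch with hand-written helpers (the function names are just for illustration), using the classic 3-4-5 triangle:

```python
import math

def manhattan(p, q):
    """Sum of the absolute differences along each axis (grid-line path)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance, via the Pythagorean theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((0, 0), (3, 4)))  # 7
print(euclidean((0, 0), (3, 4)))  # 5.0
```

Walking along the grid costs 3 + 4 = 7, while the crow flies only 5.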

What Exactly I did with R and Python

I used R to do quick analytics on my data and Python to write the main algorithm behind it.

I must say that everything I did in Python could have been done in R as well, but I used Python because I am a little more comfortable in a programming language than in a scientific one, i.e. R. That's my personal opinion, and please don't believe me :)

Concept behind the entire approach

If I have a collection of movies, I can easily extract details about each one, like user ratings, reviews, genre and so on. Once I have those, I can use any distance algorithm, like Euclidean, to find the similarity between any 2 movies.

Now I pick the movies in each of my Facebook friends' lists, compare them with the list of unwatched movies, and sort the unwatched movies by distance. That gives me a list of recommendations running from the best to the worst.
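The ranking step could look roughly like the sketch below. It is a simplification and not the code from my repository: `rank_recommendations` is a hypothetical helper, it only uses the two rating columns, and scoring each unwatched movie by its nearest liked movie is an assumption. The ratings are taken from the sample rows shown in this post; which movie counts as "liked" is made up.

```python
import math

def rank_recommendations(liked, unwatched):
    """Sort unwatched titles by their smallest distance to any liked movie.

    Both arguments map a title to an (imdbRating, tomatoRating) tuple.
    """
    def nearest(features):
        # Distance to the closest liked movie (assumed scoring rule)
        return min(math.dist(features, other) for other in liked.values())

    return sorted(unwatched, key=lambda title: nearest(unwatched[title]))

liked = {"The Karate Kid": (7.2, 6.9)}
unwatched = {
    "Yes Man": (6.9, 5.3),
    "The Butterfly Effect": (7.7, 4.8),
    "Unknown": (6.9, 5.8),
}
print(rank_recommendations(liked, unwatched))
```

The first title in the returned list is the closest match and so the best recommendation under this rule.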

Is it actually that easy?

Explanation based on Programming

First I connected to Facebook using OAuth and REST so that I could get access to my friends and then the movies liked by my friends.

 import facebook  # facebook-sdk package
 # access_token comes from the OAuth step
 graph = facebook.GraphAPI(access_token)
 profile = graph.get_object("me")
 friends = graph.get_connections("me", 'friends')['data']

Then, for each friend, I fetched the movies that friend likes:

 allLikes = graph.get_connections(friend['id'], "movies")['data']
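Putting the two calls together, counting likes per movie could be sketched as below. This is an illustration, not my actual code: `count_movie_likes` is a hypothetical helper, it assumes each returned entry has a 'name' key, it ignores paging, and `FakeGraph` is a stand-in so the example runs without a real access token.

```python
from collections import Counter

def count_movie_likes(graph, friends):
    """Count how many friends like each movie.

    `graph` only needs a get_connections(id, connection_name) method that
    returns a dict with a 'data' list, like the calls shown above.
    """
    likes = Counter()
    for friend in friends:
        for movie in graph.get_connections(friend['id'], "movies")['data']:
            likes[movie['name']] += 1  # assumes each entry has a 'name'
    return likes

class FakeGraph:
    """Stand-in for the real Graph API client, with made-up friends."""
    _likes = {
        "1": [{"name": "3 Idiots"}],
        "2": [{"name": "3 Idiots"}, {"name": "Jilla"}],
    }

    def get_connections(self, id, connection_name):
        return {"data": self._likes[id]}

likes = count_movie_likes(FakeGraph(), [{"id": "1"}, {"id": "2"}])
print(likes.most_common(1))
```

The resulting counter gives the number of friends per movie, which is the kind of count used to find the most popular movie in the list.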

The next job is to get the details for each of the movies, like rating and all the other things, and finally I got the following contents:

	MOVIE	Year	Released	Genre	Director	Poster	imdbRating	imdbVotes	imdbID	tomatoRating	tomatoUserReviews	BoxOffice
2	Jilla	2014	10/01/14	Action, Drama, Thriller	R.T. Neason	http://ia.media-imdb.com/images/M/MV5BOTUxNzExOTA0NF5BMl5BanBnXkFtZTgwMTUzNTAxMjE@._V1_SX300.jpg	6.4	5108	tt2678948	N/A	152	N/A
3	Pursuit of Happyness	2005	16/07/05	Documentary	Patrick McGuinn	http://ia.media-imdb.com/images/M/MV5BMTk4NjQ2NzI5Nl5BMl5BanBnXkFtZTcwOTIzNTM0MQ@@._V1_SX300.jpg	6.8	35	tt0375174	N/A	125	N/A
4	The Karate Kid	1984	22/06/84	Action, Drama, Family	John G. Avildsen	http://ia.media-imdb.com/images/M/MV5BMTkyNjE3MjM2MV5BMl5BanBnXkFtZTYwMzY5ODk4._V1_SX300.jpg	7.2	99763	tt0087538	6.9	314496	N/A
5	Yes Man	2008	19/12/08	Comedy, Romance	Peyton Reed	http://ia.media-imdb.com/images/M/MV5BNjYyOTkyMzg2OV5BMl5BanBnXkFtZTcwODAxNjk3MQ@@._V1_SX300.jpg	6.9	231169	tt1068680	5.3	316060	$97.6M
6	The Butterfly Effect	2004	23/01/04	Sci-Fi, Thriller	Eric Bress, J. Mackye Gruber	http://ia.media-imdb.com/images/M/MV5BMTI1ODkxNzg2N15BMl5BanBnXkFtZTYwMzQ2MTg2._V1_SX300.jpg	7.7	291519	tt0289879	4.8	621210	$57.7M
7	Unknown	2011	18/02/11	Action, Mystery, Thriller	Jaume Collet-Serra	http://ia.media-imdb.com/images/M/MV5BODA4NTk3MTQwN15BMl5BanBnXkFtZTcwNjUwMTMxNA@@._V1_SX300.jpg	6.9	174047	tt1401152	5.8	74879	$63.7M
8	A Year Ago in Winter	2008	06/01/10	Drama	Caroline Link	http://ia.media-imdb.com/images/M/MV5BMTQ4MTUzNTIwM15BMl5BanBnXkFtZTcwMTEzMjA0Mg@@._V1_SX300.jpg	7.2	1038	tt0452580	N/A	253	N/A
10	James Bond 007	1983	N/A	Adventure, Animation, Action	N/A	N/A	7.2	28	tt0297197	N/A	N/A	N/A
14	Department	2012	18/05/12	Action	Ram Gopal Varma	N/A	3.2	690	tt2186731	N/A	862	N/A
15	Ajab Prem Ki Ghazab Kahani	2009	06/11/09	Comedy, Romance	Rajkumar Santoshi	http://ia.media-imdb.com/images/M/MV5BMjA0NjAwNzYxOV5BMl5BanBnXkFtZTcwNzA4NTk5Mw@@._V1_SX300.jpg	6.2	5594	tt1252596	N/A	2436	N/A

The entire code to get the data can be found in facebook_movies_dataset.

Since I now have data, why not run some quick analytics on it and see whether I really did a good job:

So from the graph above, we can easily see that 3 Idiots is the most viewed movie in my FB movies list, and Gaurav Shr.. has liked the largest number of movies, so naturally he doesn't have any work :)

Source Movies_Analytics

So now I have a dataset which gives me some values, and I can work on my recommendation engine. If you have caught your breath, let's move on to the next important thing: what to code and how to shape the data...

'This is my first-ever program in Python, so I can't claim it is very great code'

Part2

Wednesday, November 5, 2014

Quick analysis on new Mobile available in Market and Sentiments using Twitter api

Phone Analysis and Sentiment

I heard a lot about the iPhone 6 and now the Nexus 6, so I finally decided to try my first-ever Twitter analytics test using R. The objective is to find out how people are talking about these phones.

Well, to honor Samsung, I added it too.

A few things to know before starting Twitter analytics using R (not from an expert's point of view):

1. It's very easy, and R will make your life ultra easy. Python is great for this as well, but I will post about that sometime later.

2. For a quick analysis there is no point in chasing Big Data; simply use fewer records and get the job done.

3. The number of records does change the fate of your analysis, so it's always good to run the same thing on a large data volume. Do this step once your Big Data setup is done and you already have experience with Cassandra, Spark, etc.

4. To learn this you just need to know some technology and love technology; then learning R and doing Rstats won't be any trouble.

5. Get some basic hands-on with a few libraries, 'twitteR', 'plyr', 'stringr' and 'ggplot2', for Twitter and statistics, and then roll.

6. If you are good at mathematics, that's wonderful, but you don't need to be the best at it. Basic school mathematics concepts are enough for a starter, at least that's what I felt.

Once I ran the stats:

Great, the iPhone still leads on the positive side, and the Nexus 6, based on this data, is actually not getting that positive vibe. Well then, why did I hear so much about it...

Again, all these stats could change a lot if I tried the same with more data.

The source code, with comments, is already uploaded to my GitHub as RsourceFile; just explore.

Happy Coding..