Data Engineer working with multiple Big Data technologies and Machine Learning: Use Python for Recommendation Engine

Python For Recommendation Engine

Well, in Page1 we got the data and fetched the associated details for each movie in the dataset. Alright, now let's roll: I will just write the algorithm and it's done...

Good morning... this is the real world, and nothing happens that easily here.

Why did I say so?

Var1 Freq

N/A 405

Because I have 405 NA values in imdbRating. That means I have no rating available for 405 of the 1173 movies in the list. That's sick, because then about 35% of the dataset is invalid.
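As a quick sanity check on that share, here is a minimal sketch of counting the missing ratings. The list is synthetic: only the two counts (405 and 1173) come from the post, while the 6.5 placeholder rating and the helper name `missing_share` are assumptions for illustration.

```python
from collections import Counter

# Synthetic stand-in for the imdbRating column: only the counts are real,
# the 6.5 placeholder value is made up for illustration.
imdb_ratings = ["N/A"] * 405 + ["6.5"] * (1173 - 405)


def missing_share(values, marker="N/A"):
    """Return (missing count, fraction of values equal to the marker)."""
    counts = Counter(values)
    return counts[marker], counts[marker] / len(values)


missing, share = missing_share(imdb_ratings)
print(missing, round(share * 100, 1))  # 405 34.5
```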

So what should I do? Should I simply dump the data? And why did it happen?

I can't dump that much data. It happened because some movies have no details on IMDB, maybe because they are known only in a particular region and not globally, or for some other reason.

What Should I Do?

I will cleanse and shape my data.

Let's consider a movie named KNHN. I try to find how many of my FB friends like the movie, let's say 10. Now I can compare that number to the movie with the most likes in my list, i.e. 3 Idiots, which has 30.

So I know that about a third of that highest count (10 of 30) liked KNHN as well, so this movie deserves a fairly high weight. Its rating should naturally sit above the average rating of 5.0, so I add that share of the highest rating, 8.4, to 5.0 and end up with a value in the 7 to 8 range (the numbers are only an example).
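The arithmetic in that example can be sketched as a small function. This is only my reading of the heuristic, not the code from the repository: the name `estimate_rating` and the cap at 10.0 are assumptions, and the defaults 5.0 and 8.4 are the average and highest ratings quoted in the example.

```python
def estimate_rating(likes, max_likes, avg_rating=5.0, max_rating=8.4, cap=10.0):
    """Hypothetical helper: lift the average rating by the movie's share of likes."""
    if max_likes <= 0:
        return avg_rating  # no popularity signal, fall back to the average
    share = likes / max_likes
    # The cap is an assumption so the estimate stays on a 0-10 scale.
    return min(cap, avg_rating + share * max_rating)


print(round(estimate_rating(10, 30), 2))  # 7.8
```

With 10 of 30 likes the share is exactly one third, so this version gives 7.8; a movie with no likes stays at the average of 5.0.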

So are we finished here? Not quite. There are more fields to address, such as imdbUserVotes and rottenUserVotes. So what should I do here?

Simple: I will use linear regression, or more precisely machine learning (ML), to do this. Bingo! I used Python's statsmodels formula API (the sm.ols call in the snippet) to run a polynomial regression:

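         # Assumed imports: import pandas as pd; import statsmodels.formula.api as sm
         # Fit tomatoUserReviews on both ratings plus the genre terms.
         result = sm.ols(
           formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
           data=df2).fit()
         # Predict the review count for the row that is being filled in.
         pretomatoRating = result.predict(
           pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating], 'genre1': [0], 'genre2': [0], 'genre3': [0]}))
         # predict() returns a one-element result; take the scalar before rounding.
         pretomatoRating = int(round(float(pretomatoRating[0])))
         # A negative review count makes no sense, so fall back to a fixed default.
         if pretomatoRating < 0:
           pretomatoRating = 10000
         self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating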
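To show the fill-in idea on its own, here is a minimal, dependency-free sketch: fit a line on the rows where the count is known, then predict the missing ones. It swaps the multi-variable formula for a single predictor, the rows are made up, and `fit_line` is a hypothetical helper. Clamping negatives to 0 is also an assumption of this sketch (the snippet in the post falls back to 10000 instead).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up rows: (imdbRating, tomatoUserReviews); None marks a missing count.
rows = [(5.0, 100), (6.0, 200), (7.0, 300), (8.0, None)]

# Fit only on the complete rows.
known = [(x, y) for x, y in rows if y is not None]
intercept, slope = fit_line([x for x, _ in known], [y for _, y in known])

filled = []
for x, y in rows:
    if y is None:
        # Predict the missing count; negative predictions are clamped to 0 here.
        y = max(0, int(round(intercept + slope * x)))
    filled.append((x, y))

print(filled[-1])  # (8.0, 400)
```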

So now my dataset is quite pretty:

	imdbRating	imdbVotes	imdbID	tomatoRating	tomatoUserReviews
1	6.20	89960		6.20	3504356
2	6.40	5108	tt2678948	6.30	152
3	6.80	35	tt0375174	6.30	125
4	7.20	99763	tt0087538	6.90	314496
5	6.90	231169	tt1068680	5.30	316060
6	7.70	291519	tt0289879	4.80	621210
7	6.90	174047	tt1401152	5.80	74879
8	7.20	1038	tt0452580	6.30	253
9	4.90	10381		6.30	6006
10	7.20	28	tt0297197	6.30	6006
11	4.90	10381	tt1278160	6.30	6006
12	4.90	10381		6.30	6006
13	4.90	10381		6.30	6006
14	3.20	690	tt2186731	6.30	862
15	6.20	5594	tt1252596	6.30	2436

So now (almost) all of my movies have ratings and user reviews.

So now I can run my recommendation.

Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to work out which movie is the next best one.

For the ratings, since the data is consistent, I used Euclidean distance. The votes are very sparse, so for them I used cosine distance.

   # Assumed import: from scipy.spatial import distance
   def calculatedistance(self, movie1, movie2):
     # Take the first row that matches each movie title.
     movie1DF = df2[df2.MOVIE == movie1]
     movie2DF = df2[df2.MOVIE == movie2]
     rating1 = (float(movie1DF.iloc[0]['imdbRating']), float(
       movie1DF.iloc[0]['tomatoRating']))
     rating2 = (float(movie2DF.iloc[0]['imdbRating']), float(
       movie2DF.iloc[0]['tomatoRating']))
     # int() replaces the Python 2-only long(); it raises ValueError
     # if a value is still missing (NaN).
     review1 = (int(movie1DF.iloc[0]['imdbVotes']), int(
       movie1DF.iloc[0]['tomatoUserReviews']))
     review2 = (int(movie2DF.iloc[0]['imdbVotes']), int(
       movie2DF.iloc[0]['tomatoUserReviews']))
     distances = []
     # Ratings are consistent, so Euclidean distance is fine here.
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # Votes are sparse, so I preferred cosine over Euclidean:
     # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances
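To make the two measures concrete, here is a small standalone sketch of both, written with only the standard library so the formulas are visible. The helper names are mine, not from the repository. The sample values are rows 2 and 3 of the dataset table above.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Rows 2 and 3 of the table: (imdbRating, tomatoRating) and
# (imdbVotes, tomatoUserReviews).
ratings_a, ratings_b = (6.40, 6.30), (6.80, 6.30)
votes_a, votes_b = (5108, 152), (35, 125)

print(round(euclidean(ratings_a, ratings_b), 2))  # 0.4
print(round(cosine_distance(votes_a, votes_b), 2))
```

Cosine distance looks only at the direction of the vote vectors, not their size, which is why a movie with thousands of votes can still be compared with one that has a few dozen.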

The Python code has:

User Recommendation - recommendMovie
Distance Comparison between 2 movies - calculatedistance
Distance Comparison between multiple movies - compareOneMovieWithMultiple
Distance Comparison Among Movies - findSimilarityBetweenMovies

and a few more.
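As a rough idea of how a one-against-many comparison could turn pairwise distances into a ranking, here is a hypothetical sketch. It is not the code from the repository: `rank_candidates`, the fake distance table, and the choice to simply add the rating distance and the vote distance are all assumptions.

```python
def rank_candidates(target, candidates, distance_fn):
    """Sort candidate titles by the sum of their distances to the target."""
    scored = [(sum(distance_fn(target, c)), c) for c in candidates if c != target]
    return [title for _, title in sorted(scored)]


# Made-up [rating distance, vote distance] pairs, for illustration only.
FAKE_DISTANCES = {
    ("A", "B"): [0.4, 0.70],
    ("A", "C"): [0.1, 0.05],
    ("A", "D"): [2.0, 0.30],
}


def fake_distance(movie1, movie2):
    """Stand-in for a pairwise distance function such as calculatedistance."""
    return FAKE_DISTANCES[(movie1, movie2)]


print(rank_candidates("A", ["B", "C", "D"], fake_distance))  # ['C', 'B', 'D']
```

Adding the two distances treats them as equally important even though they sit on different scales, so a real version would probably weight or normalise them first.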

I have uploaded one sample to the cloud - Distance Calculated between Movies

Entire Source Code - git link

A few more analytics in R will be uploaded.

What More Can Be Done?

A lot more. For many movies I couldn't extract the genre, so I could get it through text extraction from Twitter and improve the result.

Many of the basic ratings could also be derived from sentiment analysis of the movie name across different crawlers.

The recommendation algorithm can be made more versatile.

1 comment:

Boro Dega said...: Great blog. I have a question

1: Your first model states:
model <- lm(imdbVotes ~ imdbRating + tomatoRating + tomatoUserReviews+ I(genre1 ** 3.0) +I(genre2 ** 2.0)+I(genre3 ** 1.0), data = movies)

How can you run a regression model on data that includes NAs?

Br; April 12, 2016 at 11:36 PM


Tuesday, November 25, 2014
