Python For Recommendation Engine
On Page 1 we collected the data and fetched the details associated with each dataset. So now let's roll: I will just write the algorithm and it's done...
Good morning! This is the real world, and nothing happens that easily here.
Why do I say so?
Var1 Freq
N/A 405
Because I have 405 N/A values in imdbRating. That means no rating is available for 405 of the 1173 movies in the list. That hurts, because roughly 35% of the dataset is unusable as it stands.
So what should I do? Should I simply dump the data? And why did it happen?
I can't dump that much data. It happened because some movies have no details on IMDb, maybe because they are known only in a particular region and not globally, or for some other reason.
What Should I Do?
I will cleanse and shape my data.
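A count like this can be reproduced in a few lines. This is only a minimal sketch, assuming the raw ratings are strings with "N/A" marking a missing value; the helper name missing_share and the sample list are made up for illustration.

```python
from collections import Counter


def missing_share(values, marker="N/A"):
    """Return (missing_count, fraction_missing) for a list of raw values."""
    counts = Counter(values)
    missing = counts[marker]
    return missing, missing / len(values)


# Hypothetical sample; the real dataset has 1173 movies.
ratings = ["6.2", "N/A", "7.2", "N/A", "4.9"]
missing, share = missing_share(ratings)
print(missing, round(share, 2))  # 2 0.4

# With the numbers from this post: 405 missing out of 1173.
print(round(405 / 1173, 3))  # 0.345
```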
Let's consider a movie named KNHN and count how many of my Facebook friends like it, let's say 10. I can compare that number with the most-liked movie in my list, 3 Idiots, which has 30 likes.
So KNHN has roughly a third (call it 30%) of the likes of the top movie, which gives it a fairly high weight, and its rating should naturally sit above the average rating of 5.0. So I add 30% of the highest rating, 8.4, to 5.0 and end up with a value of about 7.5 (as an example).
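The arithmetic of that heuristic can be sketched as below. The function name estimate_rating and the cap at 10.0 are my own assumptions, not part of the original code. With the exact share 10/30 the estimate is about 7.8; with the rounded 30% used in the text it is about 7.5.

```python
def estimate_rating(likes, max_likes, top_rating, baseline=5.0, cap=10.0):
    """Baseline rating plus the movie's share of the top movie's likes
    multiplied by the top rating, capped so it stays on a 0-10 scale."""
    share = likes / max_likes
    return min(cap, baseline + share * top_rating)


# KNHN: 10 likes; 3 Idiots: 30 likes and a rating of 8.4.
print(round(estimate_rating(10, 30, 8.4), 2))  # 7.8

# The same with the share rounded to 30%, as in the text.
print(round(5.0 + 0.30 * 8.4, 2))  # 7.52
```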
So are we finished here? Not quite. There are more fields to address, such as the vote and review counts (imdbVotes, tomatoUserReviews). So what should I do here?
Simple: I will use regression, or to put it more grandly, machine learning, to fill them in. Bingo! I used the statsmodels library for Python (its formula interface, imported as sm) to fit a regression with polynomial terms:
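    # sm is statsmodels.formula.api; df2 is the movie DataFrame.
    # Fit tomatoUserReviews on the ratings plus polynomial genre terms.
    result = sm.ols(
        formula="tomatoUserReviews ~ imdbRating + tomatoRating"
                " + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
        data=df2).fit()
    # Predict the review count for the movie at row `index`.
    predicted = result.predict(pd.DataFrame({
        'imdbRating': [newRating], 'tomatoRating': [newRating],
        'genre1': [0], 'genre2': [0], 'genre3': [0]}))
    # predict() returns one value per input row; take the single value.
    pretomatoRating = int(round(float(predicted[0])))
    # A negative count is meaningless, so fall back to a fixed 10000.
    if pretomatoRating < 0:
        pretomatoRating = 10000
    self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating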
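The idea behind the snippet above is to fit on the rows where the value is known and predict it where it is missing. Here is a stripped-down sketch of that idea using a single predictor and plain Python instead of statsmodels. The helper names fit_line and impute and the sample rows are hypothetical, and this sketch clamps negative predictions at 0, which differs from the 10000 fallback used above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def impute(rows, x_key, y_key):
    """Fill missing y values (None) using a line fitted on complete rows only."""
    complete = [r for r in rows
                if r[x_key] is not None and r[y_key] is not None]
    a, b = fit_line([r[x_key] for r in complete],
                    [r[y_key] for r in complete])
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row[y_key] is None and row[x_key] is not None:
            row[y_key] = max(0, int(round(a + b * row[x_key])))
        filled.append(row)
    return filled


# Hypothetical rows; the last one has no review count yet.
movies = [
    {"imdbRating": 5.0, "tomatoUserReviews": 100},
    {"imdbRating": 6.0, "tomatoUserReviews": 200},
    {"imdbRating": 7.0, "tomatoUserReviews": 300},
    {"imdbRating": 8.0, "tomatoUserReviews": None},
]
print(impute(movies, "imdbRating", "tomatoUserReviews")[-1])
# {'imdbRating': 8.0, 'tomatoUserReviews': 400}
```

Rows with a missing value never enter the fit here; they only receive a prediction afterwards.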
So now my dataset looks quite pretty:
| | imdbRating | imdbVotes | imdbID | tomatoRating | tomatoUserReviews |
|---|---|---|---|---|---|
| 1 | 6.20 | 89960 | | 6.20 | 3504356 |
| 2 | 6.40 | 5108 | tt2678948 | 6.30 | 152 |
| 3 | 6.80 | 35 | tt0375174 | 6.30 | 125 |
| 4 | 7.20 | 99763 | tt0087538 | 6.90 | 314496 |
| 5 | 6.90 | 231169 | tt1068680 | 5.30 | 316060 |
| 6 | 7.70 | 291519 | tt0289879 | 4.80 | 621210 |
| 7 | 6.90 | 174047 | tt1401152 | 5.80 | 74879 |
| 8 | 7.20 | 1038 | tt0452580 | 6.30 | 253 |
| 9 | 4.90 | 10381 | | 6.30 | 6006 |
| 10 | 7.20 | 28 | tt0297197 | 6.30 | 6006 |
| 11 | 4.90 | 10381 | tt1278160 | 6.30 | 6006 |
| 12 | 4.90 | 10381 | | 6.30 | 6006 |
| 13 | 4.90 | 10381 | | 6.30 | 6006 |
| 14 | 3.20 | 690 | tt2186731 | 6.30 | 862 |
| 15 | 6.20 | 5594 | tt1252596 | 6.30 | 2436 |
So now (almost) all of my movies have ratings and user review counts.
So now I can run my recommendation.
The Python code has:
User recommendation - recommendMovie
Distance comparison between two movies - calculatedistance
Distance comparison between one movie and multiple movies - compareOneMovieWithMultiple
Distance comparison among all movies - findSimilarityBetweenMovies
and a few more.
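The post does not show the body of compareOneMovieWithMultiple, so the following is only a guess at what a one-against-many comparison could look like: compute the rating distance from one movie to every other movie and sort by it. The name rank_by_similarity and the sample ratings are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(target, ratings):
    """Return (name, distance) pairs for all other movies, closest first."""
    base = ratings[target]
    others = [(name, round(euclidean(base, vec), 2))
              for name, vec in ratings.items() if name != target]
    return sorted(others, key=lambda pair: pair[1])


# Hypothetical (imdbRating, tomatoRating) pairs.
ratings = {
    "A": (7.2, 6.9),
    "B": (6.9, 5.3),
    "C": (7.7, 4.8),
    "D": (3.2, 6.3),
}
for name, dist in rank_by_similarity("A", ratings):
    print(name, dist)
```

The first entry of the result would be the natural candidate for the next recommendation.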
I have uploaded one sample to the cloud - Distance Calculated between Movies
Entire source code - git link
A few more analytics in R will be uploaded.
Shaping The Final Outcome
Now that I have all the required variables in the dataset, I can finally run my distance algorithm to decide which movie is the next best one.
For ratings, the data is consistent, so I used Euclidean distance; votes are very sparse, so I used cosine distance for them.
    # Assumes: from scipy.spatial import distance; df2 is the movie DataFrame.
    def calculatedistance(self, movie1, movie2):
        # Columns used for the comparison.
        FEATURES = [
            'imdbRating', 'imdbVotes', 'tomatoRating', 'tomatoUserReviews']
        movie1DF = df2[df2.MOVIE == movie1]
        movie2DF = df2[df2.MOVIE == movie2]
        # float()/int() raise ValueError if a field is still non-numeric.
        rating1 = (float(movie1DF.iloc[0]['imdbRating']),
                   float(movie1DF.iloc[0]['tomatoRating']))
        rating2 = (float(movie2DF.iloc[0]['imdbRating']),
                   float(movie2DF.iloc[0]['tomatoRating']))
        # int replaces Python 2's long, which no longer exists in Python 3.
        review1 = (int(movie1DF.iloc[0]['imdbVotes']),
                   int(movie1DF.iloc[0]['tomatoUserReviews']))
        review2 = (int(movie2DF.iloc[0]['imdbVotes']),
                   int(movie2DF.iloc[0]['tomatoUserReviews']))
        distances = []
        # Ratings: Euclidean distance.
        distances.append(round(distance.euclidean(rating1, rating2), 2))
        # Since votes are sparse, I preferred cosine over Euclidean:
        # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
        distances.append(round(distance.cosine(review1, review2), 2))
        return distances
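To see why the two measures behave differently on vote counts, here is a small plain-Python sketch that does not need scipy. The cosine_distance helper follows the usual definition (1 minus cosine similarity), and the sample vote counts are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cos(angle between a and b); undefined for all-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical (imdbVotes, tomatoUserReviews) pairs.
small = (1000, 100)        # little-known movie
large = (100000, 10000)    # blockbuster with the same vote proportions
skewed = (100, 1000)       # proportions reversed

print(round(euclidean(small, large), 2))         # large, driven by scale
print(abs(round(cosine_distance(small, large), 2)))  # 0.0, same direction
print(round(cosine_distance(small, skewed), 2))  # 0.8, different direction
```

Cosine distance only looks at the proportions between the counts, so a little-known movie and a blockbuster with the same vote pattern come out as close, while Euclidean distance would be dominated by the difference in scale.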
What More Can Be Done?
A lot more. For many movies I couldn't extract the genre, so I could get it through text extraction from Twitter to improve the results.
Many basic ratings could also be derived from sentiment analysis of mentions of the movie name gathered by different crawlers.
The recommendation algorithm can be made more versatile.
1 comment:
Great blog. I have a question
1: Your first model states:
model <- lm(imdbVotes ~ imdbRating + tomatoRating + tomatoUserReviews+ I(genre1 ** 3.0) +I(genre2 ** 2.0)+I(genre3 ** 1.0), data = movies)
How can you run a regression model on data that includes NAs?
Br