Tuesday, November 25, 2014

Using Python for a Recommendation Engine



Python for a Recommendation Engine

Well, here in Page1 we got the data and fetched the associated details for each movie in the dataset. Alright, now let's roll: we'll just write the algorithm and we're done...

Good morning... this is the real world, and nothing happens that easily here.

Why did I say so?

   Var1 Freq
    N/A  405

Because I have 405 N/A values in imdbRating. That means I have no rating available for 405 of the 1173 movies in my list. That's sick, because about 35% of the dataset is unusable.
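Counting those is a one-liner in pandas; a small sketch, where the CSV file name is hypothetical but the "N/A" string matches the dump above:

 import pandas as pd

 df2 = pd.read_csv("movies_with_details.csv")   # hypothetical file name
 print((df2["imdbRating"] == "N/A").sum())      # 405 for my dataset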

So what should I do? Should I simply dump that data? And why did this happen in the first place?

I can't dump that much data. And it happened because some movies have no details on IMDb, maybe because they were released only in a particular region and are not globally known, or for some other reason.


What Should I Do ?

I will cleanse and shape my data.

Let's consider a movie named KNHN, and suppose 10 of my FB friends like it. I can compare that number with the most-liked movie in my list, 3 Idiots, which has 30 likes.

So I know that KNHN got roughly 30% as many likes as the top movie, which gives it a fairly high weight. Its rating should therefore sit above the average rating of 5.0, so I add that 30% share of the highest rating (8.4) to 5.0 and land somewhere between 7 and 8 (just an example).
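A minimal sketch of that heuristic (the numbers and movie names are just the example above, not the exact code from the repo):

 def estimate_rating(friend_likes, max_likes=30, avg_rating=5.0, top_rating=8.4):
     """Impute a missing rating from how popular a movie is among my FB friends."""
     popularity = friend_likes / float(max_likes)   # e.g. 10 / 30 ~= 0.33
     return avg_rating + popularity * top_rating    # 5.0 + 0.33 * 8.4 ~= 7.8

 print(estimate_rating(10))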

So are we finished here? Not quite; there are more fields to fill in, like imdbVotes and tomatoUserReviews. So what should I do about those?

Simple: I will use linear regression, or, more grandly, a bit of machine learning, to do this. Bingo!!!! I used Python's statsmodels library to run a polynomial regression:

  import pandas as pd
  import statsmodels.formula.api as sm

  # Fit tomatoUserReviews against the ratings and the genre columns
  result = sm.ols(
      formula="tomatoUserReviews ~ imdbRating + tomatoRating + I(genre1 ** 1.0) + I(genre2 ** 1.0) + I(genre3 ** 4.0)",
      data=df2).fit()
  # Predict the review count for the row whose value is missing
  pretomatoRating = result.predict(
      pd.DataFrame({'imdbRating': [newRating], 'tomatoRating': [newRating],
                    'genre1': [0], 'genre2': [0], 'genre3': [0]}))
  pretomatoRating = int(round(pretomatoRating))
  if pretomatoRating < 0:        # a negative prediction makes no sense for a count
      pretomatoRating = 10000    # so fall back to a fixed default
  self.df2.loc[index, "tomatoUserReviews"] = pretomatoRating


So now my dataset is looking quite pretty:




    imdbRating  imdbVotes     imdbID  tomatoRating  tomatoUserReviews
 1        6.20      89960                     6.20            3504356
 2        6.40       5108  tt2678948          6.30                152
 3        6.80         35  tt0375174          6.30                125
 4        7.20      99763  tt0087538          6.90             314496
 5        6.90     231169  tt1068680          5.30             316060
 6        7.70     291519  tt0289879          4.80             621210
 7        6.90     174047  tt1401152          5.80              74879
 8        7.20       1038  tt0452580          6.30                253
 9        4.90      10381                     6.30               6006
10        7.20         28  tt0297197          6.30               6006
11        4.90      10381  tt1278160          6.30               6006
12        4.90      10381                     6.30               6006
13        4.90      10381                     6.30               6006
14        3.20        690  tt2186731          6.30                862
15        6.20       5594  tt1252596          6.30               2436

So now all of my movies have ratings and user reviews (almost).

So now I can finally run my recommendation.




Shaping The Final Outcome

Now that I have all the required variables in the dataset, I can finally run my distance algorithm to decide which movie is the next best recommendation.

For the ratings the data is consistent, so I used Euclidean distance; the votes, however, are very sparse, so I used cosine distance for them.

   # Needs: from scipy.spatial import distance  (and Python 2, because of long())
   def calculatedistance(self, movie1, movie2):
     FEATURES = [
       'imdbRating', 'imdbVotes', 'tomatoRating', 'tomatoUserReviews']
     movie1DF = df2[df2.MOVIE == movie1]
     movie2DF = df2[df2.MOVIE == movie2]
     # Ratings: dense values on a comparable scale -> Euclidean distance
     rating1 = (float(movie1DF.iloc[0]['imdbRating']), float(
       movie1DF.iloc[0]['tomatoRating']))
     rating2 = (float(movie2DF.iloc[0]['imdbRating']), float(
       movie2DF.iloc[0]['tomatoRating']))
     # Vote counts: sparse and wildly different magnitudes -> cosine distance
     review1 = (long(movie1DF.iloc[0]['imdbVotes']), long(
       movie1DF.iloc[0]['tomatoUserReviews']))
     review2 = (long(movie2DF.iloc[0]['imdbVotes']), long(
       movie2DF.iloc[0]['tomatoUserReviews']))
     distances = []
     distances.append(round(distance.euclidean(rating1, rating2), 2))
     # Since votes have sparse data, I preferred cosine over Euclidean:
     # http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data
     distances.append(round(distance.cosine(review1, review2), 2))
     return distances


The Python code has:

User recommendation - recommendMovie
Distance comparison between 2 movies - calculatedistance
Distance comparison between one movie and multiple movies - compareOneMovieWithMultiple
Distance comparison among all movies - findSimilarityBetweenMovies

and a few more.

One sample output is uploaded to the cloud: Distance Calculated between Movies

Entire Source Code - git link


A few more analytics in R will be uploaded.



What More Can Be Done?

A lot more. For many movies I couldn't extract the genre, so I could recover it using text extraction from Twitter and get a better result.

Many of the basic ratings could also be derived from sentiment analysis of the movie name across different crawlers.

The recommendation algorithm can be made more versatile.












Recommendation Engine using Python

What Is a Recommendation System?


Well, if you are reading this you probably already know what a recommendation is, but basically it means giving the user a suggestion based on certain parameters: Amazon suggests other books when you buy one, IMDb shows recommended movies when you look at any movie, and so on.



Now let's think: how is that possible?

Answer: it's science, or I should say mathematics.

Let's play with some plain logic here.

Let's consider the movie 'The Godfather'.

Now if somebody likes it, can we recommend that user some superhit comedy movie? Aah, naah... not really. Naturally we would rather recommend the next best movie under the genres Crime, Drama or Thriller.
So that's a recommendation.

Hold on! As a human I can search around and find it, but how is that possible using a programming language?

So now we have a starting point, and I'd like to push you all back into the past, to your school days:
'Euclidean distance or Manhattan distance'...

Do those words seem familiar to you? At first sight the reaction may be "what the hell is this guy talking about", or "why should I need mathematics here?"

Because it is all just mathematics :-)

The distance between two points in a grid based on a strictly horizontal and/or vertical path (that is, along the grid lines), as opposed to the diagonal or "as the crow flies" distance. The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal distance might be computed by applying the Pythagorean theorem.
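A quick sketch of both distances in Python (the movie names and rating numbers here are made up purely for illustration):

 from scipy.spatial import distance

 # Two movies described as (imdbRating, tomatoRating) -- the numbers are made up
 godfather = (9.2, 9.0)
 goodfellas = (8.7, 8.9)

 print(distance.euclidean(godfather, goodfellas))   # straight-line, "as the crow flies"
 print(distance.cityblock(godfather, goodfellas))   # Manhattan: |dx| + |dy|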

For more details, Google them and learn the actual concepts behind them. That's important.


What Exactly I Did with R and Python

I used R to do quick analytics on my data and used Python to write the main algorithm behind it.
I must say that everything I did in Python could have been done in R as well, but I used Python because I am a little more comfortable in a general-purpose programming language than in a scientific one, i.e. R. That's my personal opinion, so please don't just take my word for it :)


Concept behind the entire approach

If I have a collection of movies, then I can easily extract details about each one (user ratings, reviews, genre and so on), and once I have those, I can use any distance measure, such as Euclidean distance, to find the similarity between any two movies.

Now I will pick the movies each of my Facebook friends has liked, compare them with the list of unwatched movies, and sort the candidates by distance. That gives me a list of recommendations, starting with the best and going down from there.
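Roughly, the idea looks like this in Python (a sketch with my own placeholder names, not the actual recommendMovie from the repo):

 def recommend(liked_movies, unwatched_movies, distance_fn, k=10):
     """Rank unwatched movies by how close they are to movies a friend already liked.

     distance_fn(a, b) is any pairwise distance between two movies, for example a
     combination of the Euclidean and cosine distances used later in this series.
     """
     scored = []
     for candidate in unwatched_movies:
         best = min(distance_fn(liked, candidate) for liked in liked_movies)
         scored.append((best, candidate))      # smaller distance = more similar
     scored.sort(key=lambda pair: pair[0])     # closest (best recommendation) first
     return [movie for _, movie in scored[:k]]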


Is it actually so easy ????


Explanation based on Programming

First I connected to Facebook using OAuth and its REST API, so that I could get access to my friends and then to the movies liked by my friends.


 import facebook  # the facebook-sdk package

 graph = facebook.GraphAPI(access_token)  # access_token obtained via OAuth
 profile = graph.get_object("me")
 friends = graph.get_connections("me", 'friends')['data']

Then, for each friend, I fetched the movies they like:



 allLikes = graph.get_connections(friend['id'], "movies")['data']  
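That call sits inside a loop over the friends list from the snippet above; a rough sketch of how the likes can be aggregated (the Counter tally is my illustration, not necessarily the original code):

 from collections import Counter

 movie_likes = Counter()
 for friend in friends:
     allLikes = graph.get_connections(friend['id'], "movies")['data']
     for movie in allLikes:
         movie_likes[movie['name']] += 1    # how many friends like each title

 print(movie_likes.most_common(5))          # e.g. '3 Idiots' topped my list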


The next job is to get the details for each of these movies (rating and everything else), and finally I got the following contents:




MOVIE Year Released Genre Director Poster imdbRating imdbVotes imdbID tomatoRating tomatoUserReviews BoxOffice
2 Jilla 2014 10/01/14 Action, Drama, Thriller R.T. Neason http://ia.media-imdb.com/images/M/MV5BOTUxNzExOTA0NF5BMl5BanBnXkFtZTgwMTUzNTAxMjE@._V1_SX300.jpg 6.4 5108 tt2678948 N/A 152 N/A
3 Pursuit of Happyness 2005 16/07/05 Documentary Patrick McGuinn http://ia.media-imdb.com/images/M/MV5BMTk4NjQ2NzI5Nl5BMl5BanBnXkFtZTcwOTIzNTM0MQ@@._V1_SX300.jpg 6.8 35 tt0375174 N/A 125 N/A
4 The Karate Kid 1984 22/06/84 Action, Drama, Family John G. Avildsen http://ia.media-imdb.com/images/M/MV5BMTkyNjE3MjM2MV5BMl5BanBnXkFtZTYwMzY5ODk4._V1_SX300.jpg 7.2 99763 tt0087538 6.9 314496 N/A
5 Yes Man 2008 19/12/08 Comedy, Romance Peyton Reed http://ia.media-imdb.com/images/M/MV5BNjYyOTkyMzg2OV5BMl5BanBnXkFtZTcwODAxNjk3MQ@@._V1_SX300.jpg 6.9 231169 tt1068680 5.3 316060 $97.6M
6 The Butterfly Effect 2004 23/01/04 Sci-Fi, Thriller Eric Bress, J. Mackye Gruber http://ia.media-imdb.com/images/M/MV5BMTI1ODkxNzg2N15BMl5BanBnXkFtZTYwMzQ2MTg2._V1_SX300.jpg 7.7 291519 tt0289879 4.8 621210 $57.7M
7 Unknown 2011 18/02/11 Action, Mystery, Thriller Jaume Collet-Serra http://ia.media-imdb.com/images/M/MV5BODA4NTk3MTQwN15BMl5BanBnXkFtZTcwNjUwMTMxNA@@._V1_SX300.jpg 6.9 174047 tt1401152 5.8 74879 $63.7M
8 A Year Ago in Winter 2008 06/01/10 Drama Caroline Link http://ia.media-imdb.com/images/M/MV5BMTQ4MTUzNTIwM15BMl5BanBnXkFtZTcwMTEzMjA0Mg@@._V1_SX300.jpg 7.2 1038 tt0452580 N/A 253 N/A
10 James Bond 007 1983 N/A Adventure, Animation, Action N/A N/A 7.2 28 tt0297197 N/A N/A N/A
14 Department 2012 18/05/12 Action Ram Gopal Varma N/A 3.2 690 tt2186731 N/A 862 N/A
15 Ajab Prem Ki Ghazab Kahani 2009 06/11/09 Comedy, Romance Rajkumar Santoshi http://ia.media-imdb.com/images/M/MV5BMjA0NjAwNzYxOV5BMl5BanBnXkFtZTcwNzA4NTk5Mw@@._V1_SX300.jpg 6.2 5594 tt1252596 N/A 2436 N/A



The entire code to get the data can be found at facebook_movies_dataset.
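The per-movie lookup itself is one HTTP call per title; a minimal sketch, assuming the OMDb web API is where fields like imdbRating and tomatoRating come from (the endpoint and parameters here are my assumption, not copied from the repo):

 import requests

 def movie_details(title):
     # OMDb returns IMDb fields (and, back then, Rotten Tomatoes fields) as JSON;
     # note that newer OMDb requires an extra apikey parameter
     resp = requests.get("http://www.omdbapi.com/",
                         params={"t": title, "tomatoes": "true"})
     data = resp.json()
     keys = ("Title", "Year", "Genre", "imdbRating", "imdbVotes",
             "imdbID", "tomatoRating", "tomatoUserReviews")
     return {k: data.get(k, "N/A") for k in keys}

 print(movie_details("The Karate Kid"))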

Now that I have the data, why not run some quick analytics on it and see whether I have really done a good job?




From the graph above, we can easily see that 3 Idiots is the most liked movie in my FB movies list, and Gaurav Shr.. has liked the largest number of movies, so naturally he has no work to do :)


Source Movies_Analytics


So now I have a dataset that gives me some values, and I can work on my recommendation engine... once you've caught your breath, let's move on to the next important thing: what to code and how to shape the data...

'This is my first-ever program in Python, so I can't claim it is great code.'

Part2





Wednesday, November 5, 2014

Quick analysis of the new mobiles available in the market, and their sentiment, using the Twitter API


Phone Analysis and Sentiment

I've heard a lot about the iPhone 6 and now the Nexus 6, so I finally decided to try my first ever Twitter analytics exercise using R. The objective is to find out how people are talking about these phones.

Well, to honor Samsung, I added it too.

A few things to know before starting Twitter analytics with R (not from an expert's point of view):

1. It's very easy, and R will make your life ultra easy. Python is great for this as well, but I will post about that some time later.

2. For quick analytics there is no point in chasing Big Data; simply use fewer records and get the job done.

3. The number of records does change the fate of your analysis, so it is always good to re-run it on a large data volume. Do that step once your Big Data setup is done and you have some experience working with Cassandra, Spark, etc.

4. To learn this you just need to know some technology and you must love technology; then learning R and doing #rstats won't be any trouble.

5. You need some basic libraries like 'twitteR', 'plyr', 'stringr' and 'ggplot2' for Twitter access and statistics; do some basic hands-on work with them and then roll.

6. If you are good at mathematics, that's wonderful, but you don't need to be the best at it; basic school mathematics concepts are enough for a starter... at least that's what I felt.

Once I ran the stats-



Great: the iPhone still leads on the positive side, while the Nexus 6, based on this data, isn't actually getting that positive a vibe. Well then, why had I heard otherwise...

Again, all these stats could change a lot if I tried the same with more data.


The source code with comments is already uploaded to my GitHub: RsourceFile. Just explore...

Happy Coding..





Monday, October 27, 2014

Some interesting things about Regression

Residuals


Once we run a linear regression model, the summary reports the residuals. To understand residuals a bit better, here are some interesting points I stumbled upon that helped me, so I am writing them down...


  • Residuals have mean zero, which means the residuals are balanced across the data points: no pattern, just scatter, with roughly as many positive as negative values.


So if I run a linear regression in R:

fit <- lm(relation ~ person, data = people)

then to verify the claim, just take the mean of the residuals:

mean(fit$residuals) must give a value very close to zero.





  • There is no correlation between residuals and predictors.

cov(fit$residuals, people$person)

covariance


While googling, I found another useful identity:


  • var(data) = var(estimate) + var(residuals)


Least Squares

The regression line is the line through the data that has the minimum (least) squared 'error', where the error is the vertical distance between the actual value and the prediction made by the line.
Squaring the distances ensures that data points above and below the line are treated the same.

The method of choosing the 'best regression line' (or fitting a line to the data) is known as ordinary least squares.
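In symbols, for simple linear regression, ordinary least squares picks the intercept and slope that minimize the sum of squared vertical errors:

$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$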





Thursday, October 16, 2014

Why I Tried a Residual Plot on My Dataset

Why a Residual Plot?


Well, I previously had a sick dataset of India's population. It was correct, but I altered it and made it worse: the population increases linearly with the year and then suddenly shoots to a peak, so ideally this data was never meant for linear regression. Still, bound by habit, I ran linear regression on it and found this:



OK, above is the fitted line I got, and it is terrible. Believe me, because I ran the predictions and got brilliantly bad results :(

For the years 1800, 2030 and 2040 I got:

         1          2          3
 -11839.78  824736.40  861109.28

So apparently there was no India on the map in 1800 :O. What? That's not possible; I messed it up...
Well, I had already tried to make the data behave properly, but nothing helped.

So now I knew that I needed to transform my data somehow, so I searched the internet and found a keyword: residual plot.


Well, what, yet another new concept? Why should I learn this...

A residual is the error between an actual value of the dependent variable and the predicted value. Setting all the mind-blowing keywords aside, I finally concluded that a residual plot is a way to check whether a model is a 'good fit' or not.

There are two very basic and easy things to remember about residual plots:
 1. The residuals for a 'good' regression model are normally distributed, and random.

 2. The residuals for a 'bad' regression model are non-normal, and have a distinct, non-random pattern.




So from the plot above, we can see a sure-shot case of bad data and a bad model. I know for sure this model is bad, since the residuals clearly show a pattern, a superb growing pattern...


More, in case I need more:

Having come this far, I should round it off with an example where the data actually fits the model well, and see how its residuals look.

Here is the sample data:

x <- runif(100,-3,3)

y <- x + sin(x) + rnorm(100, sd = .2)


and I got -


A good one, isn't it? But let's not be in a hurry; let's see the residual plot:



Now I can see a pattern: a sine wave, ahh... So the scatter plot says the model looks good, but it is not quite. Don't judge by the scatter plot or the fitted model alone; there may be trouble inside, and there is no harm in checking the residual plot.





Monday, September 22, 2014

Linear Regression: What Is It and When Should I Use It - Machine Learning

Linear Regression, or Regression with Multiple Covariates

Believe me, these are extremely easy to understand, and R already has these algorithms implemented; you just need to know how to use them :)

Let's say we have variables X and Y. In simple words, linear regression is a way to model the relationship between X and Y, that's all :-). If we have X1...Xn and Y, then modelling the relationship between them is multiple linear regression.
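In standard textbook notation (nothing specific to this post, just the usual form of the model):

$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad \text{(simple linear regression)}$$
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \varepsilon \qquad \text{(multiple linear regression)}$$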

Linear regression is a very widely used machine learning algorithm, because models which depend linearly on their unknown parameters are easier to fit.


Uses of Linear Regression ~


  • Prediction-type applications can be built using linear regression: after developing a linear regression model, for any new value of X we can predict the value of Y (based on the model fitted to a previous set of data).



  • For a given Y, if we have multiple predictors X1.....Xn, this technique can be used to quantify the relationship of each X with Y, so we can find which predictor has the weakest relationship with Y and which has the strongest.

The reason I wrote out the theory above is so that I remember the basics; the rest is easy :).


------------------------------------------------------------------

So now I'd like to do an example in R, and the best dataset I could find was population data.
Talking of population, how could I miss India? So somehow I managed to get a dataset:


Above is just a snapshot of the data. I had data from 1700 till 2014, and yes, some missing values in between as well.

In R, the caret package already wraps an implementation of regression, so load that; for plotting I am using ggplot2.

The bottom line after getting the data is to do some exploratory analysis. Well, I have two fields and no time :), so just a quick plot:


Looking great: it's growing... and growing... so it's real data.
So, first things first, split the data into two parts, training and testing:



 library(caret)  # createDataPartition() comes from caret

 allTrainData <- createDataPartition(y=data$population, p=0.7, list=FALSE)
 training <- data[allTrainData,]
 testing <- data[-allTrainData,]


So now I have X and Y; simply put, I want to predict the population from the year (or vice versa).

Don't worry: the caret package already ships with an implementation of the linear regression algorithm.
For the formula behind it, please check my other blog post on the details of linear modelling, but here it is:


 model <- train(population~.,method="lm",data=training)  
 finalModel <- model$finalModel  

One line, that's all: method="lm". Isn't it extraordinary :)
So here is the summary:



 Call:  
 lm(formula = .outcome ~ ., data = dat)  
 Residuals:  
   Min   1Q Median   3Q   Max   
 -186364 -164118 -83667 106876 811176   
 Coefficients:  
       Estimate Std. Error t value Pr(>|t|)    
 (Intercept) -6516888   668533 -9.748 4.69e-16 ***  
 year      3616    346 10.451 < 2e-16 ***  
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
 Residual standard error: 216300 on 97 degrees of freedom  
 Multiple R-squared: 0.5296,     Adjusted R-squared: 0.5248   
 F-statistic: 109.2 on 1 and 97 DF, p-value: < 2.2e-16  

Details about the summary: Linear Regression Model Summary

So now let's plot the fitted vs. residuals graph and see how well the model worked.






Somewhat weird :), but at least the line goes through almost all of the data.

Now, how well did the model work?



Well, my data is weird anyway, so seriously, it worked pretty well, believe me :)




So now, with the fitted model, we should try the testing dataset, and that is straightforward as well:


 pred <- predict(model, testing)  # predicted population for the held-out rows


Disclaimer** - I am not a PhD holder or a data scientist; what I do is out of self-interest and learning... so this may contain some serious mistakes :)




Monday, September 15, 2014

When to Consider Data Mining or Web Scraping in the Big Data Ecosystem

Data Mining / Web Scraping on a Big Data Platform


What is Web Scraping?

The web has lots and lots of data, and sometimes we need to research a specific subset of it, but we can't assume all of that data makes sense.
For example:
 <div class="island summary">   
  <ul class="iconed-list">   
  <li class="biz-hours iconed-list-item">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-clock-open-biz_details"></i>   
   </div>   
   <div class="iconed-list-story">   
   <span> Today <span class="hour-range"><span class="nowrap">10:00 am</span> - <span class="nowrap">10:00 pm</span></span> </span>   
   <span class="nowrap extra open">Open now</span>   
   </div> </li>   
  <li class="iconed-list-item claim-business">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-suitcase-red-star-biz_details"></i>   
   </div>   
   <div class="iconed-list-story">   
   <a href="https://biz.yelp.com/signup/Z_oAg2AmqZEtu5hlfPruNA/account"> <b>Work here?</b> Claim this business </a>   
   </div> </li>   
  </ul>   
 </div>  

The above is just markup with loads of stuff in it, and if you handed me this blob I'd be like, what the hell is this?
But if you concentrate on the data, you can derive a few important pieces of information: there is a time written, 10:00 am, and some 'Open now' text, so we can assume that some business opens at 10:00 am.



So you read a URL and may get lots of data, useful or useless, but some of it does make sense, and extracting that data is what web scraping is.

Data mining, on the other hand, is digging and digging through that data to extract information, which can sometimes feel like solving a puzzle.




Technologies Involved

Since I use Java, I tried Apache Nutch, which is a scalable web crawler: extract the data with Nutch and dump it into Apache Solr for fast indexing.

Apache Nutch - http://nutch.apache.org/
Apache Solr - http://lucene.apache.org/solr/

Both do the job quite well, but still, why Big Data????


Why Big Data in Web Scraping


Now consider a scenario where I need to match the GPS coordinates of a particular destination across all of the Yellow Pages-style websites.
Using a simple DFS crawl, we can assume we collect some 10,000+ pages that might hold the record.
Now assume the data is extremely untidy and we are trying to find a GPS coordinate in it.

Logically, since we need to extract a pattern from all these pages to get the GPS coordinate (and then match it), we may need to run some very complex regexes, and doing that on a single machine will surely drain its memory.

So, thinking logically, what if we run the entire process in parallel...

Bingo!!! Big Data comes into the picture now.

I would first write a script to pull the data from all the URLs and dump it into HDFS, and then run the regex over Spark to get the result quickly.
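A minimal PySpark sketch of that idea, assuming the crawled pages are already dumped as text files in HDFS (the path and the coordinate regex are just illustrative):

 import re
 from pyspark import SparkContext

 sc = SparkContext(appName="gps-extract")

 # Very rough "lat,lon" pattern -- real pages would need a smarter regex
 COORD = re.compile(r'(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)')

 pages = sc.textFile("hdfs:///scraped/yellowpages/*.txt")   # illustrative path
 coords = (pages
           .flatMap(lambda line: COORD.findall(line))        # regex runs in parallel
           .distinct())

 print(coords.take(10))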

So, to conclude: if the process is incremental and the scraping depth is high, it is really helpful to use a Big Data setup for the job.
......


Monday, September 8, 2014

What is Principal Component Analysis and Why Normalization


PCA, or Principal Component Analysis, is essentially a way of finding variables that carry similar information and extracting what they have in common, so that the analysis can be done over fewer dimensions while still giving good, and possibly better, results.

Some more problems we can address using PCA -


  • Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.



  • If you put all the variables together in 1 matrix, find the best matrix created with fewer variables that explains the original data.

Normalization is very important before you do PCA:

'Normalization' matters because if one variable has a large variance and another has a small one, PCA will be biased towards the large-variance variable. So if we have a variable in km and we inflate its variance by converting it to cm, PCA will suddenly promote that variable from nowhere to first place.
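A quick way to see this effect (a Python sketch with scikit-learn on synthetic data, purely for illustration since this post is conceptual):

 import numpy as np
 from sklearn.decomposition import PCA
 from sklearn.preprocessing import StandardScaler

 rng = np.random.RandomState(0)
 km = rng.normal(50, 10, 500)                    # a distance-like feature in km
 score = 0.1 * km + rng.normal(0, 1, 500)        # a correlated small-scale feature
 X = np.column_stack([km * 100000, score])       # convert km to cm: variance explodes

 print(PCA(2).fit(X).explained_variance_ratio_)  # ~[1.0, 0.0]: the cm column dominates

 X_scaled = StandardScaler().fit_transform(X)
 print(PCA(2).fit(X_scaled).explained_variance_ratio_)  # far more balanced after scaling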

http://stats.stackexchange.com/questions/69157/why-do-we-need-to-normalize-data-before-analysis


Friday, September 5, 2014

GPS Coordinate Analysis using R-Programming - Analytics


R-Programming GPS Analytics


Well, I had a big dataset of coordinates and was asked to do some analysis on it. I tried Java, which did fairly well, but I wanted something different, and I knew just the tool: I went for R, and I always love it :-).


Before doing anything: R already has enough libraries for this, so after a little research I settled on the following.

Hmisc is one of my favourites because it has a large number of data analysis and mathematical functions already implemented, which can be used for cleansing, segregating, analysing data, string manipulation and a whole series of other things I can't even list.

I had a huge dataset and wanted to extract only the important or required records from it; since I like writing queries, I used the sqldf package.

And finally ggmap, which is an excellent library: it allows easy visualization of spatial data and models on top of Google Maps, OpenStreetMap, Stamen Maps or CloudMade Maps using ggplot2.


I extracted the data from a CSV file and then pulled the longitude and latitude out of the dataset:

 library(sqldf)  # lets us query the data frame with SQL

 irdata = read.csv("D:/data_csv.csv", header = TRUE)
 data = sqldf("select Longitude,Latitude from irdata where Latitude != 'N/A' and City == 'Mumbai' ")
 data_long = as.numeric(levels(data$Longitude)[data$Longitude])
 data_lat = as.numeric(levels(data$Latitude)[data$Latitude])


Now, since I have more than 5000 points, you can imagine that if I tried to draw all of them, my map would look like a spider web and I couldn't analyse anything. So I just extracted the small slice I needed:



 someCoords1 <- data.frame(long=data_long[100:200], lat=data_lat[100:200])  

OK, now I have some data, so why not try to find the distance between each pair of these coordinates? We don't need a huge amount of data for the comparison; I took just 6 points, and the distances are in km.



 library(sp)  # spDistsN1() comes from the sp package
 apply(someCoords1, 1, function(eachPoint) spDistsN1(as.matrix(someCoords1), eachPoint, longlat=TRUE))

If you don't want kilometres, use longlat = FALSE.

      [,1]        [,2]        [,3]     [,4]        [,5]     [,6]  
 [1,] 0.000000 16.6289935 9.937742 44.73177 15.7613710 17.661536  
 [2,] 16.628993 0.0000000 14.142999 29.38614 0.8795789 3.794917  
 [3,] 9.937742 14.1429990 0.000000 37.64567 13.3679239 17.083060  
 [4,] 44.731771 29.3861396 37.645667 0.00000 30.0890677 30.811816  
 [5,] 15.761371 0.8795789 13.367924 30.08907 0.0000000 4.162316  
 [6,] 17.661536 3.7949174 17.083060 30.81182 4.1623161 0.000000  

Now simply use get_map and pass the coordinates-
 mapgilbert <- get_map(location = c(lon = mean(mapdata$data_long), lat = mean(mapdata$data_lat)), zoom = 14,  
            maptype = "roadmap", source = "google")  

Now it's time to draw it, so I used geom_point on top of the map.



 ggmap(mapgilbert) +
  geom_point(data = mapdata, aes(x = data_long, y = data_lat, fill = "red", alpha = 0.8), size = 3, shape = 21) +
  expand_limits(x = data_long, y = data_lat) + guides(fill=FALSE, alpha=FALSE, size=FALSE)


So now the output -


So, isn't it amazing? Well, I found it great :-)

I tried some different kinds of maps as well for the same data.

Now that we have drawn the points, why not draw a path between them:



 ggmap(mapgilbert) +   
  geom_path(aes(x = data_long, y = data_lat), data=mapdata ,alpha=0.2, size=1,color="yellow",lineend='round')+  
  geom_point(data = mapdata, aes(x = data_long, y = data_lat, fill = "red", alpha = 0.8), size = 3, shape = 21)  

So these are the outputs, based on maptype: