Monday, September 22, 2014

Linear Regression, what is it and when should I use it - Machine Learning

Linear Regression or Regression with Multiple Covariates

Believe me, these are extremely easy to understand, and R already has these algorithms implemented; you just need to know how to use them :)

Let's say we have values X and Y. In simple words, Linear Regression is a way to model the relationship between X and Y, that's all :-). In its simplest form the model is Y = b0 + b1*X + error. Now if we have X1...Xn and Y, then the relationship between them is Multiple Linear Regression: Y = b0 + b1*X1 + ... + bn*Xn + error.

Linear Regression is a very widely used Machine Learning algorithm because models which depend linearly on their unknown parameters are easier to fit.


Uses of Linear Regression ~


  • Prediction-type applications can be built with Linear Regression: after developing a Linear Regression model, for any new value of X we can predict the value of Y (based on the model developed from a previous set of data).



  • For a given Y, if we are provided with multiple X's like X1.....Xn, then this technique can be used to find the relationship of each X with Y, so we can identify both the weakest and the strongest relationship with Y (see the small sketch just after this list).
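
A quick, hypothetical sketch of that second use case (the columns x1, x2, x3 and y are made up purely for illustration): fit one lm with several predictors and read the coefficient table to see which X carries weight.

 # hypothetical data: y depends strongly on x1, weakly on x2, and not at all on x3  
 set.seed(42)  
 df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))  
 df$y <- 3 * df$x1 + 0.2 * df$x2 + rnorm(50)  
 fit <- lm(y ~ x1 + x2 + x3, data = df)  
 summary(fit)   # coefficients and p-values point to the strongest and weakest relationships  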

The reason I did all the theory above is so that I can remember the basics; the rest is all easy :).


------------------------------------------------------------------

So now I'd like to do an example in R, and the best dataset I could find was population data.
Talking about population, how could I miss India? So somehow I managed to get a dataset-


Above is just a snapshot of the data; I had data from 1700 till 2014 and, yeah, some missing data in-between as well.

To do this in R, the caret package already has an implementation of regression, so load that; for plotting I am using ggplot.
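
A minimal sketch of loading them (assuming "ggplot" here means the ggplot2 package):

 library(caret)     # createDataPartition(), train()  
 library(ggplot2)   # plotting  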

The bottom line after getting the data is to do some exploratory analysis. Well, I have 2 fields and no time :), so just a quick plot-
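
A possible version of that quick plot, assuming the data frame is called data with columns year and population (consistent with the code further down):

 ggplot(data, aes(x = year, y = population)) + geom_point()  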


Looking great, it's growing and growing and... so it's real data.
So first things first, split the data into 2 parts, training and testing-



 # put 70% of the rows into training, the rest into testing  
 allTrainData <- createDataPartition(y=data$population, p=0.7, list=FALSE)  
 training <- data[allTrainData,]  
 testing <- data[-allTrainData,]  


So now I have X and Y; simply put, I want to find the population based on the year, or vice versa.

Don't worry, R's caret package already provides an implementation of the linear regression algorithm.
For the formula behind it, please check my other blog post on the details of Linear Modelling, but here-


 # one call: fit a linear model (method="lm") predicting population from the remaining column(s)  
 model <- train(population~., method="lm", data=training)  
 finalModel <- model$finalModel  

1 line, that's all, just method="lm"; isn't it extraordinary :)
So the summary here-
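
The output below is simply what summary() prints for the underlying lm object:

 summary(finalModel)  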



 Call:  
 lm(formula = .outcome ~ ., data = dat)  

 Residuals:  
     Min      1Q  Median      3Q     Max  
 -186364 -164118  -83667  106876  811176  

 Coefficients:  
              Estimate Std. Error t value Pr(>|t|)  
 (Intercept) -6516888     668533  -9.748 4.69e-16 ***  
 year            3616        346  10.451  < 2e-16 ***  
 ---  
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

 Residual standard error: 216300 on 97 degrees of freedom  
 Multiple R-squared:  0.5296,    Adjusted R-squared:  0.5248  
 F-statistic: 109.2 on 1 and 97 DF,  p-value: < 2.2e-16  

Details about the summary are in Linear Regression Model Summary

So now let's plot the fitted vs residual graph and see how well the model worked.
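
A small sketch of that plot using base graphics on the fitted lm object (ggplot would work just as well):

 # residuals vs fitted values for the final lm fit  
 plot(finalModel$fitted.values, finalModel$residuals, xlab = "Fitted", ylab = "Residuals")  
 abline(h = 0, lty = 2)  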






Somewhat weird :), but at least the line goes through almost all the data.

Now, how well did the model work -
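
One simple way to put a number on it, as a rough sketch: compute the RMSE of the model on the training data.

 trainPred <- predict(model, training)  
 sqrt(mean((trainPred - training$population)^2))   # training RMSE  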



Well, my data is weird anyway, so seriously, it worked pretty well, believe me :)




So now, with the model built, we should try it on the testing dataset, and that is just as straightforward -


 pred <- predict(model,testing)  
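
And to see how those predictions hold up, a small sketch comparing them with the actual test values:

 sqrt(mean((pred - testing$population)^2))     # test-set RMSE  
 plot(testing$population, pred); abline(0, 1)  # points near the diagonal mean good predictions  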


disclaimer** - I am not a PhD holder or a data scientist; what I do is out of self-interest and for learning... so it may contain some serious mistakes :)




Monday, September 15, 2014

When should you consider Data Mining or Web Scraping in the Big Data Ecosystem

Data Mining / Web Scraping on a Big Data Platform


What is Web Scraping?

The web has lots and lots of data, and sometimes we need to research some specific set of it, but we can't assume all of the data makes sense.
For example-
 <div class="island summary">   
  <ul class="iconed-list">   
  <li class="biz-hours iconed-list-item">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-clock-open-biz_details"></i>   
   </div>   
   <div class="iconed-list-story">   
   <span> Today <span class="hour-range"><span class="nowrap">10:00 am</span> - <span class="nowrap">10:00 pm</span></span> </span>   
   <span class="nowrap extra open">Open now</span>   
   </div> </li>   
  <li class="iconed-list-item claim-business">   
   <div class="iconed-list-avatar">   
   <i class="i ig-biz_details i-suitcase-red-star-biz_details"></i>   
   </div>   
   <div class="iconed-list-story">   
   <a href="https://biz.yelp.com/signup/Z_oAg2AmqZEtu5hlfPruNA/account"> <b>Work here?</b> Claim this business </a>   
   </div> </li>   
  </ul>   
 </div>  

Above is just a chunk of markup and it has loads of things in it; if you gave me this as-is, I'd be like, what the hell is this?
But if you concentrate on the data, you can derive a few important pieces of information: there is a time written, 10:00 am, and some "Open now" text, so we can assume that some object (a business, here) opens at 10:00 am.



So you read a URL and you may get lots of data, useful or useless, but there may be some data in there that does make sense, and extracting that data is what Web Scraping is.
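
As a sketch of what that extraction can look like in R (my illustration only; the crawler pipeline I actually used is described below), rvest can pull the opening hours and the "Open now" text out of the snippet above:

 library(rvest)   # HTML parsing; read_html() is re-exported from xml2  
 snippet <- paste0('<span>Today <span class="hour-range"><span class="nowrap">10:00 am</span>',  
          ' - <span class="nowrap">10:00 pm</span></span></span>',  
          '<span class="nowrap extra open">Open now</span>')  
 page <- read_html(snippet)  
 html_text(html_nodes(page, ".hour-range .nowrap"))   # "10:00 am" "10:00 pm"  
 html_text(html_nodes(page, "span.extra.open"))       # "Open now"  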

Data Mining, on the other hand, is digging and digging and digging through that data to extract information, which is sometimes like solving a puzzle.




Technologies Involved

Since I use Java, I tried Apache Nutch, which is a scalable web crawler; I extract data with Nutch and dump it into Apache Solr for fast indexing.

Apache Nutch - http://nutch.apache.org/
Apache Solr - http://lucene.apache.org/solr/

Both do the job quite well, but still, why Big Data?


Why Big Data in Web Scraping


Now consider a scenario where I need to match the GPS coordinates of a particular destination across all the YellowPages-style websites out there.
So, using a simple DFS-style crawl, we can assume we collect some 10000+ websites that might hold the record.
Now assume we have some extremely untidy data and we are trying to find a GPS coordinate in it.

Logically, since we need to extract a pattern from all these websites to get the GPS coordinates (and then match them), we may need to run some very complex regexes, and on a single machine that will surely drain the entire memory.
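
Just to make the regex part concrete, a tiny hypothetical sketch in R (the sample text and the pattern are made up; a real pattern would be messier):

 pageText <- "Find us at 19.0760, 72.8777 near the station"   # pretend scraped text  
 pattern  <- "-?[0-9]{1,3}\\.[0-9]+\\s*,\\s*-?[0-9]{1,3}\\.[0-9]+"  
 regmatches(pageText, gregexpr(pattern, pageText, perl = TRUE))   # "19.0760, 72.8777"  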

So now, just thinking logically, what if we run the entire process in parallel...

Bingo!!! Big Data comes into the picture now.

I would first write a script to pull the data from all the URLs and dump it into HDFS, and then run the regex over Spark to get the result quickly.

So, to conclude: if we have an incremental process and the scraping depth is high, then it is really helpful to use a Big Data setup for the job.


Monday, September 8, 2014

What is Principal Component Analysis and Why Normalization Matters


PCA, or Principal Component Analysis, is a way of finding variables which are similar (correlated) and then extracting the common information from them, so that the data analysis can be done over fewer dimensions, which can even give a better result.

Some more problems we can address using PCA -


  • Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.



  • If you put all the variables together in 1 matrix, find the best matrix created with fewer variables that explains the original data.

Normalization is very important before you do PCA -

'Normalization' matters because if one variable has a large variance and another has a small one, PCA will be biased towards the large-variance variable. So if we have a variable measured in km and we inflate its variance by converting it to cm, PCA will start favouring that variable, promoting it from nowhere to 1st place.
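
A small sketch with prcomp on made-up data shows the effect: a variable stored in centimetres swamps the other one until we scale.

 set.seed(1)  
 km    <- rnorm(100)          # a distance measured in kilometres  
 cm    <- km * 100000         # the same distance re-expressed in centimetres (huge variance)  
 other <- rnorm(100, sd = 5)  
 d <- data.frame(cm, other)  
 prcomp(d)$rotation                  # PC1 is dominated by cm  
 prcomp(d, scale. = TRUE)$rotation   # after scaling, both variables contribute  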

http://stats.stackexchange.com/questions/69157/why-do-we-need-to-normalize-data-before-analysis


Friday, September 5, 2014

GPS Coordinate Analysis using R-Programming - Analytics


R-Programming GPS Analytics


Well, I had a big dataset of coordinates and I was asked to do some analysis on it. I tried Java and it did fairly well, but I wanted something different, and I knew just the tool, so I went for R-Programming, and I always love it :-).


So, before doing anything: R already has enough libraries for this, so after a little research I settled on-

Hmisc is one of my favourites because it has a large number of data-analysis and mathematical functions already implemented, which can be used for cleansing, segregation, data analysis, string manipulation, and a series of other things I just can't list here.

I had a huge dataset and I wanted to extract only the important or required records from it, and since I like queries, I used the sqldf package.

And finally, ggmap is an excellent library: it allows for the easy visualization of spatial data and models on top of Google Maps, OpenStreetMap, Stamen Maps, or CloudMade Maps using ggplot2.


I extracted the data from a CSV file and then pulled the longitude and latitude out of the dataset.

 library(sqldf)   # lets us query the data frame with SQL  
 irdata = read.csv("D:/data_csv.csv", header = TRUE)  
 data = sqldf("select Longitude,Latitude from irdata where Latitude != 'N/A' and City == 'Mumbai' ")  
 data_long = as.numeric(levels(data$Longitude)[data$Longitude])  # factor column -> numeric  
 data_lat = as.numeric(levels(data$Latitude)[data$Latitude])  


Now, since I have more than some 5000 points, you can imagine that if I try to draw all of them, my map will look like a spider web and I won't be able to analyse anything, so I just extracted what I required, a small amount of data-



 someCoords1 <- data.frame(long=data_long[100:200], lat=data_lat[100:200])  

OK, now I have some data, so why not try to find the distance between each of these coordinates? We don't need a huge amount of data for the comparison, so I took just 6 points; the differences are in km-



 library(sp)  # spDistsN1() comes from the sp package  
 apply(someCoords1, 1, function(eachPoint) spDistsN1(as.matrix(someCoords1), eachPoint, longlat=TRUE))  

If you don't want kilometres, use longlat = FALSE (you then get plain Euclidean distances in the units of the coordinates).

      [,1]        [,2]        [,3]     [,4]        [,5]     [,6]  
 [1,] 0.000000 16.6289935 9.937742 44.73177 15.7613710 17.661536  
 [2,] 16.628993 0.0000000 14.142999 29.38614 0.8795789 3.794917  
 [3,] 9.937742 14.1429990 0.000000 37.64567 13.3679239 17.083060  
 [4,] 44.731771 29.3861396 37.645667 0.00000 30.0890677 30.811816  
 [5,] 15.761371 0.8795789 13.367924 30.08907 0.0000000 4.162316  
 [6,] 17.661536 3.7949174 17.083060 30.81182 4.1623161 0.000000  

Now simply use get_map and pass the coordinates-
 # assuming mapdata holds the sampled points from above (its creation was not shown in the original post)  
 mapdata <- data.frame(data_long = someCoords1$long, data_lat = someCoords1$lat)  
 mapgilbert <- get_map(location = c(lon = mean(mapdata$data_long), lat = mean(mapdata$data_lat)), zoom = 14,  
             maptype = "roadmap", source = "google")  

So now it's time to draw it, so I put geom_point on top of the map.



 ggmap(mapgilbert) +  
   geom_point(data = mapdata, aes(x = data_long, y = data_lat, fill = "red", alpha = 0.8), size = 3, shape = 21) +  
   expand_limits(x = data_long, y = data_lat) + guides(fill=FALSE, alpha=FALSE, size=FALSE)  


So now the output -


So, isn't it amazing? Well, I found it great :-)

I tried some different kinds of maps as well for the same data-

Now that we have drawn these points, why not draw the path between them-



 ggmap(mapgilbert) +   
  geom_path(aes(x = data_long, y = data_lat), data=mapdata ,alpha=0.2, size=1,color="yellow",lineend='round')+  
  geom_point(data = mapdata, aes(x = data_long, y = data_lat, fill = "red", alpha = 0.8), size = 3, shape = 21)  

So these are the outputs, based on the maptype-