Data Engineer working with multiple Big Data technologies and Machine Learning: Linear Regression, What Is It and When Should I Use It

Linear Regression, or Regression with Multiple Covariates

Believe me, these are extremely easy to understand, and R already has these algorithms implemented; you just need to know how to use them :)

Let's say we have values X and Y. In simple words, Linear Regression is a way to model the relationship between X and Y, that's all :-). If we instead have X1...Xn and Y, modelling the relationship between them is Multiple Linear Regression.
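
To make the difference concrete, here is a minimal sketch using base R's `lm`. The data is synthetic and made up for illustration only, and the names `x1`, `x2`, `simple_fit` and `multi_fit` are just placeholders, not anything from the population example later in this post.

```r
# Synthetic data, made up for illustration only.
set.seed(1)
x1 <- 1:50
x2 <- runif(50)
y  <- 3 + 2 * x1 + 5 * x2 + rnorm(50)

# Simple linear regression: one X.
simple_fit <- lm(y ~ x1)

# Multiple linear regression: X1...Xn (here just two of them).
multi_fit <- lm(y ~ x1 + x2)

print(coef(simple_fit))   # intercept and one slope
print(coef(multi_fit))    # intercept and one slope per X
```

The only change between the two calls is the right-hand side of the formula.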

Linear Regression is a very widely used Machine Learning algorithm because models that depend linearly on their unknown parameters are easier to fit.

Uses of Linear Regression ~

Prediction: after developing a Linear Regression model, we can predict the value of Y for any new value of X (based on the model fitted to a previous set of data).

For a given Y with multiple X values, X1...Xn, this technique can be used to measure the relationship between each X and Y, so we can find which X has the weakest relationship with Y and which has the strongest.
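
Both uses can be sketched in a few lines of base R. The data below is synthetic, and ranking the predictors by the absolute t value of their coefficients is only one possible way to compare them (an assumption of this sketch, not the only definition of a "strong" relationship).

```r
# Synthetic data, made up for illustration only.
set.seed(3)
n  <- 200
x1 <- rnorm(n)   # strong driver of y
x2 <- rnorm(n)   # weak driver of y
x3 <- rnorm(n)   # unrelated to y
y  <- 5 * x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# Use 1: predict Y for new values of X.
new_x <- data.frame(x1 = c(0, 1), x2 = c(0, 0), x3 = c(0, 0))
new_y <- predict(fit, newdata = new_x)
print(new_y)

# Use 2: rank the X's by the absolute t value of their coefficients,
# strongest relationship first.
coefs  <- summary(fit)$coefficients[-1, ]   # drop the intercept row
ranked <- coefs[order(-abs(coefs[, "t value"])), ]
print(ranked)
```

The t value is used here rather than the raw coefficient because raw coefficients depend on the units of each X.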

I went through all the theory above so that I could remember the basics; the rest is easy :).

------------------------------------------------------------------

So now I'd like to do an example in R, and the best dataset I could find was population.

Talking about population, how could I miss India? So somehow I managed to get a dataset:

Above is just a snapshot of the data; I had data from 1700 to 2014, and yes, some missing data in between.

In R, the caret package already provides an implementation of regression, so load it; for plotting I am using ggplot.

The first thing to do after getting the data is exploratory analysis. Well, I have 2 fields and no time :), so just a quick plot:
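
A quick plot of this kind could look like the sketch below. The data frame is a made-up stand-in, not the real dataset; only the column names `year` and `population` come from the model code later in this post. The sketch uses base R's `plot` so it runs without extra packages, with a comparable ggplot2 call shown as a comment.

```r
# Made-up stand-in for the real dataset (columns: year, population).
data <- data.frame(year = seq(1900, 2010, by = 10))
data$population <- 1000 * exp(0.012 * (data$year - 1900))

# Scatter plot of population against year.
plot(data$year, data$population,
     xlab = "year", ylab = "population", main = "Population by year")

# With ggplot2 loaded, a comparable plot would be:
# ggplot(data, aes(x = year, y = population)) + geom_point()
```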

Looking great, it's growing... and growing... so it's real data.

So first things first: split the data into 2 parts, training and testing.

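 # Put 70% of the rows in the training set, chosen by caret's createDataPartition
 allTrainData <- createDataPartition(y=data$population,p=0.7,list=FALSE)
 training <- data[allTrainData,]
 # The remaining 30% become the testing set
 testing <- data[-allTrainData,]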

So now I have X and Y; I simply want to find the population based on the year, or vice versa.

Don't worry, the caret package already wraps R's implementation of the linear regression algorithm.
For the formula behind it, please check my other blog post on the details of Linear Modelling, but here it is:

 model <- train(population~.,method="lm",data=training)  
 finalModel <- model$finalModel

1 line, that's all: method="lm". Isn't it extraordinary :)
So here is the summary:

 Call:
 lm(formula = .outcome ~ ., data = dat)
 Residuals:
     Min      1Q  Median      3Q     Max
 -186364 -164118  -83667  106876  811176
 Coefficients:
             Estimate Std. Error t value Pr(>|t|)
 (Intercept) -6516888     668533  -9.748 4.69e-16 ***
 year            3616        346  10.451  < 2e-16 ***
 ---
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 Residual standard error: 216300 on 97 degrees of freedom
 Multiple R-squared: 0.5296,     Adjusted R-squared: 0.5248
 F-statistic: 109.2 on 1 and 97 DF, p-value: < 2.2e-16
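
The two estimates in the Coefficients table define the fitted line, population = intercept + slope * year. As a small worked sketch, the function below plugs in the numbers from the summary; the name `predict_population` is a hypothetical helper, and the result is in whatever units the population column of the dataset uses.

```r
intercept <- -6516888   # (Intercept) estimate from the summary
slope     <- 3616       # year estimate from the summary

# Fitted line: population = intercept + slope * year
predict_population <- function(year) intercept + slope * year

print(predict_population(2014))                              # 765736
print(predict_population(2015) - predict_population(2014))   # 3616 per year
```

The same line gives a negative value for any year before about 1802, which is one hint that a straight line is only a rough fit for data going back to 1700.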

Details about the summary: Linear Regression Model Summary

So now let's plot the fitted vs. residuals graph and see how well the model worked.
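
A fitted vs. residuals plot can be sketched in base R as below. This is not the exact plotting code used for this post: the data is synthetic, and `fit` is a plain `lm` model standing in for `finalModel`.

```r
# Synthetic, curved "population" data, made up for illustration only.
set.seed(4)
data <- data.frame(year = 1901:2000)
data$population <- 1000 * exp(0.02 * (data$year - 1900)) + rnorm(100, sd = 50)

fit <- lm(population ~ year, data = data)

# Residuals against fitted values, with a reference line at zero.
plot(fitted(fit), resid(fit), xlab = "fitted", ylab = "residuals")
abline(h = 0, lty = 2)
```

If the points form a clear curve around the zero line instead of a random cloud, the straight line is missing some structure in the data.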

Somewhat weird :) but at least the line went through almost all the data.

Now, how well the model worked:

Well, my data is weird anyway, so seriously, it worked pretty well, believe me :) (the R-squared of 0.53 in the summary above says the straight line explains about half of the variance in population).

So now we should try the model on the testing dataset, and that is straightforward as well:

 pred <- predict(model,testing)
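
One way to judge the predictions is the root mean squared error (RMSE) on the testing set. The sketch below assumes synthetic data and a plain `lm` model in place of the caret model, with `sample` standing in for `createDataPartition`; the variable names mirror the ones used in this post.

```r
# Synthetic data, made up for illustration only.
set.seed(5)
data <- data.frame(year = 1901:2000)
data$population <- 500 + 30 * (data$year - 1900) + rnorm(100, sd = 40)

# 70/30 split into training and testing rows.
idx      <- sample(nrow(data), size = 70)
training <- data[idx, ]
testing  <- data[-idx, ]

model <- lm(population ~ year, data = training)
pred  <- predict(model, testing)

# RMSE: typical size of the prediction error, in population units.
rmse <- sqrt(mean((pred - testing$population)^2))
print(rmse)
```

With caret loaded, `postResample(pred, testing$population)` reports RMSE together with R-squared in one call.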

Disclaimer: I am not a PhD holder or a data scientist; what I do is out of self-interest and for learning, so it may contain some serious mistakes :)


Monday, September 22, 2014

