Friday, May 30, 2014

Understanding Linear Regression and Code in R




What actually is Regression?

Regression is the attempt to establish a mathematical relationship between variables.
It can be used to extrapolate, or to predict the value of one variable from other given variables.

For example, by collecting flood data over the years, we can predict the chance of a flood in any given year.


So, the basic formula for a linear relationship is:


y = b + mx

where x is an independent variable,
y is a dependent variable,
m is the slope, and
b is the intercept.

And we are trying to predict the value of y from a given x.
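
As a quick illustration (a minimal sketch with made-up numbers, not data from this post), we can simulate points from a known line and let R's lm() recover b and m:

 set.seed(42)                        # reproducible example  
 x <- 1:50                           # independent variable  
 y <- 3 + 2 * x + rnorm(50, sd = 5)  # y = b + m*x plus random noise (b = 3, m = 2)  
 fit <- lm(y ~ x)  
 coef(fit)                           # estimates should come out close to 3 and 2  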


"Correlation" measures the dependability of the relationship (the goodness of fit of the data to the mathematical relationship). It is a measure of how well one variable can predict the other (given the context of the data), and determines the precision you can assign to a relationship


NOTE : Alternative names for independent variables (especially in data mining and predictive modeling) are input variables, predictors or features. Dependent variables are also called response variables, outcome variables, target variables or output variables. The terms "dependent" and "independent" here have no direct relation to the concept of statistical dependence or independence of events.
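
Coming back to correlation: R's cor() computes the correlation coefficient directly. A quick sketch, continuing with the simulated x and y from above:

 cor(x, y)          # close to +1: strong positive linear relationship  
 cor(x, -y)         # close to -1: strong negative linear relationship  
 cor(x, rnorm(50))  # near 0: no linear relationship  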



THE COEFFICIENT

===============

The easiest thing to understand here is this: if an x value sits 0.65 standard deviation units (standard deviation being a quantity expressing by how much the members of a group differ from the mean value for the group) from its mean, then its y pair should also sit 0.65 standard deviation units from its mean IF THERE IS A PERFECT FIT.




In a perfect fit, a y value's position in its distribution will be the same as its corresponding x value's position in its distribution: a one-to-one correspondence.


The correlation coefficient measures only the degree of linear association between two variables.
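
A tiny sketch of that one-to-one idea in R: standardize both variables (subtract the mean, divide by the standard deviation), and with a perfect linear fit the z-scores line up exactly:

 x <- 1:10  
 y <- 3 + 2 * x                # perfect linear fit, no noise  
 zx <- (x - mean(x)) / sd(x)   # z-scores of x  
 zy <- (y - mean(y)) / sd(y)   # z-scores of y  
 all.equal(zx, zy)             # TRUE: one-to-one correspondence  
 cor(x, y)                     # exactly 1  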




LINEAR MODEL

============

The learning algorithm will learn the set of parameters such that the sum of squared errors, Σ(y_actual - y_estimate)^2, is minimized.
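
To see this concretely (a sketch of the criterion, not of how lm is implemented internally), compare the sum of squared errors of the fitted line against any other line, here the true underlying one; the fitted line always has a smaller or equal SSE:

 set.seed(1)  
 x <- 1:20  
 y <- 1 + 0.5 * x + rnorm(20)  
 fit <- lm(y ~ x)  
 sum((y - fitted(fit))^2)       # SSE of the least-squares line  
 sum((y - (1 + 0.5 * x))^2)     # SSE of the true underlying line: never smaller  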


Here I tried the linear model (lm) function on R's built-in datasets, and worked through the iris example:




 # Scatter plot of petal length vs petal width, coloured and shaped by species  
 plot(iris$Petal.Length, iris$Petal.Width, col = c("red", "green", "blue")[unclass(iris$Species)],  
      pch = c(15, 24, 18)[unclass(iris$Species)], main = "Petal Length vs Width")  
 legend("topleft", pch = c(15, 24, 18), col = c("red", "green", "blue"),  
        legend = c("setosa", "versicolor", "virginica"))  
 lm_petal <- lm(Petal.Width ~ Petal.Length, data = iris)  # fit the linear model  
 abline(lm_petal, col = "black")                          # draw the fitted line  
 summary(lm_petal)  


On running the above, I got my lm output. Now, that output was a big, big mystery to me, as I actually needed to understand what all those details were about; and since I am working on data analysis anyway, it became my responsibility to understand the output. At first, the output looked like a devil to me, some nightmare of mathematical buzzwords.



[Figure: Plot of LM]



[Figure: Linear Model Output]



So from the "Plot of LM" figure it is quite clear that our straight black line passes close to almost all the data, and linearly we did quite well. How can we say that? Because the straight black line passes near the points of every species, and all the points lie close to the line.

To answer that, we need to understand some extremely basic and super important concepts in data statistics; yes, I used the word statistics, because this is mathematics.

STANDARD DEVIATION - In a simple, straightforward way, the standard deviation is a measure of how spread out the numbers are.
Read more about it here: Standard Deviation
A lower value of the standard deviation means all the data points are close to the mean; a higher value means the data points are scattered everywhere.

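A one-line check in R (note that the built-in sd() uses the sample formula, dividing by n - 1):

 v <- c(2, 4, 4, 4, 5, 5, 7, 9)  
 sd(v)                                          # built-in sample standard deviation  
 sqrt(sum((v - mean(v))^2) / (length(v) - 1))   # the same value, computed by hand  
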

Now, from the output: the residuals are the differences between the observed values and the values the model predicts (y_actual - y_estimate), and their spread around the fitted line is an estimate of how accurately the dependent variable is being modeled.
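
As a quick check (a sketch assuming the lm_petal model fitted above), the residuals reported by lm are exactly observed minus fitted values:

 res <- iris$Petal.Width - fitted(lm_petal)  # y_actual - y_estimate for every row  
 all.equal(res, residuals(lm_petal))         # TRUE: identical to what lm reports  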


COEFFICIENTS:
Each coefficient in the model is a Gaussian (normal) random variable. The β̂i is the estimate of the mean of the distribution of that random variable, and the standard error is the square root of the variance of that distribution. It is a measure of the uncertainty in the estimate of βi.

t-value - The t statistic is the estimate (β̂i) divided by its standard error (σ̂i). In simple words, the t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0.

p-value - The p-value is an estimate of the probability of seeing a t-value as extreme as, or more extreme than, the one you got, if you assume that the null hypothesis is true (the null hypothesis is usually "no effect", unless something else is specified). So if the p-value is very low, the data you are seeing are counter-indicative of a zero effect. In other situations, you can get a p-value based on other statistics and variables.

More on the p-value here: p-value
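
All four of these quantities appear as columns in the coefficient table, and the t value can be re-derived by hand (a small sketch, again using the lm_petal fit from above):

 coefs <- summary(lm_petal)$coefficients      # Estimate, Std. Error, t value, Pr(>|t|)  
 coefs  
 coefs[, "Estimate"] / coefs[, "Std. Error"]  # reproduces the "t value" column  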

Residual Standard Error - The residual standard error is an estimate of the parameter σ. The assumption in ordinary least squares is that the residuals are individually described by a Gaussian (normal) distribution with mean 0 and standard deviation σ.
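
This one can also be verified by hand: it is the square root of the sum of squared residuals divided by the residual degrees of freedom (a sketch using lm_petal again):

 sqrt(sum(residuals(lm_petal)^2) / df.residual(lm_petal))  # computed by hand  
 summary(lm_petal)$sigma                                   # same value from summary()  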



One more point: the stars in the output summary show the significance of each feature for calculating the result; if a feature has no star (*), you can usually ignore it.

That's enough; we now have a lot more detail, so let's move on to the actual picture. From our output we can conclude that our lm fits the dataset closely and will give us proper results.


Now, on the same set of data, I thought of running a classification algorithm using lm; that is, I tried to derive the result as a classification of "isSetosa".
To do this, I first divided my data into a training set and a test set, then added a "setosa" check column to my training set and ran lm:


 # Hold out every 5th row as the test set  
 testData <- which(1:nrow(iris) %% 5 == 0)  
 traindata <- iris[-testData, ]  
 testdata <- iris[testData, ]  
 # Regression: predict Petal.Width from Petal.Length  
 lm_petal <- lm(Petal.Width ~ Petal.Length, data = traindata)  
 prediction <- predict(lm_petal, testdata)  
 round(prediction, 3)  
 cor(prediction, testdata$Petal.Width)  
 # Trying to run a classification algorithm on the iris example  
 # Add a new column for binary classification  
 newCol <- data.frame(isSetosa = (traindata$Species == "setosa"))  
 # Append the column to the existing training data  
 iris_new <- cbind(traindata, newCol)  
 head(iris_new)  
 formula <- isSetosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width  
 lm_new_out <- lm(formula, data = iris_new)  
 summary(lm_new_out)  
 prediction_new <- predict(lm_new_out, testdata)  
 # Correlate the classifier's output with each feature on the test set  
 cor(prediction_new, testdata$Petal.Length)  
 cor(prediction_new, testdata$Petal.Width)  
 cor(prediction_new, testdata$Sepal.Length)  
 cor(prediction_new, testdata$Sepal.Width)  
 round(prediction_new, 1)  
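
One thing to note: lm outputs a continuous score rather than TRUE/FALSE, so to actually classify we have to threshold it. A minimal sketch, assuming a cutoff of 0.5 (for a real classifier, glm with a binomial family would be the more standard choice):

 predicted_class <- prediction_new > 0.5       # assumed cutoff: score above 0.5 means setosa  
 actual_class <- testdata$Species == "setosa"  
 table(predicted_class, actual_class)          # confusion matrix on the test set  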


On running this, I got the following output:




So from the above output we can see the starred features: one star (*) for Sepal.Length, more stars for the other two (Petal.Length and Petal.Width), and no star for Sepal.Width.

To double-check this, I ran cor() on the same features, and the correlations gave me the same result.

And to test the soundness of the model, I ran predictions with it on the test set and found the trained output to be superb.


