Principles of Data Analysis
Principle 1 -
- Always frame a hypothesis that can be compared with another hypothesis. The question to keep asking is "compared to what?".
Principle 2 -
- Justify the causal framework you derive, i.e. explain why you believe the data behaves the way you claim.
Principle 3 -
- Show multivariate data, i.e. more than one dimension of the data, to justify your point precisely. The data speaks for itself, and more data with a proper representation makes the point stronger.
|Multivariate Data Representation|
This kind of graphical representation shows the generalized weight of each feature parameter in a neural network, so a single multivariate plot conveys several dimensions of the model at once.
Principle 4 -
- Integration of evidence: try to show as many dimensions of the data as you can instead of limiting yourself to one. Use words, numbers, images, or anything else you have to present the data.
The main idea is that you decide how the data is represented, not the tool, which would otherwise just take the data and plot something arbitrary.
Principle 5 -
- State the sources the data came from, as evidence backing your plot.
|Main Principles of Exploratory Data Analysis|
Why Exploratory Data Analysis
- Exploratory plots can be made quickly and easily
- They help your own understanding of the data
- They show what the data looks like, which is the most important thing to know about any dataset
- Look and feel is NOT the primary concern in exploratory data analysis
Example of Exploratory Data Analysis
I have my bank account details and I wanted to find out whether any of my debit transactions ever crossed 80000.
First I'll work with 1-dimensional plots: box plot, histogram and bar plot.
For now I am treating my data as a categorical variable.
So my data looks like above. First I wanted to draw a box plot of the Amount column to see how the data is spread around the median.
data <- read.csv("D:/tmp/mlclass-ex1-005/mlclass-ex3-005/R-Studio/account.csv")
head(data)
# Remove the thousands separators (commas) from Amount and convert it to numeric
data$Amount <- as.numeric(gsub(",", "", data$Amount))
# Plotting the box plot
boxplot(data$Amount, col = "yellow")
# This shows that most of my data is below 8000, with very few points around 80000
abline(h = 80000)  # h = horizontal
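The comma-stripping step used when loading the data can be checked on its own before trusting any plot. This is a minimal sketch on a made-up vector: `raw_amount` and its values are assumptions for illustration, not rows from the real account file.

```r
# Hypothetical raw values, as they might appear in a CSV export
raw_amount <- c("1,200.50", "85,000", "300", "12,34,567")

# Remove every comma literally (fixed = TRUE), then convert to numeric
clean_amount <- as.numeric(gsub(",", "", raw_amount, fixed = TRUE))
print(clean_amount)

# as.numeric() turns anything it cannot parse into NA, so count the failures
n_failed <- sum(is.na(clean_amount))
print(n_failed)
```

If `n_failed` is not zero on the real column, some other character (a currency symbol, a stray space) is probably still in the way.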
Well, I realized that this is useless for me, as I cannot derive anything from this plot.
So I decided to try something better and drew a histogram with a rug:
# Plotting a histogram of the data to get more detail
hist(data$Amount, col = "light green")
# rug shows exactly where the individual points are
rug(data$Amount)
# The rug shows that the bulk of the data lies within 25000
This looks better, but I can improve it further by marking the median on the same graph and adding more break points:
# Adding breaks splits the histogram into a larger number of smaller bins
hist(data$Amount, col = "light blue", breaks = 20)
abline(v = 85000, lwd = 4)  # v = vertical
# hist doesn't show the median, so it can be added here as well
abline(v = median(data$Amount), col = "red", lwd = 4)
Great, now I have something to work with.
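The plots give the visual answer; the original question (did any debit cross 80000?) can also be answered numerically. The following is a rough sketch on invented data: `sample_data`, its values and the `threshold` variable are assumptions, with only the column names `Amount` and `Transaction` and the "Dr."/"Cr." labels taken from the code in this post.

```r
# Hypothetical stand-in for the cleaned account data
sample_data <- data.frame(
  Amount      = c(500, 1200, 90000, 25000, 82000, 300),
  Transaction = c("Dr.", "Cr.", "Dr.", "Dr.", "Cr.", "Dr.")
)
threshold <- 80000

# Five-number summary plus mean, the numeric counterpart of the box plot
print(summary(sample_data$Amount))
amount_median <- median(sample_data$Amount)

# Debit rows whose amount is above the threshold
large_debits <- subset(sample_data, Transaction == "Dr." & Amount > threshold)
print(large_debits)
print(nrow(large_debits))
```

Running the same `subset()` call on the real data frame would list the exact transactions that the line at the threshold only hints at.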
But how about trying the same with another feature such as Transaction, which I can encode as a binary value, i.e. 1 for Debit and 0 for Credit?
Box plot: a horrible mistake.
Histogram: well, it shows something, but it is not meant for this.
And finally the bar plot turns out to be the best fit, because it gives the frequency of 0 and 1 directly. From it we can read off how often my account is credited, which is extremely rare. :(
# Convert Transaction to numeric (1 = Debit, 0 = Credit) so it can be plotted
data$Transaction <- as.numeric(data$Transaction == "Dr.")
# A box plot of Transaction tells you nothing because there are only 2 distinct values
boxplot(data$Transaction, col = "yellow")
# Here we can read off the frequency of 0 and 1
hist(data$Transaction, col = "light green")
# A bar plot is the best fit for seeing whether Debit or Credit dominates your account
barplot(table(data$Transaction), col = "light blue", main = "Ratio of Credit vs Debit")
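The bar plot is drawn from `table()`, so the same counts can be printed as numbers and as proportions. A small sketch, assuming a made-up `transaction` vector rather than the real column:

```r
# Hypothetical transaction types; the real values come from data$Transaction
transaction <- c("Dr.", "Dr.", "Cr.", "Dr.", "Cr.", "Dr.")

# Frequency of each type, exactly what barplot(table(...)) draws
counts <- table(transaction)
print(counts)

# The same counts expressed as a share of all transactions
shares <- prop.table(counts)
print(round(shares, 2))
```

Proportions are often easier to compare than raw counts when two accounts, or two years, have different numbers of transactions.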
Now let's move on to 2-dimensional plots such as the scatter plot.
Here I just wanted to compare Debit and Credit for my account transactions, and I was surprised to see that Credit comes out on top.
For the same purpose, a histogram representation makes this clearer.
|Amount vs. day of spending, with Debit or Credit marked on top of it|
data <- read.csv("D:/tmp/mlclass-ex1-005/mlclass-ex3-005/R-Studio/account.csv")
head(data)
# Remove the thousands separators (commas) from Amount
data$Amount <- as.numeric(gsub(",", "", data$Amount))

# 2-dimensional box plot: Amount split by Credit or Debit
boxplot(Amount ~ Transaction, data = data, col = "red")
# Credit is more than Debit. How is that possible? Because I am comparing
# amounts, not frequency, and the amounts tell a totally different story.

# Histogram representation for the same purpose, one per transaction type
hist(subset(data, Transaction == "Dr.")$Amount, col = "green", breaks = 20)
hist(subset(data, Transaction == "Cr.")$Amount, col = "green", breaks = 20)
#data$Transaction <- as.numeric(data$Transaction=="Dr.")

# Scatter plot
# Split the date into day, month and year
data$Transaction.Date <- as.Date(data$Transaction.Date, format = "%d/%m/%Y")
month <- as.numeric(format(data$Transaction.Date, format = "%m"))
day <- as.numeric(format(data$Transaction.Date, format = "%d"))
year <- as.numeric(format(data$Transaction.Date, format = "%Y"))
# Shows the amount transacted on each day; col separates the points by Dr. or Cr.
# as.factor() is needed when read.csv returns strings (the default since R 4.0)
with(data, plot(day, Amount, col = as.factor(Transaction)))  # Dr. is drawn in red
abline(h = 82000, lwd = 2, lty = 2, col = "green")
#points(data$Amount,data$Transaction == "Dr.",col = "blue")
#points(data$Amount,data$Transaction == "Cr.",col = "red")
# Separate the data by year
#with(data,plot(day,Amount, col = Year))

# Multiple scatter plots, like I did with the histograms
s1 <- subset(data, Transaction == "Dr.")
s2 <- subset(data, Transaction == "Cr.")
# day does not have the same length as the subsets, so plot against the row index
with(s1, plot(seq_len(nrow(s1)), Amount, main = "Debit"))
with(s2, plot(seq_len(nrow(s2)), Amount, main = "Credit"))
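The surprise that Credit beats Debit comes from comparing amounts instead of frequencies. Putting both measures side by side makes that visible. This is only a sketch on invented numbers: `sample_data` and its values are assumptions, chosen so that debits are frequent but small and credits are rare but large.

```r
# Hypothetical data: many small debits, a few large credits
sample_data <- data.frame(
  Amount      = c(500, 1200, 300, 90000, 700, 82000),
  Transaction = c("Dr.", "Dr.", "Dr.", "Cr.", "Dr.", "Cr.")
)

# How often each type occurs (what the bar plot shows)
n_by_type <- table(sample_data$Transaction)

# How much money each type moves (what the box plot by type reflects)
total_by_type  <- tapply(sample_data$Amount, sample_data$Transaction, sum)
median_by_type <- tapply(sample_data$Amount, sample_data$Transaction, median)

print(n_by_type)
print(total_by_type)
print(median_by_type)
```

In this made-up sample Debit wins on count while Credit wins on both total and median amount, which is the same pattern the box plot and the bar plot disagree about.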
So whenever new data arrives, some basic visualization will clear up most of the confusion. That is what we did here, and that is what is called Exploratory Data Analysis.
We got a preview of the data and more or less tried to follow the principles... or did we? Have I actually followed all of them? :)
So it is a quick and dirty way to summarize the data, and it makes it easier to decide on the model and the strategy for the next step.