This blog does not explain development on Hadoop; it explains what Hadoop is and what its basic components are.
Some general points on Hadoop-
- Huge volumes of data, with a lot of variety, need to be processed, and that is the main idea behind Hadoop.
- Data is stored in HDFS, and HDFS corresponds to the cluster.
- Processing of this data happens through MapReduce, and since the data is already on the cluster, Hadoop emphasizes in-place processing.
- Hive and Pig each have their own way of expressing code (queries and Pig Latin statements respectively), but both ultimately produce MapReduce jobs that run on the Hadoop cluster.
- HBase is a real-time database that runs on top of HDFS.
- Impala is used for low-latency queries and runs directly on HDFS.
- Some other Hadoop-ecosystem tools are Flume, Hue, Oozie, Mahout, etc.
HDFS
Suppose we have a text file of size 150 MB.
In HDFS (the Hadoop Distributed File System), this data will be saved in blocks.
Each block is typically 64 MB, so our file will be split into three blocks of 64 MB, 64 MB, and 22 MB.
Each block is then saved on one of the nodes in the cluster.
On every node of the cluster there is a daemon running all the time called the DataNode.
When we need to know which blocks are held by which node, that information is handled by a daemon known as the NameNode.
The information stored in the NameNode is the metadata.
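As a small illustration of that metadata, here is a rough sketch (not from the original post) that asks the NameNode, through the HDFS Java API, which blocks make up a file and where they live; the path /data/sample.txt is hypothetical and just stands in for a file already on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replace with a file that exists on your cluster
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which blocks make up the file and where they live
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Run against the 150 MB example above, this would typically print three blocks, each listing the DataNodes that hold a copy.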
Problems with this System
There can be several problems with this system; a few major ones are:
- Network failure, in which case nodes won't be able to talk to each other
- Disk failure on a DataNode
- Disk failure on the NameNode, in which case we can lose our metadata
Data Redundancy on HDFS
To sort out the above problems, Hadoop keeps multiple copies of each block on different nodes (three copies by default). Whenever a node fails, reads are automatically redirected to another copy, and the Hadoop daemons are instructed to re-replicate the lost blocks so the system is made whole again.
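As a hedged sketch of how that replication can be inspected or changed per file through the HDFS Java API (the path is hypothetical; the cluster-wide default comes from the dfs.replication setting, which is 3 out of the box):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for illustration
        Path file = new Path("/data/sample.txt");

        // Replication factor currently recorded by the NameNode for this file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);

        // Ask HDFS to keep three copies of every block of this file
        fs.setReplication(file, (short) 3);
    }
}
```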
Name Node Failure
If the NameNode fails, it is a single point of failure: everything just stops, there is no metadata, no information, and finally no jobs can run on the data.
To ensure high availability of the NameNode, a standby copy of the NameNode is maintained as well, which ensures the NameNode role is always available: even if the active NameNode goes down, there is a backup NameNode to take over.
MAP-REDUCE
If we have huge data, processing it in a single thread may take a long time, and we can also run out of memory; both problems are solved by a MapReduce job.
Hadoop doesn't run on all the data at once; instead, it runs on chunks of the data in parallel.
How Map-Reduce Actually Works
Consider a scenario where we need to compute the total amount spent by a credit card in each city.
To do this work, we have a set of map tasks and a set of reduce tasks.
Each mapper takes a small portion of the data to process and works out the city and the corresponding amount for each record. By the end, each mapper has a pile of cards per city for its share of the transactions.
So, for example, one mapper might end up with a large pile of NYC and MIAMI records but none for LA, while another mapper ends up with a different mix.
That's the job of the mapper.
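As a rough sketch of what such a mapper could look like with the Hadoop Java MapReduce API, assuming each input line is a comma-separated pair like "NYC,120.50" (a record format and class name invented here purely for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (city, amount) pair per transaction line, e.g. "NYC,120.50"
public class CityAmountMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text city = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        try {
            city.set(fields[0].trim());
            amount.set(Double.parseDouble(fields[1].trim()));
            context.write(city, amount);
        } catch (NumberFormatException e) {
            // skip lines whose amount is not a number
        }
    }
}
```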
Now the Job of the Reducer Starts
Now we have a set of reducers, and we are going to assign some cities to each of the reducers.
Reducer 1 – NYC
Reducer 2 – MIAMI, LA
Each reducer collects only the piles it needs, directly from the mappers, and that is fast because each mapper already has its data piled up per city, so no further work is required on the mapper side.
The only thing the reducer needs to do is add up all the amounts in its piles, and that gives the total transactions that happened in each city by credit card.
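A matching reducer sketch, under the same assumptions as the mapper above, simply sums the amounts it receives for each city:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one city per call, together with all of the amounts the mappers
// emitted for that city, and writes out the total.
public class CityTotalReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text city, Iterable<DoubleWritable> amounts, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable amount : amounts) {
            sum += amount.get();
        }
        total.set(sum);
        context.write(city, total);
    }
}
```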
Mappers are little programs that run on small sets of data and perform an operation on them; their output is known as intermediate data.
Hadoop works on key-value pairs. After the mappers operate on the data, the reducers get that intermediate data, work on each group of values coming from the map phase, and finally produce the result.
Hadoop takes care of the shuffle and sort phase. You do not have to sort the keys in your reducer code; you get them in already sorted order.
In our case, the reducer's key is the city name, and the values it processes are the amounts for that city.
JOB TRACKER
When we run a MapReduce job, it is submitted to the JobTracker. The JobTracker is responsible for splitting the job into different mappers and reducers depending on the volume of data.
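Here is a minimal driver sketch that wires the hypothetical CityAmountMapper and CityTotalReducer above into a job and submits it (on classic MR1 the submission goes through the JobTracker; on YARN clusters the ResourceManager plays that role):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CityTotalsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "city transaction totals");

        job.setJarByClass(CityTotalsDriver.class);
        job.setMapperClass(CityAmountMapper.class);
        // Summing is associative, so the reducer can double as a combiner
        // to cut down the intermediate data sent over the network
        job.setCombinerClass(CityTotalReducer.class);
        job.setReducerClass(CityTotalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Input and output paths passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would be launched with something like: hadoop jar city-totals.jar CityTotalsDriver /input /output.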
TASK TRACKER
Running the actual MapReduce tasks on each node is done by a daemon called the TaskTracker.
A TaskTracker runs on each of these nodes.
Because the TaskTracker runs on the same machine as the DataNode, the Hadoop framework can run a task on the same machine where the data resides, which helps to reduce a lot of network traffic.
So a mapper normally works on the data stored on its own machine, but sometimes the TaskTracker on the node holding the data is busy; in that case the work is handled by a TaskTracker on a different node, the data travels over the network, and that sometimes makes things slower.
Things we can do with Map-Reduce
Data processing
Log Analyzing
Big data queries
Machine Learning
Data Mining
Web Crawling
Item Classification
Fraud Detection
I have already set up Hadoop and many more components on top of it: Mesos, Spark, Hive, Sqoop, YARN, and Pig, and I have worked with Talend. I am currently working with more than 10 TB of data and setting up a big data infrastructure for ad hoc reporting and analytics.