This blog post does not cover development on Hadoop; it explains what Hadoop is and what its basic components are.

Some general points on Hadoop:

-Hadoop is about processing huge volumes of data of many varieties.
-Data is stored in HDFS, a file system spread across the nodes of the cluster.
-The data is processed by MapReduce, and since the data is already in the cluster, Hadoop emphasizes in-place processing.
-Hive and Pig each have their own way of expressing code (SQL-like queries in Hive, Pig Latin statements in Pig), but both are ultimately turned into MR jobs that run on the Hadoop cluster.
-HBase is a real-time database that runs on top of HDFS.
-Impala is used for low-latency queries and runs directly on HDFS.
-Some other Hadoop ecosystem tools are Flume, Hue, Oozie, Mahout, etc.

HDFS

Consider a text file of size 150 MB.

In HDFS (Hadoop Distributed File System), this data is saved in blocks.

Each block is typically 64 MB, so our file is split into three blocks: two of 64 MB and one of 22 MB.

Each block is then saved on one of the nodes in the cluster.
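To make the arithmetic concrete, here is a minimal Python sketch of how a file is cut into blocks. It is only an illustration, not HDFS code: the function name split_into_blocks is made up for this post, and the 64 MB block size is the value used in the example above.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full except possibly the last one.
        size = min(block_size_mb, remaining)
        blocks.append(size)
        remaining -= size
    return blocks


print(split_into_blocks(150))  # [64, 64, 22]
```

Note that the last block only takes up as much space as the data left over, here 22 MB.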

On every node of the cluster there is a daemon running all the time, called the DATANODE.

The information about which blocks are stored on which node is handled by a daemon known as the NAMENODE.

The information stored in the Name Node is metadata.
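One way to picture this metadata is as a simple lookup table from blocks to nodes. The sketch below is a toy model only: the file, block and node names are hypothetical, and the real Name Node keeps much richer information than this.

```python
# Toy model of Name Node metadata: which node holds which block.
# All names here are hypothetical and exist only for illustration.
block_locations = {
    "file.txt:block_1": "node1",
    "file.txt:block_2": "node2",
    "file.txt:block_3": "node3",
}


def blocks_on_node(metadata, node):
    """List the blocks that the given node is responsible for."""
    return sorted(block for block, owner in metadata.items() if owner == node)


def node_for_block(metadata, block):
    """Look up which node stores a block, or None if the block is unknown."""
    return metadata.get(block)


print(node_for_block(block_locations, "file.txt:block_2"))  # node2
print(blocks_on_node(block_locations, "node3"))  # ['file.txt:block_3']
```

The Data Nodes hold the actual bytes; the table only says where to find them, which is why losing it is so serious.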

Problems with this System

Several things can go wrong with this system. A few major ones:

· Network failure, in which case nodes won’t be able to talk to each other

· Disk failure on a Data Node

· Disk failure on the Name Node, in which case we can lose our metadata.

Data Redundancy on HDFS

To address the above problems, HDFS keeps multiple copies of each block on different data nodes (three by default). Whenever a node fails, the data is read from one of the other copies, and the Hadoop daemons are instructed to re-create the missing copies so the system is whole again.
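The effect of redundancy can be sketched in a few lines of Python. This is a simplification under stated assumptions: it uses a replication factor of 3 (the HDFS default, which is configurable), it places copies round-robin rather than following HDFS's real placement policy, and the function, block and node names are invented for the example.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: a toy placement rule, not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a node that is alive."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


placement = place_replicas(["b1", "b2", "b3"], ["n1", "n2", "n3", "n4"])
print(placement["b1"])                            # ['n1', 'n2', 'n3']
print(readable_blocks(placement, {"n1", "n2"}))   # ['b1', 'b2', 'b3']
```

Even with two of the four nodes down, every block can still be read from a surviving copy.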

Name Node Failure

The name node is a single point of failure: if it fails, everything stops. There is no metadata, no information about where the blocks are, and so no job can run on the data.

To ensure high availability of the name node, a copy of it is maintained as well, so that even if the active name node goes down, a backup name node is available to take over.

MAP-REDUCE

With huge data, processing it in a single thread may take a long time and may also run out of memory. A MapReduce job solves both problems.

Hadoop doesn’t process all the data at once; instead it processes chunks of the data in parallel.

How Map-Reduce Actually Works

Consider a scenario where we need to build a map of the total amount spent by credit card in each city.

To do this work, we have a set of mappers and a set of reducers.

Each mapper takes a small portion of the data and, for each record, picks out the city and the corresponding amount. By the end, each mapper has a pile of transactions per city.

For example, mapper blue 1 has large piles of NYC and MIAMI records but no LA records, while that won’t necessarily be the case for mapper blue 2.

That’s the job of the mapper.
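The mapper's work can be sketched in plain Python. This is not Hadoop API code, just an illustration of the idea; the function name run_mapper and the sample transactions are made up.

```python
from collections import defaultdict


def run_mapper(records):
    """Sort one mapper's share of (city, amount) records into per-city piles."""
    piles = defaultdict(list)
    for city, amount in records:
        piles[city].append(amount)
    return dict(piles)


# Hypothetical records handed to mapper "blue 1".
blue_1 = run_mapper([("NYC", 25), ("MIAMI", 10), ("NYC", 5)])
print(blue_1)  # {'NYC': [25, 5], 'MIAMI': [10]}
```

This mapper ends up with no LA pile at all, simply because none of its records came from LA.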

Now the job of the reducers starts.

We have a set of reducers, and we assign some cities to each of them.

Reducer 1 – NYC

Reducer 2 – MIAMI , LA

Each reducer collects only the piles it needs directly from the mappers. This is fast because every mapper already has its data sorted into piles, so no further work is needed to find it.

All a reducer then has to do is add up the amounts on its piles, which gives the total amount of credit card transactions in each of its cities.
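The reducer side can be sketched the same way. The city assignment below mirrors the one in the text (Reducer 1 gets NYC, Reducer 2 gets MIAMI and LA); the piles, amounts and function name are invented for the example, and this is plain Python rather than real Hadoop code.

```python
# Hypothetical per-city piles produced by two mappers.
mapper_piles = [
    {"NYC": [25, 5], "MIAMI": [10]},             # mapper blue 1
    {"NYC": [7], "MIAMI": [3], "LA": [40, 2]},   # mapper blue 2
]

# Which reducer is responsible for which cities.
assignment = {"reducer_1": ["NYC"], "reducer_2": ["MIAMI", "LA"]}


def run_reducer(cities, all_piles):
    """Collect the piles for the assigned cities from every mapper and sum them."""
    totals = {}
    for city in cities:
        totals[city] = sum(
            amount for piles in all_piles for amount in piles.get(city, [])
        )
    return totals


results = {name: run_reducer(cities, mapper_piles) for name, cities in assignment.items()}
print(results)
# {'reducer_1': {'NYC': 37}, 'reducer_2': {'MIAMI': 13, 'LA': 42}}
```

Each reducer only ever touches the piles for its own cities, so the reducers can work independently of each other.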

Mappers are small programs that each run on a small portion of the data. The output they produce is known as intermediate data.

Hadoop works with key-value pairs. After the mappers have processed the data, the reducers receive the intermediate data, process each set of values coming from the mappers, and finally produce the result.

Hadoop takes care of the Shuffle and Sort phase. You do not have to sort the keys in your reducer code; they arrive already sorted.

In our case, the reducer's key is the city name, and the reducer processes the amounts recorded for that city.
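To show what the Shuffle and Sort phase hands to a reducer, here is a small simulation in plain Python. It only imitates the behaviour on one machine; the function names and sample pairs are hypothetical, and in a real cluster Hadoop does this step for you across nodes.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_outputs):
    """Merge all (key, value) pairs, sort them by key, and group values per key."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]


def reduce_sum(city, amounts):
    """The reducer for our example: total amount per city."""
    return city, sum(amounts)


# Hypothetical intermediate data from two mappers.
mapper_outputs = [
    [("NYC", 25), ("MIAMI", 10), ("NYC", 5)],
    [("LA", 40), ("NYC", 7), ("MIAMI", 3)],
]

grouped = shuffle_and_sort(mapper_outputs)
totals = [reduce_sum(city, amounts) for city, amounts in grouped]
print(totals)  # [('LA', 40), ('MIAMI', 13), ('NYC', 37)]
```

The reducer function never sorts anything itself; the keys reach it in sorted order, each with all of its values attached.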

JOB TRACKER

When we run an MR job, it is submitted to the Job Tracker. The Job Tracker is responsible for splitting the job into mappers and reducers, depending on the volume of data.

TASK TRACKER

The actual MR tasks on each node are run by a daemon called the Task Tracker.

A Task Tracker runs on each of these nodes.

Because the Task Tracker runs on the same machine as the Data Node, the Hadoop framework can run a task on the machine where its data resides, which saves a lot of network traffic.

So a mapper normally works on data stored on its own machine. Sometimes, however, the Task Tracker on the node holding the data is busy; in that case the task is handled by a Task Tracker on a different node, which reads the data over the network, and that sometimes makes things slower.
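The "prefer the local node, fall back to a remote one" idea can be sketched as follows. This is a deliberately simplified illustration, not the scheduler Hadoop actually uses; the function name assign_task and the node names are assumptions made for this example.

```python
def assign_task(nodes_with_block, busy_nodes, all_nodes):
    """Pick a node to process a block and report whether the data is local.

    Returns (node, is_local), or (None, False) when every node is busy.
    """
    # First choice: a free node that already stores the block.
    for node in nodes_with_block:
        if node not in busy_nodes:
            return node, True
    # Fallback: any free node, which must fetch the block over the network.
    for node in all_nodes:
        if node not in busy_nodes:
            return node, False
    return None, False


print(assign_task(["n1"], set(), ["n1", "n2"]))   # ('n1', True)
print(assign_task(["n1"], {"n1"}, ["n1", "n2"]))  # ('n2', False)
```

The second call is the slow case from the text: the node holding the data is busy, so another node has to pull the block across the network.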

Things we can do with Map-Reduce

Data processing

Log Analysis

Big data queries

Machine Learning

Data Mining

Web Crawling

Item Classification

Fraud Detection

I have already set up Hadoop and many more components on top of it: Mesos, Spark, Hive, Sqoop, YARN, and Pig, and I have worked with Talend. I am currently working with more than 10 TB of data and setting up a Big Data infrastructure for ad hoc reporting and analytics.

Data Engineer working with multiple Big Data technologies and Machine Learning

Monday, May 5, 2014

Hadoop Basics - A little behind Hadoop