Friday, April 25, 2014

Apache Oozie - Hadoop Job Scheduler Problems & installation

Apache Oozie – A Hadoop job Scheduler

What is Apache Oozie

Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs.
Its main task is to schedule jobs to run at given times or intervals, or when their input data becomes available.
Oozie is not limited to plain Hadoop jobs: it also supports Java MR jobs, streaming MR jobs, Pig, Hive, Sqoop and some specific Java jobs.

A very nice definition of Oozie that I grabbed from the internet:
Oozie is a system for describing the workflow of a job, where that job may contain a set of map-reduce jobs, Pig scripts, fs operations etc., and it supports forking and joining of the data flow.
It doesn't, however, allow you to stream the output of one MR job as the input to another: the map-reduce action in Oozie still requires an output format of some type, typically a file-based one, so the output from job 1 will still be serialized via HDFS before being processed by job 2.
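
To make that last point concrete, here is a minimal sketch of a two-step workflow definition, written out from the shell. The app directory, the action names and the inputDir/stagingDir/outputDir parameters are made up for illustration, and the mapper/reducer settings are left out. The only point is that job2 reads the HDFS directory that job1 wrote.

```shell
# Sketch only: scaffold a hypothetical workflow app directory.
# All names and ${...} parameters below are illustrative assumptions.
APP_DIR="${APP_DIR:-$(mktemp -d)}"

# The quoted 'EOF' keeps ${...} literal, so Oozie (not the shell) resolves them.
cat > "$APP_DIR/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.4" name="two-step-wf">
    <start to="job1"/>
    <action name="job1">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${stagingDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="job2"/>
        <error to="fail"/>
    </action>
    <action name="job2">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${stagingDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

echo "wrote $APP_DIR/workflow.xml"
```

The staging directory appears twice, once as the output of job1 and once as the input of job2, which is exactly the "serialized via HDFS" hand-off described above.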


If you are using Cloudera, get the version recommended for your Hadoop release.
I tried with oozie-3.3.0-cdh4.2.2, provided in the Cloudera repository.
If you work directly with the Apache Oozie source rather than the Cloudera-provided package, the only difference is that you need to build Oozie yourself.
To build Oozie (not needed for the Cloudera-provided setup):

 $ cd oozie-3.3.2/bin  
 $ ./ -DskipTests  

·        Prepare the war once

 bin/ prepare-war  

·        As I wanted to see the Oozie output in the web console, I ran the following command

 bin/ -extjs libext/  

·        Now create the Oozie schema (using MySQL, not the default Derby)

Before running the command, change conf/oozie-site.xml –
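
The original settings are not reproduced here. As a rough sketch, these are the JPAService properties that usually point Oozie at MySQL. The host, database name, user and password are placeholder assumptions, and the MySQL JDBC driver jar is assumed to be available to Oozie (for example in libext/).

```shell
# Sketch: print the <property> entries to paste inside <configuration>
# in conf/oozie-site.xml. DB_HOST, DB_NAME, DB_USER and DB_PASS are
# placeholders, not values taken from the post.
DB_HOST="${DB_HOST:-localhost}"
DB_NAME="${DB_NAME:-oozie}"
DB_USER="${DB_USER:-oozie}"
DB_PASS="${DB_PASS:-oozie}"

oozie_property() {
    # $1 = property name, $2 = property value
    printf '<property>\n    <name>%s</name>\n    <value>%s</value>\n</property>\n' "$1" "$2"
}

mysql_jpa_properties() {
    oozie_property oozie.service.JPAService.jdbc.driver   com.mysql.jdbc.Driver
    oozie_property oozie.service.JPAService.jdbc.url      "jdbc:mysql://$DB_HOST:3306/$DB_NAME"
    oozie_property oozie.service.JPAService.jdbc.username "$DB_USER"
    oozie_property oozie.service.JPAService.jdbc.password "$DB_PASS"
}

mysql_jpa_properties
```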


Then run the command-

 bin/ create -sqlfile oozie.sql -run  
·        Update Hadoop access for the oozie user in conf/oozie-site.xml
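
The exact entries from the original setup are not shown. In a stock Apache install, the proxy-user (impersonation) settings for the Oozie server user normally live in Hadoop's core-site.xml, so treat the sketch below as an assumption about what this step looks like. The user name `oozie` and the `*` wildcards for hosts and groups are illustrative and far too permissive for production.

```shell
# Sketch: print the Hadoop proxy-user entries for the Oozie server user.
# OOZIE_USER is an assumption; use the account your Oozie server runs as.
OOZIE_USER="${OOZIE_USER:-oozie}"

proxyuser_properties() {
    for key in hosts groups; do
        printf '<property>\n    <name>hadoop.proxyuser.%s.%s</name>\n    <value>*</value>\n</property>\n' \
            "$OOZIE_USER" "$key"
    done
}

proxyuser_properties
```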

·        Now turn on Oozie –
 bin/ start  

·        To check the Oozie status-

From command line-
 bin/oozie admin -oozie http://localhost:11000/oozie -status  
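
If you want to script this check, a small wrapper along these lines may help. It assumes the status command prints a line reading "System mode: NORMAL" on a healthy server; `is_normal_mode` is a helper name made up for this sketch.

```shell
# Sketch: succeed only when the server reports NORMAL mode.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

is_normal_mode() {
    # Reads the output of `oozie admin -status` on stdin.
    grep -q 'System mode: NORMAL'
}

# Only call the real CLI when it is installed on this machine.
if command -v oozie >/dev/null 2>&1; then
    if oozie admin -oozie "$OOZIE_URL" -status | is_normal_mode; then
        echo "Oozie is up"
    else
        echo "Oozie is not in NORMAL mode" >&2
    fi
fi
```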

·        Run the Oozie example
 oozie job -oozie http://localhost:11000/oozie -config /home/training/tmp/examples/apps/map-reduce/ -run  
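
The -config option expects a job properties file. The sketch below writes one modelled on the properties shipped with the bundled examples. The host names, ports and HDFS layout are assumptions for a single-node setup, so adjust them to your cluster.

```shell
# Sketch: write a job.properties file for the map-reduce example.
# PROPS_DIR and every value in the file are illustrative assumptions.
PROPS_DIR="${PROPS_DIR:-$(mktemp -d)}"

# The quoted 'EOF' keeps ${...} literal so that Oozie resolves them.
cat > "$PROPS_DIR/job.properties" <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
examplesRoot=examples
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce
outputDir=map-reduce
EOF

# Exporting OOZIE_URL lets later `oozie` calls omit the -oozie option.
export OOZIE_URL="http://localhost:11000/oozie"

echo "wrote $PROPS_DIR/job.properties"
```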

·        Check the status of the job from the command line-
 oozie job -oozie http://localhost:11000/oozie -info 64-256532451321-oozie-tucu  

or you can check the status in http://localhost:11000/oozie
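
To avoid copying the job ID by hand, the two commands can be chained. This is only a sketch: it relies on -run printing a line of the form "job: <id>", and `extract_job_id` and `JOB_PROPERTIES` are names made up here.

```shell
# Sketch: submit a job, capture its ID, then ask for its status.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"
JOB_PROPERTIES="${JOB_PROPERTIES:-job.properties}"   # hypothetical path

extract_job_id() {
    # Keeps only the ID from a line such as "job: <id>".
    sed -n 's/^job: *//p'
}

# Only call the real CLI when it is installed on this machine.
if command -v oozie >/dev/null 2>&1; then
    JOB_ID=$(oozie job -oozie "$OOZIE_URL" -config "$JOB_PROPERTIES" -run | extract_job_id)
    oozie job -oozie "$OOZIE_URL" -info "$JOB_ID"
fi
```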

Problems during setup

Oozie impersonate user error-

Always set the Oozie user name correctly in conf/oozie-site.xml. You must be sure of the name of your Oozie user, otherwise impersonation will fail.
Then the main point:
restart the Hadoop name node, the data nodes and all other slave daemons, and then it should work.

Couldn’t load oozie service class

org.apache.oozie.service.ServiceException: E0103: Could not load service classes, Cannot create PoolableConnectionFactory

Run Command
 bin/ -extjs libext/  
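
"Cannot create PoolableConnectionFactory" generally means the JDBC connection could not be opened. One thing worth ruling out, as an assumption beyond what is stated above, is a missing MySQL driver jar. `has_mysql_driver` and `LIBEXT_DIR` are names made up for this sketch.

```shell
# Sketch: check whether a MySQL connector jar is present in libext.
has_mysql_driver() {
    # $1 = directory to search, e.g. libext
    ls "$1"/mysql-connector-java*.jar >/dev/null 2>&1
}

LIBEXT_DIR="${LIBEXT_DIR:-libext}"
if has_mysql_driver "$LIBEXT_DIR"; then
    echo "MySQL JDBC driver found in $LIBEXT_DIR"
else
    echo "no mysql-connector-java*.jar in $LIBEXT_DIR" >&2
fi
```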

MySql oozie connection refused

Then change oozie-site.xml

Wednesday, April 23, 2014

Machine Learning - Introduction

While learning Big Data technologies I found that Big Data is a big tool for Machine Learning, so to justify that I need to know what Machine Learning and its technologies are.
So I researched on the internet and made my notes. They are technical notes, but not very tough to understand, and seriously it's Funnnnnnnnn !!!!!!!!!!!!! :-)

Machine Learning is heavily used in analyzing and understanding systems. Google search and Facebook photo tagging are both machine learning.
Database mining is one of the biggest examples of Machine Learning.
Example – web click data, biological data

Another set of examples is applications that cannot be programmed by hand and where human interaction is minimal. Take an autopilot system, where there is no human interaction but the vehicle still drives; that's where learning comes in.
Example – handwriting recognition, NLP, computer vision, facial recognition

Self-Automated System – like Amazon, which understands your needs based on your picks.

What is Machine Learning

The chess program was the main example for the above definition.
So a computer program is said to be learning if its performance improves with experience as time passes.

Consider a Task T: if a user marks an email as spam by clicking the Spam button, the task is classifying whether an email is spam or not.

Suppose the above example shows a graph with X-axis = property size in square feet and Y-axis = price in thousands of dollars.

A customer has a property of 750 square feet and needs to know the price to expect.
The straight red line is one kind of supervised algorithm, which puts the price at 100*1000 USD, while the curved line, i.e. a polynomial fit, puts the expected value at 200*1000 USD. So the answer depends on which supervised algorithm we choose, and this is an example of machine learning.
The example also shows one important point: a regression problem means predicting a continuous-valued output.
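
In symbols (standard regression notation, not taken from the post's figure), the straight line and the curve are two hypotheses fitted to the same (size, price) pairs, and each is then evaluated at x = 750:

```latex
\begin{aligned}
\text{linear:}    \quad & h_\theta(x) = \theta_0 + \theta_1 x \\
\text{quadratic:} \quad & h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 \\
\text{both fitted by minimising} \quad & J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
\end{aligned}
```

Here m is the number of training examples, x is the size in square feet and y is the price. The two predictions differ because the extra term lets the curve bend to follow the data.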

Examples of Supervised and Unsupervised Learning:

Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
(This can be addressed using an unsupervised learning, clustering, algorithm, in which we group patients into different clusters.)

Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals).
(This can be addressed using supervised learning, in which we learn from a training set of audio clips which have been labeled as either having vocals or not)

Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years

(This can be addressed as a supervised learning, classification, problem, where we can learn from a labeled dataset comprising different people's genetic data, and labels telling us if they had developed diabetes.)

Supervised algorithms come in a few types; one of them is as follows-

In the above diagram, the blue crosses mark tumor sizes and the red ones are malignant, i.e. not healthy, which means a cancer risk.
So with supervised learning we can estimate, for a certain value such as the one at the pink arrow, whether the tumor is likely to be malignant or not.
The above scenario has 2 possible outputs, a value of 1 or a value of 0. Predicting such a discrete-valued output is known as Classification.

The other representation of the above problem can be shown like below as well-

From the above diagram we can see that we have some 5 features to work on, and the learning algorithm can answer based on those 5 features. This is how a supervised algorithm works on features for classification.

So our learning algorithm must be able to handle a very large, even infinite, number of features, and that's what a proper learning algorithm is.

Unsupervised Learning

This type of learning doesn’t have labelled datasets, or more precisely no output is defined as in the supervised version (0 or 1). Here, given a dataset, the user asks “can you derive some structure from the data?”.

So here each red circle belongs to a cluster, so this is known as a Clustering Algorithm.

Google News uses an unsupervised clustering algorithm. If we open Google News, a single topic shows several hyperlinks and each opens a different page. So we can consider each heading as a cluster of hyperlinks to different stories on the same topic.

Hadoop HDFS is only a loose analogy here: when we pass data to HDFS, we don’t define which data derives what, we just ask HDFS to store it and later ask what we can learn from the data. Note that HDFS itself splits files into fixed-size blocks across data nodes; it does not run a clustering algorithm.

Other examples are as following -

Examples of Unsupervised Learning
Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
(This can be addressed using an unsupervised learning, clustering, algorithm, in which we group patients into different clusters)

Examine a large collection of emails that are known to be spam email, to discover if there are sub-types of spam mail.

(This can be addressed using a clustering (unsupervised learning) algorithm, to cluster spam mail into sub-types.)

Consider a group discussion where more than one person is speaking at once, so the sound is a mess. Separating these mixed sounds is known as the cocktail party problem. We are told nothing about the data; we just have the sounds and need to extract a result from them, so it’s an unsupervised learning problem.

Lot More ......

Monday, April 21, 2014

Hadoop with apache Sqoop

Apache Sqoop is a utility designed to efficiently transfer bulk data between Hadoop and structured datastores such as relational databases.

For data analytics we have enormous amounts of data, and loading that data into Hadoop and running MR jobs on it by hand every time is extremely tedious and inefficient.  Users must consider details like ensuring consistency of data, the consumption of production system resources, and data preparation for provisioning the downstream pipeline. So a few approaches we can think of-
·        Using scripts – Extremely inefficient, so drop it.
·        Direct access – It puts overhead on the production server and even exposes the production system to the risk of excessive load originating from the cluster.
This is exactly where Apache Sqoop comes in handy.
·        Sqoop allows import and export of data from structured datastores such as RDBMS, NoSQL etc.
·        Sqoop can fetch data from an external system onto HDFS, and populate tables in Hive and HBase.
·        Sqoop with Oozie can be used to schedule the import and export

What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe manner since Sqoop uses the database metadata to infer the data types.
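
As a rough illustration of the slicing idea (not Sqoop's actual splitter code), the sketch below divides an integer key range into roughly equal slices, one per mapper. The column name `id`, the function name and the sample range are assumptions, and it expects the number of mappers to be no larger than the number of keys.

```shell
# Illustration only: split the key range [min, max] into n roughly equal
# slices, the way a map-only import can hand one slice to each mapper.
print_splits() {
    # $1 = min key, $2 = max key, $3 = number of mappers
    min=$1; max=$2; n=$3
    total=$((max - min + 1))
    i=0
    lo=$min
    while [ "$i" -lt "$n" ]; do
        # Spread the remainder over the first slices so sizes differ by at most 1.
        size=$((total / n))
        if [ "$i" -lt $((total % n)) ]; then
            size=$((size + 1))
        fi
        hi=$((lo + size - 1))
        echo "mapper $i: id >= $lo AND id <= $hi"
        lo=$((hi + 1))
        i=$((i + 1))
    done
}

print_splits 1 1000 4
```

Each printed line is the kind of range condition one mapper would be responsible for, so the slices cover the whole table without overlapping.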

Sqoop has 2 main versions-
·        Sqoop 1
·        Sqoop 2 – Compatible only with Hadoop 2


Download the version of Sqoop that is compatible with your Hadoop version –
In our case, the Apache Sqoop version is taken from Cloudera-

-        Download the tar file.
-        Extract it in any directory of the file system.

-        Setting SQOOP_HOME

1.      Open the .bashrc of your user's home, not root's home
2.      In our case, home is /home/Hadoop
3.      nano ~/.bashrc

Add lines like the following sample
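
The original sample is not reproduced here, so this is a minimal sketch of what such lines usually look like. The install location `$HOME/sqoop` is a hypothetical path; point it at wherever you extracted the tar file.

```shell
# Hypothetical install location; change it to your extracted Sqoop directory.
export SQOOP_HOME="$HOME/sqoop"
# Put the sqoop launcher script on the PATH.
export PATH="$PATH:$SQOOP_HOME/bin"
```

After saving the file, run `source ~/.bashrc` (or open a new terminal) so the variables take effect.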

Real output will be like –

-        The Sqoop installation can be tested with
sqoop help





Problems during installation

Version Mismatch with Hadoop
Version Mismatch with Hive
No connector available with Sqoop 1, as Sqoop 2 already provides most of the general-use connectors

Basic Operations

Database Setup

-        Create a database named demoproject in your MySQL server

-        Go to the main command line and import the dump-

mysql -u root -p -h {server ip} {database name} < {dump file path}

-        Now connect to MySQL again and use the database
use demoproject;

-        To check the database, list its tables
show tables;

Sqoop Operations

Importing a table into Hadoop Hive

sqoop import --connect jdbc:mysql://localhost/{databasename} --username {username} --password {password} --table {tablename} --hive-import

Column Name Mismatch error

If any SQL column data type is not supported by Sqoop 1, override the Hive type for that column with --map-column-hive, as in the command-

sqoop import --connect  jdbc:mysql:// --username root --password root --table rt_order --hive-import --map-column-hive ORDER_DATA=string

Setting Number of Map Jobs

If the default number of map tasks does not suit you, or you want to set the count manually, then -m {count} is the option. But more map tasks will not always make the import faster.
sqoop import --connect jdbc:mysql:// --username root --password root --table superbigtable -m 15 --hive-import



Create Sqoop Jobs

sqoop job --create myjob2 -- import --connect jdbc:mysql:// \
  --username root --password root --table rt_t_rtp_proctask_user_bpmagent \
  --hive-import --incremental append \
  --check-column "lastupdated" --last-value "null" -m 1

Run the sqoop Job

sqoop job --exec myjob2