Data Engineer working with multiple Big Data technologies and Machine Learning: Hadoop with apache Sqoop

Apache Sqoop

Apache Sqoop is a utility tool designed to efficiently transferring bulk data between Hadoop and other relational Database.

For Data analytics, we have enormous data and loading that data into Hadoop system and doing MR job on it every time is extremely hectic and inefficient. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. So few approaches we can think of-

· Using Scripts – Extremely inefficient, so drop it.

· Direct Access – It gives overhead on production server and even exposes the production system into risk of excessive load originating from the cluster.

This is exactly where Apache Sqoop comes handy.

· Sqoop allows import and export of data from Structured Database such RDBMS, NoSQL etc.

· Sqoop can fetch data from external system on to HDFS, and populate data in HIVE and HBase.

· Sqoop with Oozie can be used to schedule the import and export

What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe manner since Sqoop uses the database metadata to infer the data types.

Sqoop has 2 main versions-

· Sqoop 1

· Sqoop 2 – Compatible only with Hadoop 2

Installation:

Download the required version of scoop compatible with your Hadoop version –

Our case, the Apache Sqoop version is being taken from cloudera-

- Download the tar file.
- Extract the same in any directory of the file system.

- Setting SCOOP_HOME

1. Open the .bashrc of your user Home not root home
2. Our case ,home is /home/Hadoop
3. nano ~/.bashrc

Add the following kind of sample

Real output will be like –

- Sqoop Installation can be tested

Sqoop help

Problems while installation

Version Mismatch with Hadoop
Version Mismatch with Hive

No Connector available with Sqoop 1 as Sqoop 2 already provides most of the general use connectors

Basic Operations

Database Setup

- Create a Database in your mysql

- Create a Database named demoproject

- Go to main commandline and import the dump-

mysql –u –root –p –h {server ip} {database name} < {dump file path}

- Now again connect to sql and use the database

Use demoproject;

- To check the database, check

show Tables;

Sqoop Operations

Importing a table into Hadoop Hive

Sqoop import –connect jdbc:mysql://localhost/{databasename} –username {username} –password {password} –table {tablename} –hive-import

Column Name Mismatch error

If any SQL Column Data Type is not supported in Sqoop 1, then use the command-

--map-column-hive

sqoop import --connect jdbc:mysql://172.18.167.204/RTNG --username root --password root --table rt_order --hive-import --map-column-hive ORDER_DATA=string

Setting Number of Map Jobs

If default Map jobs are not fine with you or you manually want to set Map Jobs count, then –m {count} is the option. But number of more maps won’t make it sure that you process will be faster all the time.

sqoop import --connect jdbc:mysql://172.18.166.201/demoproject --username root --password root --table superbigtable -m 15 --hive-import

Create Sqoop Jobs

sqoop job --create myjob2 -- import --connect jdbc:mysql://172.18.167.204/RTNG
--username root --password root --table rt_t_rtp_proctask_user_bpmagent
--hive-import --incremental

append --check-column "lastupdated" --last-value "null" -m 1

Run the sqoop Job

sqoop job --exec myjob2

Data Engineer working with multiple Big Data technologies and Machine Learning

Monday, April 21, 2014

Hadoop with apache Sqoop