Apache Sqoop is a utility designed to efficiently transfer bulk data between Hadoop and relational databases.
For data analytics we have enormous amounts of data, and loading that data into the Hadoop system and running an MR job on it every time is tedious and inefficient. Users must consider details like ensuring the consistency of the data, the consumption of production system resources, and preparing the data for provisioning downstream pipelines. A few approaches we can think of:
· Using scripts – extremely inefficient, so drop it.
· Direct access – it puts overhead on the production server and even exposes the production system to the risk of excessive load originating from the cluster.
This is exactly where Apache Sqoop comes in handy.
· Sqoop allows import and export of data from structured data stores such as RDBMS, NoSQL stores, etc.
· Sqoop can fetch data from an external system onto HDFS, and populate data in Hive and HBase.
· Sqoop with Oozie can be used to schedule the imports and exports.
What happens underneath the covers when you run Sqoop is very
straightforward. The dataset being transferred is sliced up into different
partitions and a map-only job is launched with individual mappers responsible
for transferring a slice of this dataset. Each record of the data is handled in
a type-safe manner, since Sqoop uses the database metadata to infer the data
types.
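For illustration, a minimal sketch of controlling that slicing yourself (the database, credentials, and split column here are assumptions, reusing the demoproject names that appear later): --split-by names the column whose value range is divided among the mappers, and -m sets how many mappers, i.e. slices, are used.
# hypothetical example: 4 mappers, each importing one slice of the id range
sqoop import --connect jdbc:mysql://localhost/demoproject --username root --password root --table superbigtable --split-by id -m 4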
Sqoop has 2 main versions:
· Sqoop 1
· Sqoop 2 – compatible only with Hadoop 2
Installation:
Download the required version of Sqoop compatible with your Hadoop version –
1. Download the tar file.
2. Extract it in any directory of the file system.
3. Open the .bashrc of your user home (not the root home). In our case the home is /home/Hadoop:
nano ~/.bashrc
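A minimal sketch of the entries to add to ~/.bashrc, assuming the tarball was extracted to /home/Hadoop/sqoop (the path is an assumption; point it to wherever you extracted the archive):
# adjust SQOOP_HOME to your actual extraction directory
export SQOOP_HOME=/home/Hadoop/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Run source ~/.bashrc afterwards so the current shell picks up the new variables.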
The Sqoop installation can be tested with:
sqoop help
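Printing the version is another quick sanity check and helps when diagnosing the compatibility problems listed below:
sqoop version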
Problems during installation
· Version mismatch with Hadoop
· Version mismatch with Hive
· No connector available with Sqoop 1, as Sqoop 2 already provides most of the general-use connectors
Basic Operations
Database Setup
· Create a database named demoproject.
· Now connect to MySQL again and use the database.
· To check the tables in the database, run the command below; a full sketch of these steps follows after it.
show tables;
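A minimal sketch of those steps in the mysql client, assuming a root login on the local machine (host and credentials are assumptions):
# create the database and confirm it is empty; mysql prompts for the password
mysql -u root -p <<'SQL'
CREATE DATABASE demoproject;
USE demoproject;
SHOW TABLES;
SQL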
Sqoop Operations
Importing a table into Hadoop Hive
sqoop import --connect jdbc:mysql://localhost/{databasename} --username {username} --password {password} --table {tablename} --hive-import
Column Data Type Mismatch error
If any SQL column data type is not supported in Sqoop 1, then override its Hive mapping with --map-column-hive:
sqoop import --connect jdbc:mysql://172.18.167.204/RTNG --username root --password root --table rt_order --hive-import --map-column-hive ORDER_DATA=string
Setting the Number of Map Tasks
If the default number of map tasks is not fine with you, or you want to set the count manually, then -m {count} is the option. But a higher number of mappers does not guarantee that your job will always run faster.
sqoop import --connect jdbc:mysql://172.18.166.201/demoproject --username root --password root --table superbigtable -m 15 --hive-import
Create Sqoop Jobs
A saved job stores the parameters of an import so it can be re-run later; the example below also makes the import incremental, appending only rows whose lastupdated column is newer than the stored last value.
sqoop job --create myjob2 -- import --connect jdbc:mysql://172.18.167.204/RTNG --username root --password root --table rt_t_rtp_proctask_user_bpmagent --hive-import --incremental append --check-column "lastupdated" --last-value "null" -m 1
Run the Sqoop job
sqoop job --exec myjob2
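Saved jobs can also be listed and inspected, which is handy for confirming that the stored parameters (including the incremental settings) are what you expect:
sqoop job --list
sqoop job --show myjob2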