Data Engineer working with multiple Big Data technologies and Machine Learning: Internal of Hadoop Mapper Input to customize

Internal of Hadoop Mapper Input

I recently got a requirement to change the input split size for the Mapper. Changing the configuration alone didn't help much, so I dug further to understand exactly what goes on inside.

Above is the job flow in the Mapper, and the 5 methods in it all have a direct bearing on split size.

This is how an input is processed on its way through the Mapper -

- The input file is divided into splits by the InputFormat class

- A RecordReader generates key-value pairs from each InputSplit

- All key-value pairs generated from the same split are sent to the same Mapper, so one dedicated Mapper handles every key-value pair of a given split

- The map method is called once for each key-value pair, and its output is sent to the Partitioner

- The Partitioner collects the output of every Mapper and decides which Reducer each key goes to

- The partitioned output is then consumed by the Reducers
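The flow in the list can be sketched in plain Java, without any Hadoop classes. This is only a toy simulation under my own assumptions: each inner list stands for one split, each string for one record, the map step is a word count, and the class and method names (MapperFlowSketch, run, partition) are made up for illustration. The partition method follows the same idea as Hadoop's default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy simulation of split -> record -> map -> partition -> reduce.
// Hypothetical names throughout; no Hadoop classes are used.
final class MapperFlowSketch {

    // Hash the key, mask the sign bit, take the remainder by the reducer count.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one (word -> count) map per reducer.
    static List<Map<String, Integer>> run(List<List<String>> splits, int numReducers) {
        List<Map<String, List<Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        for (List<String> split : splits) {        // one map task per split
            for (String line : split) {            // record reader: one record per line
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    // map() emits (word, 1); the partitioner picks the reducer
                    buckets.get(partition(word, numReducers))
                           .computeIfAbsent(word, k -> new ArrayList<>())
                           .add(1);
                }
            }
        }
        List<Map<String, Integer>> output = new ArrayList<>();
        for (Map<String, List<Integer>> bucket : buckets) {   // one reduce task per bucket
            Map<String, Integer> counts = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : bucket.entrySet()) {
                int sum = 0;
                for (int one : e.getValue()) {
                    sum += one;
                }
                counts.put(e.getKey(), sum);
            }
            output.add(counts);
        }
        return output;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a b", "a"), Arrays.asList("b a"));
        System.out.println(run(splits, 2));
    }
}
```

Every occurrence of a word lands in the same reducer bucket no matter which split it came from, which is the point of the partitioning step.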

So I settled on the InputFormat class as the place to introduce my change, since that is where my requirement fits.
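To see why a configuration change alone may not move the split size, it helps to look at how the size is derived. The sketch below is plain Java and reflects my understanding of the newer org.apache.hadoop.mapreduce FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)) and the last chunk may run up to 10% over. The class and method names here are hypothetical, and the older mapred API shown in this post computes it slightly differently, so treat this as an assumption to verify against your Hadoop version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the split-size arithmetic; not Hadoop code.
final class SplitSizeSketch {

    // The last split may be up to 10% larger than splitSize (assumed slop factor).
    static final double SPLIT_SLOP = 1.1;

    // minSize and maxSize correspond to the job's minimum and maximum split
    // size settings; blockSize is the HDFS block size of the file.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs covering a file of the given length.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = computeSplitSize(128 * mb, 1L, 64 * mb);
        for (long[] s : computeSplits(300 * mb, size)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With this formula, lowering the maximum below the block size shrinks the splits, and raising the minimum above the block size grows them. A minimum that stays under the block size has no effect, which would explain a configuration change that appears to do nothing.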

Looking more closely at the InputSplit interface itself helped even more -

 @Deprecated  
 public interface InputSplit extends Writable {  
  /**  
   * Get the total number of bytes in the data of the <code>InputSplit</code>.  
   *   
   * @return the number of bytes in the input split.  
   * @throws IOException  
   */  
  long getLength() throws IOException;  
  /**  
   * Get the list of hostnames where the input split is located.  
   *   
   * @return list of hostnames where data of the <code>InputSplit</code> is  
   *     located as an array of <code>String</code>s.  
   * @throws IOException  
   */  
  String[] getLocations() throws IOException;  
 }

Beyond that, there are a few more classes worth checking, such as TextInputFormat, SequenceFileInputFormat and others.

Hold on. In between we have the RecordReader, which turns each split into key-value pairs. What if my change belongs there instead?

RecordReader.java Interface

We can find implementations of RecordReader in LineRecordReader and SequenceFileRecordReader.

There we can see that a record sometimes crosses a split boundary and that this situation is handled explicitly, so a custom RecordReader must address it as well.
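The boundary handling can be illustrated with a small plain-Java sketch. It mimics, as I understand it, the convention LineRecordReader follows: a reader whose split does not start at offset 0 skips its first line, and every reader keeps reading as long as the next line starts at or before the end of its split. The class and method names (SplitBoundarySketch, readSplit, endOfLine) are hypothetical, and the file is just a byte array in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of line records crossing split boundaries.
final class SplitBoundarySketch {

    // Index just past the next '\n' at or after pos, or data.length if none.
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Returns the lines that the reader of split [start, start + length) owns.
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // The previous split's reader finishes this line, so skip it here.
            pos = endOfLine(data, pos);
        }
        List<String> records = new ArrayList<>();
        // "<=" on purpose: a line starting exactly at 'end' belongs to this
        // split, because the next reader always skips its first line.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start < data.length; start += 10) {
            int length = Math.min(10, data.length - start);
            System.out.println(start + ": " + readSplit(data, start, length));
        }
    }
}
```

The two rules only work as a pair: skipping the first line without reading past the end would lose records, and reading past the end without skipping would duplicate them. A custom RecordReader for a different record format needs an equivalent pair of rules.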


Monday, March 30, 2015
