Thursday, September 10, 2015

Apache Spark 1.5 Released

Apache Spark 1.5 has been released and is now available to download.

Major features include:

  • First actual implementation of Project Tungsten.
  • Changes in code generation and performance improvements from Project Tungsten.
  • As covered in my previous post, Spark introduced a new visualization for analyzing SQL and DataFrame queries.
  • Improvements and stability in Spark Streaming; they actually tried to bring batch processing and streaming closer together. Python API for streaming machine learning algorithms: K-Means, linear regression, and logistic regression.
  • Streaming storage included in the web UI. Kafka offsets of direct Kafka streams are available through the Python API. Added Python APIs for Kinesis, MQTT, and Flume.
  • More algorithms for machine learning and analytics: added more Python APIs for distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
  • Find the release notes for Apache Spark -
    And now it's time to use it more & actually use the Python API :-)

Sunday, September 6, 2015

Apache Spark 1.5: interesting new SQL tab in the UI

While exploring the Apache Spark 1.5 developer version and checking out its new features, I found an interesting new tab named SQL :-).

Check the details in the following -

Tuesday, May 12, 2015

Trouble Connecting Apache Spark with Hbase due to missing classes

When you try to connect HBase with Apache Spark, in most cases it throws exceptions at run time, like ImmutableBytesWritableToStringConverter not found, Google's Guava utilities not found, and various other errors.

Almost all of them belong to the same family: missing jars on the classpath.

The straightforward fix: go to spark-defaults.conf and update your spark.driver.extraClassPath with the required libraries, adding jars as further classes come up missing.

For example, for the missing ImmutableBytesWritableToStringConverter, add spark-examples-1.3.1-hadoop2.4.0.jar.

spark.driver.extraClassPath /Users/abhishekchoudhary/anaconda/anaconda/lib/python2.7/site-packages/graphlab/graphlab-create-spark-integration.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-server-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-protocol-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-hadoop2-compat-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-client-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/hbase-common-0.98.6-cdh5.2.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/htrace-core-2.04.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-examples-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/lib/spark-assembly-1.3.1-hadoop2.4.0.jar:/Users/abhishekchoudhary/bigdata/cdh5.2.0/hbase/lib/guava-12.0.1.jar

And one more thing: accessing HBase through Spark is actually very fast, so it works well for real-time updates.

Sunday, May 10, 2015

HBase ignore comma while using bulk loading importTSV

HBase simply drops text while importing it from a CSV file, and the worst part is that it doesn't even inform you.
The entire job passes, but your HBase table ends up with no data or partial data. For example, if a column has the value

"this text can be uploaded , but it has more", then only the part before the comma lands in the HBase table cell, and the rest of the content is gone.
This is because I was importing with the separator set to comma (,), which led the import engine to split at the comma inside the CSV cell.

It took 32 YARN jobs to figure out the actual issue.

Import CSV command -

create 'bigdatatwitter','main','detail','other'

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,other:contributors,main:truncated,main:text,main:in_reply_to_status_id,main:id,main:favorite_count,main:source,detail:retweeted,detail:coordinates,detail:entities,detail:in_reply_to_screen_name,detail:in_reply_to_user_id,detail:retweet_count,detail:id_str,detail:favorited,detail:retweeted_status,other:user,other:geo,other:in_reply_to_user_id_str,other:possibly_sensitive,other:lang,detail:created_at,other:in_reply_to_status_id_str,detail:place,detail:metadata" -Dimporttsv.separator="," bigdatatwitter file:///Users/abhishekchoudhary/PycharmProjects/DeepLearning/AllMix/bigdata3.csv
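The truncation can be illustrated in plain Python (a hypothetical sketch, not ImportTsv's actual code): splitting on a bare comma breaks inside a quoted cell, while a quote-aware CSV parser keeps the cell intact.

```python
# Illustration only: why a bare comma separator mangles quoted CSV fields.
import csv
import io

row = '1,"this text can be uploaded , but it has more",done\n'

# Naive split at every comma, roughly what a plain separator does:
naive = row.strip().split(",")
# Quote-aware parsing with the standard csv module:
proper = next(csv.reader(io.StringIO(row)))

print(naive)   # ['1', '"this text can be uploaded ', ' but it has more"', 'done']
print(proper)  # ['1', 'this text can be uploaded , but it has more', 'done']
```

The naive split produces four columns instead of three, which is why the table ends up with truncated or missing cells.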

Monday, March 30, 2015

Internal of Hadoop Mapper Input to customize


Well, I just got a requirement to somehow change the input split size for the Mapper, and just changing the configuration didn't help much, so I went further and tried to understand exactly what's inside -

So above is the job flow in the Mapper, and the five methods shown have something to do with the split size.

Following is the way an input is processed inside the Mapper -

- The input file is split by the InputFormat class.

- Key-value pairs are generated from each InputSplit using a RecordReader.

- All key-value pairs generated from the same split are sent to the same Mapper, so a single mapper handles every key-value pair from a specific split.

- The map method is called once for each key-value pair, and its output is sent to the partitioner.

- The output from all mappers is collected by the Partitioner.

- The partitioned results are then taken up by the Reducers.
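The routing step can be sketched in a few lines of Python (a toy illustration, not Hadoop's implementation); the key property is that the partitioner sends every occurrence of a key to the same reducer.

```python
# Toy sketch of the mapper -> partitioner routing described above
# (illustration only, not Hadoop's actual HashPartitioner).

def partition(key: str, num_reducers: int) -> int:
    # Hadoop's default is hash-based (hash(key) mod numReducers);
    # a simple byte sum keeps this toy stable across runs.
    return sum(key.encode()) % num_reducers

map_output = [("apple", 1), ("banana", 1), ("apple", 1)]
buckets = {r: [] for r in range(2)}
for key, value in map_output:
    buckets[partition(key, 2)].append((key, value))

# Every pair with the same key lands in the same reducer's bucket.
print(buckets[partition("apple", 2)])  # [('apple', 1), ('apple', 1)]
```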

So I found the class InputFormat where I could introduce my change, based on my requirement.

But checking the exact class helped me even more -

 public interface InputSplit extends Writable {

   /**
    * Get the total number of bytes in the data of the <code>InputSplit</code>.
    * @return the number of bytes in the input split.
    * @throws IOException
    */
   long getLength() throws IOException;

   /**
    * Get the list of hostnames where the input split is located.
    * @return list of hostnames where data of the <code>InputSplit</code> is
    *     located as an array of <code>String</code>s.
    * @throws IOException
    */
   String[] getLocations() throws IOException;
 }

Further, there are a few more things to check, like TextInputFormat, SequenceFileInputFormat, and others.

Hold on... we also have the RecordReader in between, which turns the input split into key-value pairs. What if I need to customize that as well?

We can find implementations of RecordReader in LineRecordReader or SequenceFileRecordReader.

There we can see that an input split sometimes crosses a record boundary, and that situation is handled explicitly, so a custom RecordReader must address it as well.
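That boundary handling can be mimicked in pure Python (a hypothetical sketch, not Hadoop's LineRecordReader code): each split skips a partial first line, because the previous split's reader finishes it, and reads past its own end to complete the last line it started, so every record is processed exactly once.

```python
# Sketch of LineRecordReader-style boundary handling (illustration only).

def read_split(data: bytes, start: int, length: int):
    """Yield the complete lines owned by the split [start, start+length)."""
    pos = start
    if start != 0:
        # Skip the partial first line; the previous split's reader owns it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    end = start + length
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]   # last line of the file, no trailing newline
            return
        yield data[pos:nl]     # may read past `end` to finish a record
        pos = nl + 1

data = b"alpha\nbravo\ncharlie\ndelta\n"
# Split the file at byte 8, mid-way through "bravo".
print(list(read_split(data, 0, 8)))               # [b'alpha', b'bravo']
print(list(read_split(data, 8, len(data) - 8)))   # [b'charlie', b'delta']
```

Note how the first split reads past its nominal end to finish "bravo", and the second split skips the bytes of "bravo" it starts inside.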

Tuesday, February 17, 2015

Things to know before Big Data & Machine Learning

When I started with Big data , I started with Hadoop .
When I started with Machine learning , I started with Linear Regression .

But over time I realized I didn't make the best choice.
Why?

Because I missed the core of the technologies. Even though I managed to finish the job, that doesn't mean I did it right. I missed the gap between learning a technology and knowing it; I missed the fundamentals behind the specifics.

So I personally prefer the following before Big Data and Machine Learning -

Don't be confused... Big Data and Machine Learning are two different things, but they need each other. And you will know it once you do it :-)

Things to Do before Map-Reduce -

  • First understand the Map-Reduce DATA STRUCTURE.
  • Write your own implementation of MapReduce, which is ultra easy without using any framework like Hadoop or Spark.
  • Refresh graph algorithms like DFS and BFS.
  • Explore some basic dynamic programming and greedy algorithms like Knapsack, LCS, Floyd-Warshall, KMP, etc.
  • SQL or RDBMS, or more precisely the data model and relational algebra.
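For the second point, a self-contained word count shows the whole map -> shuffle -> reduce data structure in plain Python (my own toy sketch, no framework involved):

```python
# A toy MapReduce word count: emit (key, value) pairs, group by key,
# then reduce each group. No Hadoop or Spark required.
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Once this clicks, the frameworks are "just" distributed, fault-tolerant versions of these three functions.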

Things to do before Machine Learning -

  • Mathematics 
  • Vectors and scalars
  • Matrix multiplication, addition, and other basic operations
  • Linear formulations
  • Probability, conditional and independent
  • Probability distributions
  • Basics of permutations and combinations
  • Hypothesis testing
  • Very basic statistics like mean, median, standard deviation, variance
  • Regression

OK, it seems like a lot to do before you even start... Well, practically it's not.
Everything is either from high school or college, so ideally you just need to refresh your memory, and it will actually bring back the excitement to start with.
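As a warm-up for a few of the items above, here is a small standard-library refresher (mean, median, standard deviation, and a least-squares fit for simple linear regression):

```python
# Refresher on basic statistics and a one-variable least-squares fit,
# using only the Python standard library.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(statistics.mean(data))    # 5.0
print(statistics.median(data))  # 4.5
print(statistics.pstdev(data))  # 2.0  (population standard deviation)

# Least-squares slope and intercept for y = slope * x + intercept.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(slope, intercept)  # 2.0 1.0
```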

Warning : This list is going to grow further :-)