Thursday, May 5, 2016

Apache Spark 2.0 Features



New APIs


  • Apache Spark 2.0 can run all 99 TPC-DS queries

  • Apache Spark 2.0 merges DataFrame into Dataset[Row]; DataFrame becomes a type alias for Dataset[Row]

- DataFrames are collections of rows with a schema
- Datasets add static types, e.g. Dataset[Person], bringing compile-time type safety that DataFrames lack
- Both run on Tungsten

In 2.0, DataFrame and Dataset are unified; a DataFrame is simply a Dataset[Row]:
case class Person(email: String, id: Long, name: String)
import spark.implicits._   // brings in the encoder needed by .as[Person]
val ds = spark.read.json("/path/abc/file.json").as[Person]
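
A minimal sketch of what the typed API adds on top of the snippet above (the filter predicate is made up; the lambda is checked at compile time, which a string-based DataFrame expression is not):

// typed operations: field access on Person is verified by the compiler
val longIds = ds.filter(p => p.id > 100L)   // hypothetical predicate
longIds.map(p => p.name).show()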

  • Introducing SparkSession
SparkSession is the "SparkContext" for Dataset/DataFrame.
If you work only with DataFrames/Datasets you can use SparkSession alone, but SparkContext and all the older APIs remain available.


Code demo 2: SparkSession demo
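
A minimal sketch of creating a SparkSession (standard Spark 2.0 API; the app name and local master are made up for the demo):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark2Demo")      // hypothetical app name
  .master("local[*]")         // local mode, just for the demo
  .getOrCreate()

// DataFrame/Dataset work goes through spark; the old SparkContext is still reachable
val sc = spark.sparkContext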

Long term Prospects

  • RDD will remain the low-level API in Spark
  • Datasets & DataFrames give richer semantics and optimizations
  • The DataFrame-based ML pipeline API will become the main MLlib API (see the sketch after this list)
  • ML model & pipeline persistence with almost complete coverage
  • Improved R support
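
A minimal sketch of the DataFrame-based pipeline API with persistence (the tiny training set, column names, and save path are made up; assumes the spark session from the earlier demo):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// hypothetical two-row training DataFrame with "text" and "label" columns
val training = spark.createDataFrame(Seq(
  (0L, "spark rocks", 1.0),
  (1L, "something else", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// 2.0 adds persistence for models and whole pipelines
model.write.overwrite().save("/tmp/spark-lr-model")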


Spark Structured Streaming

Spark 2.0 introduces infinite (unbounded) DataFrames, using the same single API:
logs = ctx.read.format("json").stream("source")

  • A high-level streaming API built on the Spark SQL engine; it extends DataFrames/Datasets and includes all the functions available before
  • Supports interactive & batch queries, e.g. aggregating data in a stream or changing a query at runtime (see the sketch below)
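
A minimal sketch of a streaming aggregation; note the snippet above reflects the pre-release API, while the released 2.0 API uses readStream/writeStream (the path, schema, and console sink here are made up):

import org.apache.spark.sql.types._

// file streams need an explicit schema (hypothetical log schema)
val logSchema = new StructType()
  .add("status", IntegerType)
  .add("path", StringType)

val logs = spark.readStream
  .format("json")
  .schema(logSchema)
  .load("/path/to/logs")

// the same DataFrame operations, now over an unbounded stream
val counts = logs.groupBy("status").count()

// continuously updated result, printed to the console for the demo
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()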

Tungsten phase 2.0

Phase 1 focused on robustness and memory management for Tungsten (off-heap binary storage and cache-aware data structures).

Tungsten phase 2 aims to speed Spark up by roughly 10x through whole-stage code generation, in effect using Spark as a compiler.


Code demo 3: Tungsten phase 2 demo
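
A minimal sketch of the kind of whole-stage code generation demo shown for phase 2 (the row count is arbitrary; assumes the spark session from earlier):

// sum a billion ids: in 2.0 the physical plan shows WholeStageCodegen
// (operators marked with "*" in the explain output)
val df = spark.range(1000L * 1000 * 1000).selectExpr("sum(id)")
df.explain()   // inspect the whole-stage generated plan
df.show()      // runs the generated code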




The Spark 2.0 preview version is expected to release in the second week of May, with the official Spark 2.0 release at the end of May or in June.