Apache Spark 2.0 Features
New APIs
- Apache Spark 2.0 can run all 99 TPC-DS queries
- Apache Spark 2.0 will merge DataFrame into Dataset[Row] (DataFrame becomes a type alias for Dataset[Row])
- DataFrames are collections of rows with a schema
- Datasets add static typing, e.g. Dataset[Person], which brings compile-time type safety on top of DataFrames
- Both run on Tungsten
In 2.0 the DataFrame and Dataset APIs will unify:
import spark.implicits._   // needed for the .as[Person] conversion
case class Person(email: String, id: Long, name: String)
val ds = spark.read.json("/path/abc/file.json").as[Person]
- Introducing SparkSession
SparkSession is the "SparkContext" for Dataset/DataFrame.
So anyone working only with DataFrames/Datasets can simply use SparkSession, while SparkContext and all the older APIs will remain available.
Code demo 1: Dataset type-safety demo
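A minimal sketch of what this demo might have looked like; the misspelled field name is a hypothetical example used only to show the compile-time check, and the file path is reused from above.

// `spark` is the SparkSession provided by the 2.0 spark-shell (or built as in demo 2)
import spark.implicits._

// Person case class as defined above
case class Person(email: String, id: Long, name: String)

// Typed Dataset: fields are checked at compile time
val people = spark.read.json("/path/abc/file.json").as[Person]

// Compiles: `name` is a field of Person
people.filter(p => p.name.nonEmpty).show()

// Does NOT compile: Person has no field `nmae`
// people.filter(p => p.nmae.nonEmpty)

// Untyped DataFrame: the same typo only fails at runtime with an AnalysisException
val df = spark.read.json("/path/abc/file.json")
// df.filter("nmae != ''").show()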
Code demo 2: SparkSession demo
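A possible sketch of the SparkSession demo; the app name, master setting, and temp view name are illustrative assumptions, not taken from the talk.

import org.apache.spark.sql.SparkSession

// One entry point for DataFrames, Datasets and SQL
val spark = SparkSession.builder()
  .appName("SparkSessionDemo")     // hypothetical app name
  .master("local[*]")              // illustrative; normally set by spark-submit
  .getOrCreate()

// DataFrame/Dataset and SQL work goes through the session directly
val numbers = spark.range(0, 10).toDF("id")
numbers.createOrReplaceTempView("numbers")
spark.sql("SELECT count(*) AS n FROM numbers").show()

// The older SparkContext is still available underneath when needed
val sc = spark.sparkContext
println(sc.version)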
Long-term Prospects
- RDD will remain the low-level API in Spark
- Datasets & DataFrames give richer semantics and optimizations
- The DataFrame-based ML pipeline API will become the primary MLlib API (see the sketch after this list)
- ML model & pipeline persistence with almost complete coverage
- Improved R support
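A rough sketch of a DataFrame-based pipeline with persistence, using the standard spark.ml API; the stages, toy data, column names, and save path are assumptions for illustration only.

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// `spark` is a SparkSession; the toy data, columns and path are made up
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "hadoop map reduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// A DataFrame-in, DataFrame-out pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Persist and reload the fitted pipeline
model.write.overwrite().save("/tmp/spark-lr-pipeline")
val reloaded = PipelineModel.load("/tmp/spark-lr-pipeline")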
Spark Structured Streaming
Spark 2.0 introduces infinite DataFrames, using the same single API for batch and streaming:
logs = ctx.read.format("json").stream("source")
- High-level streaming API built on the Spark SQL engine, which extends DataFrames/Datasets and includes all the functions available before
- Supports interactive & batch queries, e.g. aggregating data in a stream, changing queries at runtime, etc. (see the sketch below)
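The snippet above uses the preview-era .stream() call; a minimal sketch in the readStream/writeStream form that the API later settled on might look like this, where the directory path, schema, and aggregation are assumptions.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// File-based streaming sources need an explicit schema (made up here)
val logSchema = StructType(Seq(
  StructField("level", StringType),
  StructField("message", StringType)
))

// An "infinite DataFrame": new JSON files in the directory keep feeding the query
val logs = spark.readStream
  .format("json")
  .schema(logSchema)
  .load("/path/to/logs")

// The same DataFrame operations as in batch, here a running aggregate
val countsByLevel = logs.groupBy("level").count()

// Continuously write the updated counts to the console
val query = countsByLevel.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()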
Tungsten Phase 2
Phase 1 focused on robustness and memory management for Tungsten.
Tungsten Phase 2 speeds up Spark by up to 10x by using Spark as a compiler (whole-stage code generation).
Code demo 3: Tungsten Phase 2 demo
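A plausible sketch of this demo; the one-billion-row count is an assumption, and whole-stage code generation shows up as WholeStageCodegen nodes in the physical plan.

// `spark` is the SparkSession from the earlier demos
// Sum a billion longs; in 2.0 the whole query is compiled into one generated function
val billion = spark.range(1000L * 1000 * 1000)

// The physical plan shows the operators fused under WholeStageCodegen
billion.selectExpr("sum(id)").explain()

// Rough, illustrative timing of the aggregation
val start = System.nanoTime()
billion.selectExpr("sum(id)").show()
println(s"sum took ${(System.nanoTime() - start) / 1e9} seconds")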
The Spark 2.0 preview is expected in the second week of May, with the official Spark 2.0 release at the end of May or in June.