Data Engineer working with multiple Big Data technologies and Machine Learning: Apache Spark 2.0 Features

Apache Spark 2.0 Features

New APIs

Apache Spark 2.0 can run all 99 TPC-DS queries

Apache Spark 2.0 will merge DataFrame to DataSet[Row]

- DataFrames are collections of rows with a schema
- Datasets add static types, eg. DataSet[Person], actually brings type safety over DataFrame
- Both run on Tungsten

in 2.0 DataFrame and DataSets will unify

case class Person(email: String, id : Long, name: String)
val ds = spark.read.json("/path/abc/file.json").as[Person]

Introducing SparkSession

SparkSession is the "SparkContext" for Dataset/DataFrame.

So if anyone using only DataFrame/DataSet, can simply use SparkSession but SparkContext and all older APi will remain

Code demo1 DataSets Type safety demo

Code demo2 Spark Session demo

Long term Prospects

RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
DataFrame based ML pipeline API will become the main MLLib API
ML model & pipeline persistance with almost complete coverage
Improved R Support

Spark Structured Streaming

Spark 2.0 is introducing Spark Infinite DataFrames using the same single API

logs = ctx.read.format("json").stream("source")

High-level streaming API build on Spark SQL Engine which extends DataFrames/ Datasets and includes all functions available in the past
Supports interactive & batch queries like aggregate data in a stream , change queries in runtime etc.

Tungsten phase 2.0

Phase 1 is basically for robustness and Memory Management for Tungsten

Tungsten phase 2.0 speed up Spark by 10X by using Spark as Compiler

Code demo3 Tungsten Phase 2 demo

2nd Week of May , Spark 2.0 preview version will release and May end or June will have official version of Spark 2.0

Data Engineer working with multiple Big Data technologies and Machine Learning

Thursday, May 5, 2016

Apache Spark 2.0 Features

Apache Spark 2.0 Features

New APIs

Long term Prospects

Spark Structured Streaming

Tungsten phase 2.0

No comments:

Python Java BigData Machine Learning Data Mining Developer