Spark for Data Engineers

This course explains how Spark interprets code and builds the associated execution plan within Qubole, what can cause sluggishness or job failure, and how to think about managing parallelism.

ABOUT THIS COURSE

LEARNING FORMAT:

Self-paced

DESCRIPTION:

This course is designed to provide you with an understanding of the Spark Execution Model, Parallelism Management, and the Spark Job Server functionality within the Qubole platform.
Estimated time to complete this course: 30 mins.

Recommended Prerequisites:

Spark Execution Model

In this lesson you'll learn about:

  • "Lazy" Spark
  • Spark Transformations
  • Spark Actions
  • RDD Caching & Storage Level
  • The Spark Execution Plan
  • Stage Requirements
  • Spark Shuffle
  • Broadcast variables

Spark Parallelism Management

Managing parallelism in Spark helps optimize processing. In this lesson you'll learn about:

  • Spark Executors, Cores and Tasks
  • Serialization Management
  • Autoscaling
  • Dynamic Allocation
  • Memory Settings

Spark Job Server

Qubole provides a Spark Job Server that enables sharing of Resilient Distributed Datasets (RDDs) in a Spark application among multiple Spark jobs. In this lesson you'll learn about:

  • The Qubole Spark Job Server
  • Zeppelin as a Job Server

Recommended Follow Up:

Curriculum

  • Course Introduction
  • Course Terminology
  • Spark Execution Model
  • Spark Parallelism and Resource Management
  • Spark Cache Behavior
  • Qubole Executor AutoScaling
  • Spark Job Server
  • Course Conclusion
  • Hands on Lab
  • Getting Started w/ Hands on Lab
  • Spark Data Engineer (AWS) - Hands on Lab
  • Spark Data Engineer (GCP) - Hands on Lab
  • Spark Data Engineer (Azure) - Hands on Lab
