Spark Performance Tuning For Data Engineers: Part1 - Storage
Published 5/2025
MP4 | Video: h264, 1920x1080 | Audio: AAC, 44.1 KHz
Language: English | Size: 1.36 GB | Duration: 3h 23m
Data Engineering & Apache Spark Optimization Techniques on Databricks to Boost Speed, Reduce Cost & Handle Big Data
What you'll learn
Hands-on demos based on different scenarios and use cases
Learn the nuances of Spark performance tuning
Get detailed insights into different operations in Spark
Get a clear understanding of how Spark configs work hand in hand and which combinations give optimal results
Learn to identify and solve bottlenecks and errors in your Spark application
Requirements
Basic Spark Architecture & internals
Spark programming in PySpark or Scala
Databricks Cloud Platform
Description
Unlock the true potential of Apache Spark by mastering storage-related performance tuning techniques. This hands-on course is packed with real-world scenarios, guided demos, and practical use cases that will help you fine-tune Spark storage strategies for speed, efficiency, and scalability.
This course is perfect for intermediate Data Engineers and Spark Developers, as well as aspiring Architects, who want to optimize Spark jobs, reduce resource costs, and ensure fast, reliable performance for large-scale data applications.
What You’ll Learn
1. Understand how Apache Spark handles storage internally: memory vs disk
2. Learn when and how to use Spark caching and persistence effectively
3. Compare and choose the right storage levels: MEMORY_ONLY, MEMORY_AND_DISK, etc.
4. Use real-world examples and hands-on demos to benchmark storage decisions
5. Learn how to monitor storage metrics using the Spark UI
6. Handle memory spills, disk I/O bottlenecks, and storage tuning in cluster environments
7. Apply best practices for storage optimization in cloud and on-prem Spark clusters
Why Take This Course?
100% Hands-on: Focused on practical implementation, not just theory
Designed for Data Engineers, Spark Developers, and Big Data Practitioners
Covers both foundational concepts and advanced tuning techniques
Teaches how to measure performance gains using real metrics
Helps you make cost-efficient decisions for big data storage
Tools & Technologies Covered
Apache Spark (2.x and 3.x)
Databricks
Spark UI
HDFS, Data Lake (for storage scenarios)
Overview
Section 1: Introduction
Lecture 1 Introduction
Lecture 2 What is Optimization
Lecture 3 What is Benchmarking
Section 2: Important Concepts
Lecture 4 Spark High Level Architecture
Lecture 5 Spark Job Execution
Lecture 6 Reading Spark UI
Lecture 7 Physical Plans & DAG - Part 1
Lecture 8 Physical Plans & DAG - Part 2
Section 3: Optimizing Storage
Lecture 9 Schema Inference Problem
Lecture 10 Reuse DataFrame
Lecture 11 Column Elimination
Lecture 12 Row Elimination
Lecture 13 Directory Scan Problem
Lecture 14 Optimal File Size
Lecture 15 Haystack Query
Who this course is for
Data Engineers and Spark Developers, as well as aspiring Architects, curious about advanced techniques of performance tuning and optimization