Training@TCWorkshop.com

Apache Spark

Apache Spark Courses

Open-source framework for big data processing. Uses resilient distributed datasets (RDDs). Apache Spark overcame the shortcomings of Hadoop, providing features such as real-time processing, reduced latency, and better abstraction of distributed systems.

Course Name	Hours	List Price	Content
Advanced Data Engineering with Databricks	14.00	$2,000.00
Description: In this course, you will build upon your existing knowledge of Apache Spark, Structured Streaming, and Delta Lake to unlock the full potential of the data lakehouse by utilizing the suite of tools provided by Databricks. This course places a heavy emphasis on designs favoring incremental data processing, enabling systems optimized to continuously ingest and analyze ever-growing data. By designing workloads that leverage built-in platform optimizations, data engineers can reduce the burden of code maintenance and on-call emergencies, and quickly adapt production code to new demands with minimal refactoring or downtime. The topics in this course should be mastered prior to attempting the Databricks Certified Data Engineer Professional exam. Please Contact Us for Class Dates
Optimizing Apache Spark on Databricks	14.00	$2,000.00
Description: In this course, you will cover five key problems that represent the vast majority of performance problems in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With each of these topics, we explore coding examples based on 100 GB to 1+ TB datasets that demonstrate how these problems are introduced, how to diagnose these problems with tools like the Spark UI, and conclude by discussing mitigation strategies for each of these problems. We continue the conversation by looking at a series of key ingestion concepts that promote strategies for processing terabytes of data including managing Spark partition sizes, disk-partitioning, bucketing, z-ordering, and more. With each of these topics, we explore when and how each of these techniques should be implemented, new challenges that productionalizing these solutions might provide along with corresponding mitigation strategies. Finally, we introduce a couple of other key topics such as issues with data locality, IO caching and spark caching, pitfalls of broadcast Joins, and new features like Spark 3s Adaptive Query Execution and Dynamic Partition Pruning. We then conclude the course with discussions and exercises on designing and configuring clusters for optimal performance given specific use cases, personas, the divergent needs of various teams, and cross-team security concerns. Please Contact Us for Class Dates

Brand Courses

Topic Courses

Vendors

Apache Spark

Apache Spark Courses

Our Company

Courses & Certifications

Partners

800.639.3535

Brand Courses

Topic Courses

Vendors

Apache Spark

Apache Spark Courses

Subscribe to get our latest news, course offerings, & deals

Our Company

Courses & Certifications

Partners

800.639.3535