This is a follow-on post covering section 2 of the Databricks Data Engineer Professional exam guide: Data Processing. You can read my post on Databricks Tooling here.
Section 2 is Data Processing (Batch processing, Incremental processing and Optimization). This is the largest single section of the exam, accounting for 30% of the total marks available.
I’ll be collecting useful links on each topic, not only from the Databricks blog and documentation but also from the Delta Lake and Spark documentation. The main themes here are partitioning and streaming.
DE Pro Section 2: Data Processing
- Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
Hints – Azure Databricks – Databricks SQL | Microsoft Learn
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
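As a quick illustration, here is a minimal sketch of the four hints using the SQL hint syntax from the Spark docs above (the table and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1_000_000).createOrReplaceTempView("t")

# COALESCE: reduce the partition count without a full shuffle
spark.sql("SELECT /*+ COALESCE(4) */ * FROM t")

# REPARTITION: full shuffle to an exact partition count and/or by columns
spark.sql("SELECT /*+ REPARTITION(200, id) */ * FROM t")

# REPARTITION_BY_RANGE: range-partitions by column(s), useful before sorted writes
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(200, id) */ * FROM t")

# REBALANCE: best-effort even partition sizes; AQE may split or merge partitions
spark.sql("SELECT /*+ REBALANCE(id) */ * FROM t")
```

The same hints are available on the DataFrame API, e.g. `df.hint("coalesce", 4)`.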
- Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
Best practices — Delta Lake Documentation
For liquid clustering as it relates to partitioning: Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering | Databricks Blog
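The short version of the guidance in those links, as a sketch (table and column names are illustrative): partition on a low-cardinality column such as a date, never on a high-cardinality key, and only when each partition will hold a substantial amount of data.

```python
# Illustrative names: a date column keeps partition cardinality low
(spark.table("raw_events")
 .write
 .format("delta")
 .partitionBy("event_date")   # aim for roughly 1 GB or more per partition
 .saveAsTable("events_partitioned"))
```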
- Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files.
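Two common approaches, sketched below with illustrative paths (`df` is an existing DataFrame): fix the part-file count with repartition, or cap the records per file and let Spark split the output.

```python
# Approach 1: fix the number of part-files directly before the write
df.repartition(8).write.format("delta").mode("overwrite").save("/mnt/delta/tbl")

# Approach 2: cap records per file; Spark splits each task's output accordingly
(df.write
   .format("delta")
   .option("maxRecordsPerFile", 1_000_000)
   .mode("overwrite")
   .save("/mnt/delta/tbl"))
```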
- Articulate multiple strategies for updating 1+ records in a Spark table (Type 1)
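A minimal sketch of two strategies using the Delta Lake Python API (the table name, key, and `updates_df` are assumptions for illustration):

```python
from delta.tables import DeltaTable

tbl = DeltaTable.forName(spark, "customers")  # illustrative table name

# Strategy 1: targeted UPDATE for known keys
tbl.update(condition="customer_id = 42",
           set={"email": "'new@example.com'"})

# Strategy 2: MERGE an updates DataFrame; Type 1 means new values
# simply overwrite the old ones, with no history kept
(tbl.alias("t")
 .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```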
- Implement common design patterns unlocked by Structured Streaming and Delta Lake.
Table streaming reads and writes — Delta Lake Documentation
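The core pattern from that doc is a Delta-to-Delta incremental stream with a checkpoint; a minimal sketch with placeholder paths:

```python
# Bronze -> silver: read new Delta commits as a stream, append downstream
(spark.readStream
      .format("delta")
      .load("/mnt/delta/bronze")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/silver")
      .outputMode("append")
      .trigger(availableNow=True)   # process the backlog, then stop
      .start("/mnt/delta/silver"))
```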
- Explore and tune state information using stream-static joins and Delta Lake
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
Simplifying Streaming Data Ingestion into Delta Lake | Databricks Blog
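Note that the static side of a stream-static join keeps no state; for stateful operators the main tuning knob is the watermark, which bounds how long keys live in the state store. A sketch with illustrative names:

```python
from pyspark.sql import functions as F

# State older than the watermark threshold can be evicted by Spark
events = (spark.readStream.format("delta").load("/mnt/delta/events")
          .withWatermark("event_time", "30 minutes"))

counts = events.groupBy(F.window("event_time", "10 minutes"),
                        "device_id").count()
```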
- Implement stream-static joins
Transform data with Delta Live Tables – Azure Databricks | Microsoft Learn
Introducing Stream-Stream Joins in Apache Spark 2.3 | Databricks Blog
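A minimal sketch (paths and join key are illustrative); the static Delta side is re-read at its latest version for every micro-batch, so dimension updates flow through without restarting the stream:

```python
dim = spark.read.format("delta").load("/mnt/delta/dim_products")      # static side
orders = spark.readStream.format("delta").load("/mnt/delta/orders")  # streaming side

enriched = orders.join(dim, "product_id", "left")
```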
- Implement necessary logic for deduplication using Spark Structured Streaming
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
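The pattern documented in the guide is `dropDuplicates` over the key plus the event-time column, bounded by a watermark so state does not grow forever; a sketch with illustrative names:

```python
deduped = (spark.readStream.format("delta").load("/mnt/delta/raw")
           .withWatermark("event_time", "1 hour")
           .dropDuplicates(["event_id", "event_time"]))
```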
- Enable CDF on Delta Lake tables and re-design data processing steps to process CDC output instead of the incremental feed from a normal Structured Streaming read
How to Simplify CDC With Delta Lake’s Change Data Feed | Databricks Blog
Use Delta Lake change data feed on Azure Databricks – Azure Databricks | Microsoft Learn
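Enabling CDF and reading the change feed looks roughly like this (the table name and starting version are illustrative):

```python
spark.sql("""
  ALTER TABLE silver_orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Each change row carries _change_type, _commit_version and _commit_timestamp
changes = (spark.readStream
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)
           .table("silver_orders"))
```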
- Leverage CDF to easily propagate deletes
Table deletes, updates, and merges — Delta Lake Documentation
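A hedged sketch of the foreachBatch plus MERGE pattern (table names, key, and paths are assumptions): deletes arriving in the upstream change feed become deletes downstream.

```python
from delta.tables import DeltaTable

def apply_changes(batch_df, batch_id):
    # Ignore update pre-images; act on inserts, post-images and deletes
    batch_df = batch_df.filter("_change_type != 'update_preimage'")
    gold = DeltaTable.forName(spark, "gold_orders")
    (gold.alias("t")
     .merge(batch_df.alias("s"), "t.order_id = s.order_id")
     .whenMatchedDelete(condition="s._change_type = 'delete'")
     .whenMatchedUpdateAll(condition="s._change_type != 'delete'")
     .whenNotMatchedInsertAll(condition="s._change_type != 'delete'")
     .execute())

(spark.readStream.format("delta")
 .option("readChangeFeed", "true")
 .table("silver_orders")
 .writeStream
 .foreachBatch(apply_changes)
 .option("checkpointLocation", "/mnt/checkpoints/gold")
 .start())
```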
- Demonstrate how proper partitioning of data allows for simple archiving or deletion of data
Table streaming reads and writes — Delta Lake Documentation
Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
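When the predicate exactly matches the partition column, Delta can drop whole data files instead of rewriting rows, which is what makes archival cheap; a sketch with illustrative names:

```python
# Optionally archive the old partitions first
(spark.table("events")
 .where("event_date < '2020-01-01'")
 .write.format("delta").mode("append")
 .save("/mnt/archive/events"))

# The predicate covers whole partitions, so files are removed, not rewritten
spark.sql("DELETE FROM events WHERE event_date < '2020-01-01'")
```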
- Articulate how “small file” problems (tiny files, scanning overhead, over-partitioning, etc.) induce performance problems in Spark queries
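On Databricks the usual remedies are compaction and auto-tuned file sizes; a sketch (the table name and Z-ORDER column are illustrative, and the table properties are Databricks-specific):

```python
# Compact small files and co-locate related data for better file skipping
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Let Databricks manage file sizes automatically on write
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true
  )
""")
```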
Other Links
The Spark Structured Streaming programming guide is very good and can be found here:
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
The Databricks blog filtered to the streaming category: Data Streaming | Databricks Blog
The Delta Lake Cheatsheet (databricks.com) is a useful document for reference.
This article isn’t directly related to any single topic but contains some important information:
Handling “Right to be Forgotten” in GDPR and CCPA using Delta Live Tables (DLT) | Databricks Blog