Databricks Data Engineer Professional Data Processing

This is a follow-on post covering section 2 of the Databricks Data Engineer Professional exam guide: Data Processing. You can read my post on Databricks Tooling here.

Section 2 is Data Processing (batch processing, incremental processing, and optimization). This is the largest single section of the exam, accounting for 30% of the total marks available.

I’ll be collecting some useful links on each topic, drawn not only from the Databricks blog and documentation but also from the Delta Lake and Spark documentation. The main themes here are partitioning and streaming.

DE Pro Section 2: Data Processing

  • Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance

Hints – Azure Databricks – Databricks SQL | Microsoft Learn

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html

  • Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)

When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn

Best practices — Delta Lake Documentation

Liquid clustering is also relevant to partitioning: Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering | Databricks Blog

  • Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files.
  • Articulate multiple strategies for updating 1+ records in a Spark table (Type 1)
  • Implement common design patterns unlocked by Structured Streaming and Delta Lake.

Table streaming reads and writes — Delta Lake Documentation
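For Type 1 updates (overwrite in place, no history kept), the usual strategies are a plain `UPDATE`, an overwrite with `replaceWhere`, or a `MERGE`. A hedged MERGE sketch, with table and column names invented:

```sql
-- Sketch (names invented): Type 1 upsert, overwriting matched rows in place.
MERGE INTO customers AS t
USING updates AS u
ON t.customer_id = u.customer_id
WHEN MATCHED THEN UPDATE SET t.email = u.email
WHEN NOT MATCHED THEN INSERT *;
```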

  • Explore and tune state information using stream-static joins and Delta Lake

Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)

Simplifying Streaming Data Ingestion into Delta Lake | Databricks Blog

  • Implement stream-static joins

Transform data with Delta Live Tables – Azure Databricks | Microsoft Learn

Introducing Stream-Stream Joins in Apache Spark 2.3 | Databricks Blog

  • Implement necessary logic for deduplication using Spark Structured Streaming

Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)

  • Enable CDF on Delta Lake tables and re-design data processing steps to process CDC output instead of incremental feed from normal Structured Streaming read

How to Simplify CDC With Delta Lake’s Change Data Feed | Databricks Blog

Use Delta Lake change data feed on Azure Databricks – Azure Databricks | Microsoft Learn
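A sketch of enabling and reading CDF (the table name is invented; versions are placeholders):

```sql
-- Sketch (table name invented): enable the change data feed on a Delta table.
ALTER TABLE silver_orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Batch read of changes between two table versions:
SELECT * FROM table_changes('silver_orders', 2, 5);

-- Streaming read of changes, PySpark equivalent:
-- spark.readStream.option("readChangeFeed", "true").table("silver_orders")
```

The CDF output adds `_change_type`, `_commit_version`, and `_commit_timestamp` columns, which is what lets downstream steps react to updates and deletes rather than just appends.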

  • Leverage CDF to easily propagate deletes

Table deletes, updates, and merges — Delta Lake Documentation
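One way to propagate deletes with CDF is to filter the change feed to `_change_type = 'delete'` and MERGE it downstream — a sketch with invented table and key names:

```sql
-- Sketch (names invented): push deletes captured by CDF on a silver table
-- down to a gold table with MERGE.
MERGE INTO gold_orders AS t
USING (
  SELECT * FROM table_changes('silver_orders', 10)
  WHERE _change_type = 'delete'
) AS c
ON t.order_id = c.order_id
WHEN MATCHED THEN DELETE;
```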

  • Demonstrate how proper partitioning of data allows for simple archiving or deletion of data

Table streaming reads and writes — Delta Lake Documentation

Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
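If the table is partitioned on a date column as above, retention deletes only touch the matching partitions — a sketch with invented names:

```sql
-- Sketch: with the table partitioned by event_date, this retention delete
-- only rewrites files in the affected partitions.
DELETE FROM events WHERE event_date < date_sub(current_date(), 90);

-- VACUUM then physically removes the deleted files once the retention
-- window has passed:
VACUUM events;
```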

  • Articulate how “small files” (tiny files, scanning overhead, partitioning, etc.) induce performance problems in Spark queries

How Databricks improved query performance by up to 2.2x by automatically optimizing file sizes | Databricks Blog
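The usual remedies for small files on Delta are compaction and write-time optimization — sketched here with an invented table name:

```sql
-- Sketch: compact small files; ZORDER co-locates related rows for data skipping.
OPTIMIZE events
WHERE event_date >= '2024-01-01'
ZORDER BY (user_id);

-- Or reduce small files at write time via a table property:
ALTER TABLE events SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true);
```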

Other Links

The Spark Structured Streaming programming guide is very good and can be found here:

Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)

Databricks blog filtered to streaming category Data Streaming | Databricks Blog

The Delta Lake Cheatsheet (databricks.com) is a useful document for reference.

This article isn’t directly related to any single section but contains some important information:

Handling “Right to be Forgotten” in GDPR and CCPA using Delta Live Tables (DLT) | Databricks Blog
