This is a follow-on post covering section 2 of the Databricks Data Engineer Professional exam guide: Data Processing. You can read my post on Databricks Tooling here.
Section 2 is Data Processing (Batch processing, Incremental processing and Optimization). This is the largest single section of the exam, accounting for 30% of the total marks available.
I’ll be collecting useful links on each topic, not only from the Databricks blog and documentation but also from the Delta Lake and Spark documentation. The main themes here are partitioning and streaming.
DE Pro Section 2: Data Processing
- Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
Hints – Azure Databricks – Databricks SQL | Microsoft Learn
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
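As a quick illustration, here is a minimal sketch of the four hints using the SQL hint syntax from the Spark docs above (the table and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1_000_000).createOrReplaceTempView("t")

# COALESCE: reduce the partition count without a full shuffle
spark.sql("SELECT /*+ COALESCE(4) */ * FROM t")

# REPARTITION: full shuffle to an exact partition count and/or by columns
spark.sql("SELECT /*+ REPARTITION(200, id) */ * FROM t")

# REPARTITION_BY_RANGE: range-partitions by column(s), useful before sorted writes
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(200, id) */ * FROM t")

# REBALANCE: best-effort even partition sizes; AQE may split or merge partitions
spark.sql("SELECT /*+ REBALANCE(id) */ * FROM t")
```

The same hints are available on the DataFrame API, e.g. `df.hint("coalesce", 4)`.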
- Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
Best practices — Delta Lake Documentation
For liquid clustering as it relates to partitioning: Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering | Databricks Blog
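The short version of the guidance in those links, as a sketch (table and column names are illustrative): partition on a low-cardinality column such as a date, never on a high-cardinality key, and only when each partition will hold a substantial amount of data.

```python
# Illustrative names: a date column keeps partition cardinality low
(spark.table("raw_events")
 .write
 .format("delta")
 .partitionBy("event_date")   # aim for roughly 1 GB or more per partition
 .saveAsTable("events_partitioned"))
```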
- Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files.
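Two common approaches, sketched below with illustrative paths (`df` is an existing DataFrame): fix the part-file count with repartition, or cap the records per file and let Spark split the output.

```python
# Approach 1: fix the number of part-files directly before the write
df.repartition(8).write.format("delta").mode("overwrite").save("/mnt/delta/tbl")

# Approach 2: cap records per file; Spark splits each task's output accordingly
(df.write
   .format("delta")
   .option("maxRecordsPerFile", 1_000_000)
   .mode("overwrite")
   .save("/mnt/delta/tbl"))
```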
- Articulate multiple strategies for updating 1+ records in a Spark table (Type 1)
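A minimal sketch of two strategies using the Delta Lake Python API (the table name, key, and `updates_df` are assumptions for illustration):

```python
from delta.tables import DeltaTable

tbl = DeltaTable.forName(spark, "customers")  # illustrative table name

# Strategy 1: targeted UPDATE for known keys
tbl.update(condition="customer_id = 42",
           set={"email": "'new@example.com'"})

# Strategy 2: MERGE an updates DataFrame; Type 1 means new values
# simply overwrite the old ones, with no history kept
(tbl.alias("t")
 .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```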
- Implement common design patterns unlocked by Structured Streaming and Delta Lake.
Table streaming reads and writes — Delta Lake Documentation
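The core pattern from that doc is a Delta-to-Delta incremental stream with a checkpoint; a minimal sketch with placeholder paths:

```python
# Bronze -> silver: read new Delta commits as a stream, append downstream
(spark.readStream
      .format("delta")
      .load("/mnt/delta/bronze")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/silver")
      .outputMode("append")
      .trigger(availableNow=True)   # process the backlog, then stop
      .start("/mnt/delta/silver"))
```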
- Explore and tune state information using stream-static joins and Delta Lake
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
Simplifying Streaming Data Ingestion into Delta Lake | Databricks Blog
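Note that the static side of a stream-static join keeps no state; for stateful operators the main tuning knob is the watermark, which bounds how long keys live in the state store. A sketch with illustrative names:

```python
from pyspark.sql import functions as F

# State older than the watermark threshold can be evicted by Spark
events = (spark.readStream.format("delta").load("/mnt/delta/events")
          .withWatermark("event_time", "30 minutes"))

counts = events.groupBy(F.window("event_time", "10 minutes"),
                        "device_id").count()
```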
- Implement stream-static joins
Transform data with Delta Live Tables – Azure Databricks | Microsoft Learn
Introducing Stream-Stream Joins in Apache Spark 2.3 | Databricks Blog
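A minimal sketch (paths and join key are illustrative); the static Delta side is re-read at its latest version for every micro-batch, so dimension updates flow through without restarting the stream:

```python
dim = spark.read.format("delta").load("/mnt/delta/dim_products")      # static side
orders = spark.readStream.format("delta").load("/mnt/delta/orders")  # streaming side

enriched = orders.join(dim, "product_id", "left")
```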
- Implement necessary logic for deduplication using Spark Structured Streaming
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
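The pattern documented in the guide is `dropDuplicates` over the key plus the event-time column, bounded by a watermark so state does not grow forever; a sketch with illustrative names:

```python
deduped = (spark.readStream.format("delta").load("/mnt/delta/raw")
           .withWatermark("event_time", "1 hour")
           .dropDuplicates(["event_id", "event_time"]))
```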
- Enable CDF on Delta Lake tables and re-design data processing steps to process CDC output instead of the incremental feed from a normal Structured Streaming read
How to Simplify CDC With Delta Lake’s Change Data Feed | Databricks Blog
Use Delta Lake change data feed on Azure Databricks – Azure Databricks | Microsoft Learn
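Enabling CDF and reading the change feed looks roughly like this (the table name and starting version are illustrative):

```python
spark.sql("""
  ALTER TABLE silver_orders
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Each change row carries _change_type, _commit_version and _commit_timestamp
changes = (spark.readStream
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)
           .table("silver_orders"))
```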
- Leverage CDF to easily propagate deletes
Table deletes, updates, and merges — Delta Lake Documentation
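A hedged sketch of the foreachBatch plus MERGE pattern (table names, key, and paths are assumptions): deletes arriving in the upstream change feed become deletes downstream.

```python
from delta.tables import DeltaTable

def apply_changes(batch_df, batch_id):
    # Ignore update pre-images; act on inserts, post-images and deletes
    batch_df = batch_df.filter("_change_type != 'update_preimage'")
    gold = DeltaTable.forName(spark, "gold_orders")
    (gold.alias("t")
     .merge(batch_df.alias("s"), "t.order_id = s.order_id")
     .whenMatchedDelete(condition="s._change_type = 'delete'")
     .whenMatchedUpdateAll(condition="s._change_type != 'delete'")
     .whenNotMatchedInsertAll(condition="s._change_type != 'delete'")
     .execute())

(spark.readStream.format("delta")
 .option("readChangeFeed", "true")
 .table("silver_orders")
 .writeStream
 .foreachBatch(apply_changes)
 .option("checkpointLocation", "/mnt/checkpoints/gold")
 .start())
```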
- Demonstrate how proper partitioning of data allows for simple archiving or deletion of data
Table streaming reads and writes — Delta Lake Documentation
Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
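When the predicate exactly matches the partition column, Delta can drop whole data files instead of rewriting rows, which is what makes archival cheap; a sketch with illustrative names:

```python
# Optionally archive the old partitions first
(spark.table("events")
 .where("event_date < '2020-01-01'")
 .write.format("delta").mode("append")
 .save("/mnt/archive/events"))

# The predicate covers whole partitions, so files are removed, not rewritten
spark.sql("DELETE FROM events WHERE event_date < '2020-01-01'")
```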
- Articulate how “small file” problems (tiny files, scanning overhead, over-partitioning, etc.) induce performance problems in Spark queries
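On Databricks the usual remedies are compaction and auto-tuned file sizes; a sketch (the table name and Z-ORDER column are illustrative, and the table properties are Databricks-specific):

```python
# Compact small files and co-locate related data for better file skipping
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Let Databricks manage file sizes automatically on write
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = true,
    delta.autoOptimize.autoCompact = true
  )
""")
```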
Other Links
The Spark Structured Streaming programming guide is very good and can be found here:
Structured Streaming Programming Guide – Spark 3.5.1 Documentation (apache.org)
The Databricks blog filtered to the streaming category: Data Streaming | Databricks Blog
The Delta Lake Cheatsheet (databricks.com) is a useful document for reference.
This article isn’t directly related to any single topic but contains some important information:
Handling “Right to be Forgotten” in GDPR and CCPA using Delta Live Tables (DLT) | Databricks Blog