Databricks Data Engineer Professional Databricks Tooling

I’ve collected some links for the Databricks tooling part of the Databricks Certified Data Engineer Professional exam.

Having completed the videos and notebooks a while back, I wanted to prepare for the exam, but I prefer reading to watching videos. The exam guide can be found here and full information on the exam is here. The exam guide was updated on 1st April 2024, so be sure to check for changes when you are preparing to take this certification.

From the exam outline we have 6 sections, and section 1 is on Databricks Tooling. I’ve tried to use some of the excellent pages from the Databricks blog and have also included links to the documentation. The main page of the blog can be accessed here; it contains many great posts on all topics in the Databricks world, and you can filter by data engineering.

Delta Lake is the main topic for this section and has its own site with documentation and a blog.

DE Pro Section 1: Databricks Tooling

This section accounts for 20% of the available marks on the exam.

  • Explain how Delta Lake uses the transaction log and cloud object storage to guarantee atomicity and durability
  • Describe how Delta Lake’s Optimistic Concurrency Control provides isolation, and which transactions might conflict

This excellent post covers the first two objectives in this exam section:

Understanding the Delta Lake Transaction Log – Databricks Blog
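To make the idea of an ordered commit log with optimistic concurrency more concrete, here is a minimal toy sketch in plain Python. It is not Delta Lake's implementation: the helper names (commit, latest_version, commit_with_retry) are made up for illustration, a local temp directory stands in for cloud object storage, and opening a file in exclusive-create mode stands in for the put-if-absent guarantee the storage layer has to provide. Only the zero-padded, numbered JSON file names mirror the real _delta_log convention. Real Delta Lake also checks whether the winning commit logically conflicts with what the losing writer read before retrying; that step is left out here.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to publish `actions` as commit number `version`.

    Mode "x" fails if the file already exists, so only one writer
    can ever create a given version (toy stand-in for put-if-absent).
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False  # another writer claimed this version first


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistically pick the next version and retry if we lose the race."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent writers")


log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, [{"add": "part-0000.parquet"}])

# Two writers both saw version 0 and both try to write version 1.
writer_a = commit(log_dir, 1, [{"add": "part-0001.parquet"}])
writer_b = commit(log_dir, 1, [{"add": "part-0002.parquet"}])

# The loser re-reads the log and retries at the next free version.
retried = commit_with_retry(log_dir, [{"add": "part-0002.parquet"}])
print(first, writer_a, writer_b, retried)
```

The point to take away for the exam is that a commit either appears in the log as a whole file or not at all, and that two writers can never both own the same version number.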

  • Describe basic functionality of Delta clone.

This link goes beyond the differences between deep and shallow clones and describes some of the use cases and scenarios where each may be used.

How to Easily Clone Your Delta Lake Data Tables with Databricks – The Databricks Blog

  • Apply common Delta Lake indexing optimizations including partitioning, zorder, bloom filters, and file sizes

Optimizations — Delta Lake Documentation

When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn

The next link includes an image that is a great way to understand the basic concepts of Z-ordering.

Processing Petabytes of Data in Seconds with Databricks Delta | Databricks Blog
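If you want to play with the idea yourself, the sketch below is a small, self-contained illustration of Z-ordering combined with min/max data skipping. It uses a Morton curve (bit interleaving) on a 4x4 grid of points and splits the sorted rows into four pretend "files". The function names and the tiny dataset are my own assumptions for illustration, and this is not a claim about how Databricks implements ZORDER BY internally.

```python
def interleave_bits(x, y, bits=4):
    """Return the Z-order (Morton) code of two small non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def split_into_files(rows, rows_per_file=4):
    """Chunk an ordered list of rows into fixed-size pretend files."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_read(files, column, value):
    """Count files whose min/max range on `column` could contain `value`."""
    count = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


points = [(x, y) for x in range(4) for y in range(4)]

# Layout 1: plain sort by x, then y.
linear_files = split_into_files(sorted(points))

# Layout 2: sort by the interleaved Z-order code of (x, y).
z_files = split_into_files(sorted(points, key=lambda p: interleave_bits(*p)))

for name, files in [("linear", linear_files), ("z-order", z_files)]:
    print(name,
          "x == 2 reads", files_to_read(files, 0, 2), "files;",
          "y == 2 reads", files_to_read(files, 1, 2), "files")
```

With the plain sort, a filter on x touches one file but a filter on y has to touch all four. With the Z-order layout, either filter touches two of the four files, which is the trade-off the image in the blog post illustrates.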

If you have not encountered bloom filters before, they are a very interesting data structure. It’s worth investigating them further; you can start here or here.

https://learn.microsoft.com/en-us/azure/databricks/optimizations/bloom-filters
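For anyone who likes to see a data structure in code, here is a minimal bloom filter sketch in plain Python. The class name, the bit array size and the use of SHA-256 with a seed prefix are assumptions chosen for readability, not what Databricks uses for its bloom filter indexes. The property that matters is the same, though: the filter can answer "definitely not present" or "possibly present", and never gives a false negative.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: k hash positions per item in a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several hash positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
added = [f"customer-{i}" for i in range(20)]
for key in added:
    bf.add(key)

print(all(bf.might_contain(key) for key in added))  # added keys are always found
print(bf.might_contain("customer-9999"))            # usually False, occasionally a false positive
```

In a data skipping context, a filter like this kept per file lets a query engine skip any file where might_contain returns False for the value being looked up.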

Data skipping for Delta Lake – Azure Databricks | Microsoft Learn

Configure Delta Lake to control data file size – Azure Databricks | Microsoft Learn

  • Implement Delta tables optimized for Databricks SQL service

How Databricks improved query performance by up to 2.2x by automatically optimizing file sizes | Databricks Blog

What’s new with Databricks SQL? | Databricks Blog

  • Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)

I think this part is best served by understanding the content above and knowing when to use each feature.

Other Links

That’s it for links that fall into each section. Here are some additional resources that relate to the overall topic and should be useful.

Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering | Databricks Blog

There is also this ebook, free in exchange for your email address: Delta Lake: Up & Running by O’Reilly | Databricks. It has 10 chapters on Delta Lake to support your knowledge!
