I recently took and passed the Machine Learning Professional Certification from Databricks. I’ll share some thoughts here and a few links to useful resources as well. I tend to try and read some written resources as a revision strategy after completing the learning path. There are many good posts on the Databricks blog that add some nice context and readability on top of the documentation.
All information is relevant to the syllabus that was issued on 1st April 2024, which is current at the time of writing. You can check the latest syllabus at https://www.databricks.com/learn/certification/machine-learning-professional
If you are planning on taking this exam, then do follow the learning plan on the learning portal in addition to reading up with the links below.
If you are interested in my revision links for the Databricks Data Engineer Professional exam, see here.
Background Reading
First recommendation is the Big Book of MLOps. Read about it and get the pdf from this page. This document will give you a lot of the big picture that you’ll need when studying the details for the exam. I find setting big picture alignment early on helps with mapping concepts to the specifics as you learn.
The MLOps End to End Pipeline from dbdemos is a very quick and easy way to get hands on with many of the topics in the certification.
Section 1: Experimentation
This section is worth 30% of the overall available marks for this exam. There are two main topics here: data management with Delta tables and experiment tracking with MLflow.
If you have studied for the Data Engineer or Data Analyst certifications then you’ll be up to speed on the advantages of using Delta. Revising this topic from the perspective of the ML engineer reminds you of the top-level features for the use case. You can read more here.
Experiment tracking is handled by the managed MLflow component in Databricks. Experiment tracking is such a great use case and you can use the autologging feature to make it even easier. This is a great example of feature design being unobtrusive and making the workflow easier.
Here is an intro page from Databricks and the MLflow documentation for experiment tracking.
You can even use MLflow locally via a simple pip install. The quickstart guide for the tracking server covers the basics and is an easy way to get a feel for the process it is supporting.
Section 2: Model Lifecycle Management
Another section making up 30% of the available marks, again focused on the capabilities of MLflow.
Preprocessing logic: Read about model flavours, especially pyfunc, on the MLflow site. This is another great capability of MLflow that also helps it fit well into the wider workflow. This page covers more of the why and gives some more useful details.
Model Management
The model registry pages on the MLflow site will give you the concept definitions and an overview of the workflow. Once you have the concepts and workflow straight in your head, revising the specifics is far easier.
Model Lifecycle Automation:
Databricks blog article on Model Registry Webhooks
Section 3: Model Deployment
This section of the exam covers the 3 deployment patterns: batch, streaming and real-time.
I recently heard batch being described as a special case of real-time, where the data accumulates for longer before being processed. As a result, there is an impact on how you manage the data (partitioning, z-ordering, out of order streaming arrival) and trade-offs for each approach. If you have studied the Data Engineering exams from Databricks many of the concepts relating to batch and streaming in this exam will be familiar.
This blog post on real-time feature computation also covers some of the reasons for choosing batch and streaming. Unity Catalog is now used as a feature store, but this was not the case when the blog post was written. https://www.databricks.com/blog/best-practices-realtime-feature-computation-databricks
Section 4: Solution and Data Monitoring
All about drift! Read an introduction to drift in ML on the Databricks blog. Several types of drift can occur, so you need to know the definitions of feature, label, prediction and concept drift, as well as the actions to take when faced with each type.
There are several approaches to monitoring and testing for drift including summary statistic monitoring, Jensen-Shannon divergence, the Kolmogorov-Smirnov test and the chi-squared test. Knowing when to choose an approach will be tested on the exam.
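To make two of these measures more concrete, here is a small sketch in plain Python with no external libraries. The function names are my own, and in practice you would more likely reach for a statistics library that also gives you p-values. The Jensen-Shannon function expects two discrete distributions over the same bins, and the Kolmogorov-Smirnov function compares two samples of a numeric feature.

```python
import math
from bisect import bisect_right


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions given as equal-length lists of probabilities."""

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins of a.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # The mixture distribution sits halfway between p and q.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is less than or equal to x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


# Made-up data: a binned feature at training time versus in production.
training_bins = [0.25, 0.25, 0.25, 0.25]
production_bins = [0.10, 0.20, 0.30, 0.40]
drift_score = js_divergence(training_bins, production_bins)
```

With log base 2 the Jensen-Shannon divergence is symmetric and bounded between 0 and 1, which makes it convenient to threshold. The Kolmogorov-Smirnov statistic applies to numeric features, while the chi-squared test is the usual choice for categorical ones.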
Some closing thoughts
I really enjoyed learning more about taking machine learning to production with the Databricks platform.
MLflow is an integral part of the processes tested in this exam. It’s a great project and has great documentation. You can use this project as part of the Databricks platform or as a Python library, and it has even been integrated into Azure ML. Seeing the workflows that MLflow supports really shows how the space has matured in the past 15 years, and demonstrates the software engineering practices and DevOps techniques that are being applied in this space now.
As a student of mathematics and computer science, it’s great to see names like Shannon and Kolmogorov in this context. The fact that ideas which initially came out of information theory and statistics are still relevant today and applied in modern technology really shows the ingenuity of the researchers. The innovations they provided many decades ago help us tackle the complex problems we are facing today.
Overall, this was a very interesting topic for a professional exam and I can immediately see the benefits of using Databricks and MLflow to support ML projects and taking them to production.
If you’re a data architect or data engineer looking to enhance your skills and stay ahead of the curve, consider pursuing the Databricks Machine Learning Professional Certification. This certification will provide you with a solid foundation for collaborating with ML engineers and data scientists, and help you bring ML projects to production efficiently.
Start your journey today by exploring the official syllabus and utilizing the resources mentioned in this post. If you have any questions or want to share your own experiences, feel free to leave a comment below. Let’s support each other in achieving our professional goals!