Databricks: Getting Started and Crafting High-Quality, Testable Solutions

This is the companion post to my presentation at the Data Bristol meetup, Databricks: Crafting High-Quality, Testable Solutions!

This September I’m making my debut as a meetup presenter, as part of Databricks: Getting Started and Crafting High-Quality, Testable Solutions! I’ve been going to the Data Bristol meetup for a few years now. It’s a great group that I joined when I moved to this area. I have met lots of interesting people and gained some great insights from the talks and discussions. The September meetup is themed around Databricks. Holly Smith will be presenting, and I’ll be sharing my insights from applying software engineering practices to data projects, something that Databricks enables very well. The listing is here: https://www.meetup.com/databristol/events/302753494 . Connect with me on LinkedIn here.

Here are the links to the further reading I promised at the talk…

Background

Read the history and evolution of data platforms in one of the following books:

Fundamentals of Data Engineering

Available from O’Reilly

Also available as a free download (in exchange for your email address) here.

Deciphering Data Architectures

Available from O’Reilly

For the background of software engineering (SWE) and why these tools and processes exist, read The Phoenix Project

Unit testing

Microsoft Learn page on testing notebooks

Read more about Test Driven Development (TDD)

Specific functions for testing data frames on the Databricks blog

A good article on testing from the Apache Spark documentation

Linting

My post on the Databricks linting plugin: https://www.alexcole.net/databricks-linting-with-a-new-plugin-for-pylint/

See also Ruff for fast Python linting

Automation

A great page from Microsoft Learn that covers setting up Git, refactoring code out of the notebooks, adding tests, and then running the test job as an automated action. This was the basis of the demo, with the linting setup added as an extra step.

https://learn.microsoft.com/en-us/azure/databricks/notebooks/best-practices
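To give a flavour of the "refactor out of the notebook, then add tests" step, here is a minimal sketch in plain Python. It is not the code from the Microsoft Learn page or from my demo: the function names (normalise_column_name, clean_records) and the sample data are hypothetical, and I use lists of dictionaries instead of Spark DataFrames so it runs anywhere. The idea is that once the logic lives in an importable module rather than a notebook cell, pytest can pick up the test_ function and an automated job can run it on every push.

```python
from typing import Dict, Iterable, List


def normalise_column_name(name: str) -> str:
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def clean_records(records: Iterable[Dict]) -> List[Dict]:
    """Normalise the keys of each row and drop rows with a missing (None) value.

    Hypothetical helper: stands in for logic moved out of a notebook cell.
    """
    cleaned = []
    for record in records:
        row = {normalise_column_name(key): value for key, value in record.items()}
        if all(value is not None for value in row.values()):
            cleaned.append(row)
    return cleaned


# pytest collects functions named test_*; plain assert statements are enough.
def test_clean_records_drops_rows_with_missing_values():
    raw = [
        {"Country Name": "UK", "Daily Cases": 10},
        {"Country Name": "FR", "Daily Cases": None},  # incomplete row
    ]
    assert clean_records(raw) == [{"country_name": "UK", "daily_cases": 10}]
```

Saved in a file such as test_cleaning.py (a made-up name), this would be picked up by running pytest from the repo root, which is the command an automated test job would typically call.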

Here is another good article from the Databricks blog

GitHub Repo

Here is the repo I used for the demo. It contains the code from the Microsoft Learn pages, plus the linting process and automation.
