This is the companion post to my presentation at the Data Bristol meetup – Databricks: Crafting High-Quality, Testable Solutions!
This September I’m making my debut as a meetup presenter, as part of Databricks: Getting Started and Crafting High-Quality, Testable Solutions! I’ve been going to the Data Bristol meetup for a few years now. It’s a great group that I joined when I moved to this area. I have met lots of interesting people and gained some great insights from the talks and discussions. The September meetup is themed around Databricks. Holly Smith will be presenting, and I’ll be sharing my insights from applying software engineering practices to data projects, something that Databricks enables very well. The listing is here: https://www.meetup.com/databristol/events/302753494. Connect with me on LinkedIn here.
Here are the links to the further reading I promised at the talk…
Background
Read the history and evolution of data platforms in one of the following books:
Fundamentals of Data Engineering
Available from O’Reilly
Also, available as a free download (in exchange for your email address) here.
Deciphering Data Architectures
Available from O’Reilly
For the background of software engineering (SWE) and why these tools and processes exist, read The Phoenix Project
Unit testing
Microsoft Learn page on testing notebooks
Read more about Test Driven Development (TDD)
Specific functions for testing data frames on the Databricks blog
A good article on testing from the Apache Spark documentation
Linting
My post on the Databricks linting plugin: https://www.alexcole.net/databricks-linting-with-a-new-plugin-for-pylint/
See also Ruff for fast Python linting
Automation
A great page from Microsoft Learn that covers setting up Git, refactoring code out of the notebooks, adding tests, and then running the test job as an automatic action. This was the basis of the demo, with the linting setup added as an extra step.
https://learn.microsoft.com/en-us/azure/databricks/notebooks/best-practices
Here is another good article from the Databricks blog
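To give a flavour of the "refactor out of the notebook, then add tests" step described above, here is a minimal sketch in plain Python. It is not the code from the Microsoft Learn page or from my demo: the function names (`filter_country`, `total_cases`) and the row layout are made up for illustration, and plain lists of dictionaries stand in for a Spark DataFrame so the example runs anywhere without a cluster.

```python
"""Hypothetical helper module, e.g. transforms.py (the name is illustrative)."""


def filter_country(rows, country):
    """Return only the rows whose 'country' field matches `country`."""
    return [row for row in rows if row.get("country") == country]


def total_cases(rows):
    """Sum the 'cases' field, treating missing or None values as zero."""
    return sum(row.get("cases") or 0 for row in rows)


# Pytest-style test: plain asserts on a small in-memory input.
def test_total_cases_for_one_country():
    rows = [
        {"country": "UK", "cases": 10},
        {"country": "UK", "cases": None},
        {"country": "FR", "cases": 7},
    ]
    uk_rows = filter_country(rows, "UK")
    assert len(uk_rows) == 2
    assert total_cases(uk_rows) == 10


if __name__ == "__main__":
    test_total_cases_for_one_country()
    print("all tests passed")
```

Once the logic lives in small functions like these, the notebook only has to import and call them, and the same tests can run locally or as an automated job on every push.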
GitHub Repo
Here is the repo I was using for the demo part. It contains the code from the Microsoft Learn pages, with the linting process and automation added.