Databricks linting with a new plugin for Pylint

Databricks Labs released a plugin for the Python linter Pylint this week. I had a chance to play with it, and here are the steps I used to try it out, as well as a few thoughts.

You can install with pip – see here: https://pypi.org/project/databricks-labs-pylint/
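Installation is a standard pip install into whatever environment you run Pylint from:

```bash
# Install Pylint and the Databricks Labs plugin
pip install pylint databricks-labs-pylint
```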

Databricks Labs projects are created by the field team to help customers; you can see the current projects here.

One of the big advantages of using the Databricks Data Intelligence Platform is the code-first nature of the platform. Writing code enables DevOps and CI/CD. The libraries, tooling and patterns for DevOps-type tasks fit well with code developed on the Databricks platform. In the past, many data tools required a visual editor that generated XML. Some providers would add basic git integration, but engagement with the full set of CI/CD tools was not really possible. Many of the old integrations were clumsy and felt like a box-ticking exercise on the supplier's side.

Linters

Linting has a history that goes all the way back to C and 1978 (see Wikipedia). The purpose of a lint tool is to perform a series of checks that flag things like programming errors and stylistic issues. Linters are part of what is known as static analysis, that is to say they analyse computer programs without running them.

Linters can scan text files in your IDE or text editor, and can be run locally or as part of an automated process, e.g. when code is checked in. You can think of things like line length, indentation and capitalisation as some of the things a linter may check. The Google style guide for Java has been implemented in a linter that I had previously used. The documentation is here and is a good example of what a good linter can check for.

Pylint

Pylint is one of the most popular linters in the Python space.

The Pylint plugin adds some Databricks-specific checks to help with code quality and adherence to standards.

By tapping into Pylint, the Databricks add-on benefits from existing integrations built on Pylint, such as:

  • VSCode
  • IntelliJ/PyCharm
  • Airflow

And it can also be used as part of a:

  • GitHub Action
  • Azure DevOps Task
  • GitLab CodeClimate

It is suggested that you run the plugin with only the Databricks checks enabled, alongside Ruff. I also installed Ruff locally and it was very quick to scan the code.

VSCode Config

I had thought getting started with VSCode and the Pylint extension would be pretty easy, but I ended up using the command-line approach.

Pylint will highlight issues on Python files as you open them, but I could not see the specific Databricks warnings and errors.

I had tried to configure the extension to use the plugin by clicking on the cog on the right of the Pylint banner and selecting Extension Settings. From here I added the required item to the Pylint: Args setting.

It could have been an issue with my configuration, but the errors I saw stated that a filename needed to be passed as an argument. UPDATE: this is now working fine!
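For reference, the equivalent entry in settings.json for the VS Code Pylint extension looks roughly like this (the --load-plugins value is the one from the plugin's documentation; treat the rest as an assumption for your own setup):

```jsonc
{
  // Extra arguments passed to Pylint by the VS Code extension, so the
  // Databricks checks load when Python files are opened.
  "pylint.args": [
    "--load-plugins=databricks.labs.pylint.all"
  ]
}
```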

I could use it from the command line easily though, following the pip install. I verified it was working by running it with the suggested --disable=all flag followed by the --enable flag with the specific Databricks checks listed (see here), like this:
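Something along these lines, with the enabled check names being illustrative (the full list is in the plugin README):

```bash
# Load the Databricks plugin, disable all standard checks, then enable only
# a selection of the Databricks-specific ones (names are illustrative).
pylint --load-plugins=databricks.labs.pylint.all \
       --disable=all \
       --enable=dbutils-fs-cp,dbutils-notebook-run,notebooks-too-many-cells,notebooks-percent-run \
       path/to/notebook_or_module.py
```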

GitHub Action

I also added it to GitHub as an action.

I had recently implemented a GitHub action to use the Databricks CLI to run notebooks on pull request, following along with the documentation here: https://learn.microsoft.com/en-us/azure/databricks/notebooks/best-practices and using code that was published here: https://github.com/databricks/notebook-best-practices. I decided this would be my base project to extend with the linter.

These are the steps I took to configure it.

First I added a linting.yml file to the .github/workflows folder already present in my repo.

The contents of my file are here: https://raw.githubusercontent.com/alcole/databricks-best-practices/main/.github/workflows/linting.yml. I used the documentation for the Pylint Action linked from the Databricks repo, but added the steps required to run the checks on a push (on: [push]) and the necessary lines of configuration to pull it all together.
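A simplified sketch of a workflow along those lines (the action versions, Python version and target paths are assumptions rather than a copy of the file linked above):

```yaml
# .github/workflows/linting.yml (sketch)
name: Linting

on: [push]

jobs:
  pylint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pylint with the Databricks plugin
        run: pylint --load-plugins=databricks.labs.pylint.all --recursive=y .
```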

I added databricks-labs-pylint and pylint to the requirements.txt file.
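That is, two extra lines in requirements.txt (left unpinned here, though you may want to pin versions):

```text
pylint
databricks-labs-pylint
```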

Then I added a pyproject.toml file to the root directory of the repo with the following contents.
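Roughly along these lines, as a sketch rather than the exact file (the messages-control section is only needed if you limit Pylint to the Databricks checks, and the check names are illustrative):

```toml
[tool.pylint.main]
# Load the Databricks Labs plugin so its checkers are available
load-plugins = ["databricks.labs.pylint.all"]

# Optional: disable the standard Pylint checks and keep only the Databricks
# ones, per the suggestion to run the plugin alongside Ruff
[tool.pylint.messages_control]
disable = ["all"]
enable = ["notebooks-too-many-cells", "notebooks-percent-run", "dbutils-fs-cp"]
```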

I had some issues getting it to run initially, but this was due to the compatibility of some of the dependencies listed in the requirements.txt. It seemed the demo project used a very old version of pyarrow that was not compatible.

Adding badges to your project readme

For the first badge you can simply add this line to your readme:
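The standard GitHub Actions status badge for a workflow file named linting.yml takes this shape (user and repo are placeholders for your own):

```markdown
![Linting](https://github.com/<user>/<repo>/actions/workflows/linting.yml/badge.svg)
```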

For the badge that shows the score and changes colour based on that score you can follow the instructions here: https://github.com/marketplace/actions/pylint-with-dynamic-badge

Some thoughts

I got very low scores on some of the files I was checking, not due to the Databricks checks but the standard Pylint checks. The repo I was using was designed to demonstrate particular functionality for learning purposes, so it wasn't a good representation of a production project. It was interesting, though, to review the types of issues the linter flagged.

Getting up and running locally was fairly quick, but the GitHub action setup was a bit more involved for me. This is mainly due to not having previously spent time with Pylint in the GitHub Actions context. It would have been nice to see the Databricks-specific issues when opening files in VSCode, as the standard Pylint issues were showing at that point.

Some of the Databricks-specific checks, such as the %run or too-many-cells checks, are good inclusions and don't need much additional thought to understand. Looking at some of the Spark checks I did wonder about the reasoning behind why certain things were being checked. I did find an official Databricks Scala style guide but not an official one for PySpark, although you can google for blog posts and Palantir's PySpark style guide. Reviewing the outputs and the project made me think a bit more about what makes a good Databricks project.
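For illustration, this is the shape of code the notebook checks are aimed at, using the plain .py source format that Databricks notebooks export to (the comments reflect my reading of the checks rather than a definitive description):

```python
# Databricks notebook source
# MAGIC %run ./setup_helpers
# The %run magic above pulls in another notebook's code implicitly; the
# plugin's notebook checks flag this pattern (and notebooks with very many
# cells) in favour of explicit imports and smaller, more testable notebooks.

# COMMAND ----------
df = spark.table("my_catalog.my_schema.trips")  # `spark` is a notebook global
display(df)
```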

I’d not used Pylint before and I ended up downloading the Databricks plugin source code to spend some time with the implementation. I spent a bit of time on how the tests are implemented. You could potentially fork the project and add your own specific tests if needed. The astroid package is very powerful in this context and enables a lot of the logic to be implemented.
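To give a flavour of what astroid enables, here is a minimal, entirely hypothetical custom checker, not part of the Databricks plugin, that flags spark.sql() calls built from f-strings:

```python
from astroid import nodes
from pylint.checkers import BaseChecker


class FStringSqlChecker(BaseChecker):
    """Hypothetical checker: warn when spark.sql() is called with an f-string."""

    name = "hypothetical-databricks-style"
    msgs = {
        "W9901": (
            "spark.sql() called with an f-string; consider a parameterised query",
            "fstring-spark-sql",
            "Interpolating values into SQL text makes queries harder to audit.",
        ),
    }

    def visit_call(self, node: nodes.Call) -> None:
        # Match calls of the form <something>.sql(f"...")
        func = node.func
        if (
            isinstance(func, nodes.Attribute)
            and func.attrname == "sql"
            and node.args
            and isinstance(node.args[0], nodes.JoinedStr)
        ):
            self.add_message("fstring-spark-sql", node=node)


def register(linter) -> None:
    # Entry point Pylint calls when the module is loaded via --load-plugins
    linter.register_checker(FStringSqlChecker(linter))
```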

The suggestion to use Ruff alongside only the Databricks checks from the plugin is definitely an approach I would consider taking on projects, rather than just using base Pylint plus the Databricks plugin, but further analysis would be required here. Ruff was working a lot faster than Pylint.

I have also seen the documentation expanded over the last few days with some useful information, and it is now on version 0.2. These are early days for the project, but it is a very useful and interesting one that will no doubt grow and mature over time.

See more posts on Databricks here
