My customer has a data warehouse solution. It consists of nine databases with a well-defined architecture. They need to be deployed in a specific order – the source database first, followed by various dependent databases in sequence.
Due to the tight coupling of the databases, they all live in the same git repo in Azure DevOps Services, with each database in a different directory. In our first attempt at a CI process, a commit to any database triggered a rebuild and deployment of all nine databases, one after the other. Some databases took longer than others, but at an average of about five minutes per database, the entire build took about an hour. This long build time was becoming a real pain for the team.
We wanted to change our configuration in Azure DevOps Services so that when a developer pushes some changes, only the databases that they updated would get rebuilt. We hoped this would get the average build time down to 5 or 10 minutes. Crucially, however, the databases still needed to be built in the correct order.
The solution we chose required me to learn to use the Azure DevOps Services API from a PowerShell script. I had never done this before, and it wasn’t as easy as I hoped it would be, so having figured it out (with a little help from the folks referenced below) I wanted to document and share what I’d learned.
This blog post would not have been possible without the help and advice of various community folks. Thank you Giulio Vian for replying to my email to the MVP distribution list. Thank you to Chris Gardner and Richard Fennell for the Generate Release Notes Azure DevOps extension and sharing the code on GitHub. Thank you to Donovan Brown for this blog post. And thank you to Rik Hepworth for your time earlier today at MVP Summit.
The new build process
I considered a few ways to solve the problem. My first idea was a path filter per build task – effectively disabling build steps if no changes were detected in a given directory. I knew I could do this at the build level by adding a Path filter to the build trigger, but there isn't a way to add a path filter to a particular task within a build.
Next I thought about splitting each database into a separate build pipeline. This would allow databases to be built independently, but where multiple databases were updated, I still wanted them to build in the right order. I could set up a "Build completion" trigger for the downstream databases, but then I'd be back to running all the databases one by one again.
I could have added a table to each database and updated my build process to write the commit hash to it on every build; then I could query that table within my build script to see whether the database had been updated. However, since all my databases were in the same git repo, simply comparing the most recent commits wouldn't work. I would probably have needed to rely on manual updates or a schema compare – which would take as long as simply building the database, and was beginning to sound complicated.
Finally I ended up back at my first solution. It seemed by far the simplest. However, since this wasn’t available out of the box, I was going to have to use the API to make it work.
Creating an API call in PowerShell using a hosted agent
I wanted to use the API to get a list of commit hashes associated with the current build. Then I could use git log to find the most recent commit hash for a given file path. If the commit hash for the given file path appeared in the list of hashes for the current build I knew that something in that directory had changed and I needed to build the database in that directory. Otherwise I knew nothing had changed and I could skip that directory.
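That check can be sketched in a few lines of PowerShell. This is my own minimal sketch, not the original script – the function name is mine:

```powershell
# Sketch: decide whether a database directory needs rebuilding by checking
# whether its most recent commit appears among the commits in this build.
function Test-DirectoryChanged {
    param(
        [string[]]$BuildCommitHashes,  # hashes associated with the current build
        [string]$DatabasePath          # relative path to a database directory
    )
    # Most recent commit that touched anything under $DatabasePath
    $latestHash = git log -1 --pretty=format:%H -- $DatabasePath
    return $BuildCommitHashes -contains $latestHash
}
```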
To get the list of git hashes associated with the current build, I tried to use the syntax documented here. Unfortunately, even comparing Donovan's post with the docs, I really struggled to figure out how to make the authentication work. I was encoding my PAT (Personal Access Token) correctly, but nothing was happening.
The problem was that for my POC I'd been using an Azure DevOps hosted agent, and the hosted agent was running under a service account. Following my conversation with Rik Hepworth, and some rooting around in Richard and Chris's code (referenced above), we found this little gem, which allowed me to use the default system access token on the build agent:
$pat = "Bearer $env:System_AccessToken"
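With that token in an Authorization header, the commits for the running build can be pulled from the "Get Build Changes" REST endpoint. A hedged sketch – the function name is mine; the predefined variables and the endpoint are standard Azure DevOps:

```powershell
# Sketch (my own wrapper): fetch the commit hashes associated with the
# current build via the Builds REST API, authenticating with the system token.
function Get-BuildCommitHashes {
    param(
        [string]$CollectionUri = $env:SYSTEM_TEAMFOUNDATIONCOLLECTIONURI,
        [string]$Project       = $env:SYSTEM_TEAMPROJECT,
        [string]$BuildId       = $env:BUILD_BUILDID,
        [string]$Token         = $env:SYSTEM_ACCESSTOKEN
    )
    $headers = @{ Authorization = "Bearer $Token" }
    # CollectionUri already ends with a trailing slash
    $url = "$CollectionUri$Project/_apis/build/builds/$BuildId/changes?api-version=5.0"
    $changes = Invoke-RestMethod -Uri $url -Headers $headers -Method Get
    return $changes.value | ForEach-Object { $_.id }
}
```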
I also had to configure the agent job to allow scripts to access the OAuth token:
Now I had all the data I needed. All that was left was to write a script (to run at the start of the build) that takes a parameter for each file path containing a database, tests whether each path has seen any updates, and sets a variable to "true" or "false" accordingly. To finish it off, I added a "Run condition" to the corresponding build steps for each of the individual databases using this funky syntax. If the variable is set to "true", the database is built; otherwise the task is skipped:
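The custom condition on each task looks something like this (the variable name here is illustrative, not from the original pipeline):

```
and(succeeded(), eq(variables['Build_StagingDb'], 'true'))
```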
I also set a couple of variables in my pipeline to hold the relative paths to my database directories from the git root:
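For example (names and paths are illustrative):

```
StagingDbPath   = src/StagingDb
WarehouseDbPath = src/WarehouseDb
```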
Here's the script. I added it to the root of my source control (and some of the code relies on that – I should tidy that up, but it's 11pm and I'm jetlagged, so I'll leave that as a task for the reader). If I were to implement this in a production environment (rather than my dev POC instance), I would put the script in a "BuildScripts" directory and fix the relative paths to the database directories accordingly:
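The core of the approach can be sketched like this. All names below are my own choosing; the `##vso[task.setvariable]` logging command is the standard way to set a pipeline variable for later tasks:

```powershell
# Minimal sketch (my own names throughout): given the commit hashes in the
# current build, flag each database directory as changed or unchanged by
# setting a pipeline variable that the task run conditions can test.
function Set-DatabaseBuildVariables {
    param(
        [string[]]$DatabasePaths,      # relative paths to the database directories
        [string[]]$BuildCommitHashes   # from the "Get Build Changes" REST call
    )
    foreach ($path in $DatabasePaths) {
        # Most recent commit that touched this directory
        $latestHash = git log -1 --pretty=format:%H -- $path
        $changed = $BuildCommitHashes -contains $latestHash
        # e.g. "src/StagingDb" -> variable "Build_StagingDb"
        $varName = 'Build_' + (Split-Path $path -Leaf)
        # Azure DevOps logging command: sets a variable for subsequent tasks
        Write-Output "##vso[task.setvariable variable=$varName]$($changed.ToString().ToLower())"
    }
}
```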
- This script should probably be updated to check not just the git hashes from the current build, but the hashes from every build since the last successful build. (If a database build fails, I want to keep building it every time I run a build until it passes again. Otherwise, if someone else commits changes to another database, the build will probably run successfully, despite the fact that one of the databases is broken.) You can see an example of how that might be achieved by examining the Get-CommitInfoFromGitRepo function in this script.
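One hedged way to approach that (not the code from the referenced script – the function name is mine, the Builds list endpoint and predefined variables are standard) is to look up the source version of the last successful build and treat everything committed since then as changed:

```powershell
# Sketch: find the commit the last successful build of this definition was
# built from, so changes can be compared against that point instead of only
# against the current build's own commits.
function Get-LastSuccessfulSourceVersion {
    param(
        [string]$CollectionUri = $env:SYSTEM_TEAMFOUNDATIONCOLLECTIONURI,
        [string]$Project       = $env:SYSTEM_TEAMPROJECT,
        [string]$DefinitionId  = $env:SYSTEM_DEFINITIONID,
        [string]$Token         = $env:SYSTEM_ACCESSTOKEN
    )
    $headers = @{ Authorization = "Bearer $Token" }
    $url = "$CollectionUri$Project/_apis/build/builds" +
           "?definitions=$DefinitionId&resultFilter=succeeded&`$top=1&api-version=5.0"
    $builds = Invoke-RestMethod -Uri $url -Headers $headers -Method Get
    return $builds.value[0].sourceVersion
}
# A path has changed if any commit since that version touched it, e.g.:
#   git log --pretty=format:%H "$lastGood..HEAD" -- $path
```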
- You might want to add manual overrides so that you can trigger a complete build, or a build of a specific set of databases, on demand. I would do this by creating variables that default to false but can be turned on as needed for manual builds; those variables could then override the default logic in the script to turn databases on or off as required. You might also like to run a full build nightly, just to double-check that no broken cross-database dependencies were missed during the day while builds were optimised for speed.
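An override like that could be wired into the same run conditions – for example (variable names illustrative):

```
or(eq(variables['ForceFullBuild'], 'true'), eq(variables['Build_StagingDb'], 'true'))
```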