Is embedded Liquibase the right way?
Embedding Liquibase into the startup process of an application is a very common pattern for good reason
I agree with this statement from Liquibase blog to a large extent and use Liquibase in most of the Spring Boot + RDS projects I have worked on.
But embedding Liquibase into startup processes comes with caveats:
- DevOps friends would kindly point out that it violates Segregation of Duties practices to run an Application as Admin. This emits a bad smell.
- Database migration is a natural part of deployment instead of Application startup. When embedding Liquibase into an application, a risky database migration task is attempted every time a new instance comes up. This slows down the startup process, wastes computation resources and emits more carbon dioxide.
I knew the caveats above all along and I thought they were not too bad until a hot summer day at work in 2019.
In a small and crowded office with a broken air conditioner, a Post Incident Review meeting continued. I was explaining why our Account microservice had gone offline midday and could not recover itself without manual intervention against the production database of a financial service. The audience was a group of serious looking colleagues from Ops, DevOps, PM office and customer support team. After the root cause analysis we reviewed the incident response actions. Tension in the room eased a bit afterwards as the Time To Recovery was not too bad thanks to prompt collaborative actions of teams:
- We arranged an urgent Change Request via CR manager with Change Approach Board to run a SQL command against production database (10mins)
- Coordinated across teams to postpone remaining releases for the day (5mins)
- Pair-executed a SQL command to release a database deadlock then rebooted the service (20mins)
- Communicated to all related teams about the recovery
Colleagues in the room started showing some signs of relief. Just then, the production support channel on Slack started flashing with a new alarm on a few laptops around the table. You probably have guessed it right, that Account service went down again…
How exactly had the incidents happened
For the curious souls, the incident was an textbook example of the butterfly effect, which involved Liquibase embedded in a Spring Boot microservice.
Background: during startup of each new application instance, the embedded Liquibase waits for, obtains and then releases a database change lock to make sure database migration is performed safely in a clustered environment
We had released a new version to Account service that morning with a resource thirsty change. That change caused the service to hit the performance benchmark quite often so service instances were recycled much more frequently than usual.
As more instances entered the startup phase, competition for the same lock got intensive and the startup process prolonged further and further until it came close to the healthcheck timeout of the container orchestrator.
Eventually, there was an instance that got killed right after obtaining the lock without a chance to release it. That lock then became a Deadlock. After that, no new service instances would make their way through startup.
With no more new instances joining the force, remaining healthy services received more and more requests and fell down one by one until all service instances were out.
How we fixed it
The second emergency hotfix on that hot summer day
- to manually release the Liquibase deadlock and reboot the service [again]
- to bump up the service size to avoid frequent instance recycling, which increased the operational cost by quite a bit
How to release the deadlock
The standard fix is to run the Liquibase unlock command or connect to database directly and run a SQL update statement simple as
UPDATE DATABASECHANGELOGLOCK SET LOCKED=0, LOCKGRANTED=null, LOCKEDBY=null where ID=1; commit;
One step further — detect and release the deadlock automatically
Constant manual remediation to a production environment is unacceptable in regulated industries, hence automation of the remediation is naturally the next step.
At that time, it felt right to move on to schedule a query to monitor the DATABASECHANGELOCK table and release the lock if it has not been released beyond a threshold. I thought of three ways to schedule such a job within three seconds, how good is that?!
Feeling like we were on the correct path, I set off to find a reasonable threshold. It cannot be too short otherwise we risk running the migration twice; it cannot be too long because that means another potential production outage.
After some reflection on my past experiences, I decided to pick two minutes as a reasonable starting point for most releases without digging into history logs.
Time to validate the idea against some edge cases — what if a release involves a time-consuming database update that cannot not finish within five minutes? Setting it to two minutes will cause the lock to be released and database migrated twice. Maybe we can change the threshold for that particular release and revert back afterwards? It suddenly felt like playing catch-22. But production is definitely the wrong venue for that game.
Avoid the deadlock (an <Are your Lights On> moment)
After an aha moment, I realised the real problem here is how to avoid the deadlock rather than to release it.
Embedding Liquibase in Spring Boot applications is definitely out the window and manual database migration for every release was out even earlier. How about placing database migration back into the deployment stage where it has truly belonged all this time?
That sounds about right. A solution was formed soon afterwards where the key part of the solution was to execute the Liquibase database migration from the deployment pipeline. I will explain in detail soon.
There is nothing new under the sun
I told Shane my story and he got the problem instantly. After examining the deployment strategy, we realised that my previous solution would not work as is because Adatree’s SaaS offering was designed to distribute versioned service products across AWS account boundaries to customer accounts via AWS Service Catalog. The service registry uplifts the deployment model to a new level but the extra abstract layer also took away the access from Adatree’s deployment pipeline to runtime environments in our customers’ AWS accounts.
But, above all we are determined to avoid the deadlock issue and we love technical challenges like this. In the end we overcame the restriction utilising AWS Lambda Custom Resource and AWS Cloudformation.
Comparing those two solutions,
- Pipeline based solution is cloud agnostic and works fine when both the pipeline and target account are controlled by internal teams
- Lambda Custom Resource solution solves the challenge across account boundaries where target account is beyond the reach of deployment pipeline
The right way
- Database migration should happen as part of the deployment
- Application update should be dependent on successful Liquibase execution when database migration is needed
- Application update should be abandoned should the database migration fail
- Database migration task requires admin access to Database
- Application should only be granted app user access
- Pipeline based solution
If you have full control of the deployment pipeline and access to the runtime environment from the pipeline, this is the relatively simple solution for it. Liquibase could be invoked via Maven plugin, container image or SDK.
2. AWS Lambda-backed Custom Resource solution
If you are distributing your services as self-contained product packages to customer accounts via cloud native services such as AWS Service Catalog. It is likely that
- Deployment pipelines cannot reach the runtime environment
- Deployment pipelines have no write access to the runtime environment
So what do we do?
- We package the migration task as part of a versioned Service Catalog product so Liquibase only runs when there is a schema change
- We invoke the Liquibase task during deployment of the versioned product via AWS Lambda-backed Custom Resource
- We enforce the atomicity of database migration and application update using AWS CloudFormation dependency
p.s. For applications hosted on AWS and deployed using AWS CloudFormation, I would say “this is the way”. ;-)
Hope you have enjoyed this blog. Until next time, my friends.
We shape our tools and thereafter our tools shape us
Adatree focuses on enabling and empowering FinTech innovations via CDR (a.k.a Open Banking). We like to contribute back to the tech community as much as we can and we will opensource the second solution soon, please stay tuned.