As software engineers, sometimes we need to release significant code changes. And sometimes, we have to do it in a ‘big bang’. I’ve seen a few of these, both in an enterprise context (with weekend-long cutovers managed by teams of 20) and more recently at GoCardless. Here are some things I’ve learned along the way.
I am defining a cutover as releasing any software change in a ‘big bang’ fashion (i.e. it will impact all your customers at once). Cutovers generally represent points of high risk in software: stuff rarely breaks when you’re not changing something (scaling issues aside), and if you are forced to release something to everyone at once, then if stuff breaks it’s likely to have a significant impact. While this post has advice about how to de-risk such events, the best plan is to avoid them all together. In modern multi-tenant systems, there are many ways of slowly applying a change (feature flags etc.) which can help avoid the big bang, but assuming you have no other options, here are some things to consider.
When designing the ‘to-be state’ of your system, it’s worth taking time to think about how you can get from here to there (i.e. thinking about the cutover). This shouldn’t fundamentally alter your design (that feels like the tail wagging the dog), but there are likely to be quick wins which help the cutover without cannibalising the design. There’s just no need to make your life any more difficult than it already is. It’s also worth thinking about potential intermediate steps at this early stage, which can have other benefits in terms of incrementally releasing code and learning more about it’s behaviour before the ‘big bang’ day.
To be honest, successful cutovers are almost all about planning (the whole blog post could be titled planning), but these are some specific things to think about.
What could possibly go wrong?
So called ‘pre-mortems’ are a really useful exercise: they help identify the high risk parts of a change and trigger great conversations about possible mitigations. They also help everyone to gain a shared understanding of what people are scared of and why, which will help make good decisions going forwards.
When should we do this?
Consider when you will have the right people in the office in case of emergencies (cutovers over Christmas can be scary) but also your customers: it might be that an issue on your side results in your customer having to take some manual action. They’ll be much happier being told that at 11am on a Tuesday than 5pm on a Friday, or on Christmas Eve after their code freeze is in place. Of course you also want to consider usage patterns to avoid peaks. The riskiest thing you can do is to combine lots of cutovers together (which happens a lot in enterprises due to the release management process). It’s better to make two changes on different days where one isn’t quite at the perfect time, than decide that there is a perfect cutover slot once per quarter and increase your risk by making all your changes at once. This also goes back to our discussion above: making changes incrementally is often a lot safer.
Who do we need to tell?
Some cutovers might only involve a couple of people (perhaps a routine infrastructure upgrade) while others might involve a variety of teams across the company, and need lots of external comms to prepare customers (e.g. docker introducing rate limits). This is very much ‘horses for courses’ - you need to consider the likelihood of failure, the impact of those potential failure scenarios and of course whether customers will notice a behaviour change. Even if you decide not to pre-warn customers, give your support and customer success teams as much information as possible about what’s happening. It’s effort up front but if you’re in the middle of an incident and the support team can handle customers’ questions without your input, you’ll be grateful.
What does success look like?
Having a clear understanding of the metrics for success will make the end of the cutover smoother - perhaps there are some SLAs you can monitor or tests that you can run. There’s nothing more embarrassing than sending the ‘we cutover and everything was fine’ email, just to find a couple of hours later that something was in fact broken. It’s also useful to identify issues quickly, particularly in a multi-stage cutover where it might be easier to resolve the issue if you find it straight away.
As well as communicating, there’s lots of pre-work you can do in the run up to a cutover. The aim of the game here is to de-risk the cutover, so anything you can do before the day itself is another thing ticked off the list (and another thing that can’t go wrong). Examples of this include:
- Creating relevant rows in a new table (provided the data won’t change).
- Dry-running whatever you can in a test environment.
- Developing dashboards or other tools that will help monitor your progress.
- Running queries to identify edge cases that may need manual intervention (think about what assumptions your migration code makes, and find the outliers.
4. On the day
Again, the key here is planning.
- Give yourselves plenty of time and space. Don’t rush unless you have to, and try to avoid multi-tasking where possible.
- Have a runbook that includes everything that needs to get done on the day, and try to keep it up-to-date if the plan changes.
- Focus on communicating so that everyone knows what the current status is. This will also help resolve issues if they arise.
- Identify clear roles and responsibilities, ideally including a ‘commander’ co-ordinating the whole thing.
- Use go/no-go checklists to help discussions about whether to proceed with a change.
- Understand the various rollback points (and have a strategy, ideally that you’ve tested). Also call out the ‘point of no return’ if there is one.
5. Aftercare / Post Go-Live Support (PGLS)
The highest risk of a cutover is usually just after you’ve made the change - surviving first contact with production is hard - so after-care is important, both in the short term (hours) and the medium term (days or weeks)
- Stay alert: use your observability stack to assess general system health, and have someone with detailed knowledge of the change available to help triage any new issues.
- Review success indicators: use the metrics you determined in your planning phase to understand if the change is working as expected.
- Post mortem: review the cutover as a team, identify lessons learned (and any actions to mitigate in future). You may also want to gather feedback from other stakeholders (e.g. support or customers) as inputs to this.