Paprikati - Engineering blog from paprikati - Lisa Karlin Curtis (feed generated by Jekyll, 2021-10-03T18:10:16+00:00, https://paprikati.github.io/feed.xml)

⚓ Anchor Logs - 2021-10-03 - https://paprikati.github.io/2021/10/03/anchor-logs

<p>You probably use logs for debugging all the time (if not, you should). Many setups now use structured logging, enabling great exploration and aggregation tooling. There’s also probably a handful of engineers in your organisation who seem to have special powers when it comes to logs: in the heat of an incident, they can reliably find the smoking gun that <em>bit</em> faster. I don’t want to take anything away from them, they’re probably awesome engineers who are particularly skilled and practised at debugging. But wouldn’t it be great if all engineers could navigate the most important logs with that speed and accuracy?</p>
<h1 id="1-basic-logging">1. Basic Logging</h1>
<p>Logs are a really useful debugging tool. As the natural descendant of <code class="language-plaintext highlighter-rouge">console.log</code> debugging, my first web app printed loglines that looked something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11:37:29.827 [web] INFO initialising app
11:38:21.342 [web] INFO listening on Port: 3000
11:38:23.917 [web] INFO received API request /api/auth/login
11:38:24.012 [web] INFO returned 200 for request /api/auth/login
</code></pre></div></div>
<p>While definitely better than having no logging, this isn’t exactly ‘state of the art’.</p>
<h1 id="2-structured-logging">2. Structured Logging</h1>
<p>As soon as you start wanting to do any kind of aggregate analysis, or searching and filtering, the unstructured logs become difficult to manage. Many companies now use structured logs, usually JSON, that look something like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"@timestamp"</span><span class="p">:</span><span class="s2">"2021-10-01T11:37:29.827Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"api_request.received"</span><span class="p">,</span><span class="w">
</span><span class="nl">"ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"8.8.8.8"</span><span class="p">,</span><span class="w">
</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"INFO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"received API request /api/auth/login"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="s2">"/api/auth/login"</span><span class="p">,</span><span class="w">
</span><span class="nl">"request_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc-123"</span><span class="p">,</span><span class="w">
</span><span class="nl">"user_agent"</span><span class="p">:</span><span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is a massive improvement on our unstructured strings! We can now start to analyse these logs using computers as they’re more easily parseable - and we don’t need to write mad <code class="language-plaintext highlighter-rouge">grep</code> commands for the privilege. We can throw these into (for example) a big ElasticSearch cluster and explore them with a tool like Kibana, giving us searching, filtering and aggregation. I’d highly recommend enforcing that all loglines are structured, and have:</p>
<ol>
<li>a human readable message (<code class="language-plaintext highlighter-rouge">msg</code> in our example) to help with narrative</li>
<li>a machine readable key (in this case <code class="language-plaintext highlighter-rouge">event</code>) which can easily be filtered and also searched in the codebase.</li>
</ol>
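<p>As a minimal sketch (Python here; the event names and helper are illustrative, not a real library), a thin wrapper over the standard library can enforce that every logline carries both fields:</p>

```python
import json
import logging
import sys
from datetime import datetime, timezone

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")

def log_event(event, msg, **fields):
    """Emit one structured logline with a machine-readable `event` key
    and a human-readable `msg`, plus any extra context fields."""
    line = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "INFO",
        "event": event,
        "msg": msg,
        **fields,
    }
    logger.info(json.dumps(line, sort_keys=True))
    return line

log_event("api_request.received", "received API request /api/auth/login",
          path="/api/auth/login", request_id="abc-123")
```

<p>Because <code class="language-plaintext highlighter-rouge">event</code> and <code class="language-plaintext highlighter-rouge">msg</code> are positional arguments, it becomes impossible to emit a logline without them.</p>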
<p>As you work on a codebase for a while, and debug lots of different issues, you’ll find that there are a handful of loglines you use more than the rest. Often there’s a logline with the duration and status of an HTTP request, which is really useful when looking for patterns across your API. There’s also likely to be a logline fired when you enqueue, begin or complete a job, which is great when looking at queue latencies or trying to work out which jobs ran on a particular day. In my experience, these emerge naturally as a ‘happy accident’, and knowledge of them lives solely in the minds of the engineers doing the debugging. But we can do better.</p>
<h1 id="introducing-anchor-logs">Introducing: Anchor Logs</h1>
<p>Let’s start with a premise: every unit of work should have a single Anchor Log, which has as much relevant information as possible. A unit of work is likely to be either a request from the outside world (HTTP / RPC request) or a job (triggered from a queue, an event or a scheduler like a cron).</p>
<p>You’ll want all the information, including how long the unit of work took, so you probably want to emit this right at the end of the request/job.</p>
<p>Our API logline might look like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"@timestamp"</span><span class="p">:</span><span class="s2">"2021-10-01T11:37:29.827Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"duration"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.234</span><span class="p">,</span><span class="w">
</span><span class="nl">"endpoint_service"</span><span class="p">:</span><span class="w"> </span><span class="s2">"auth"</span><span class="p">,</span><span class="w">
</span><span class="nl">"endpoint_method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"login"</span><span class="p">,</span><span class="w">
</span><span class="nl">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"anchor.api_request"</span><span class="p">,</span><span class="w">
</span><span class="nl">"ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"8.8.8.8"</span><span class="p">,</span><span class="w">
</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"INFO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"http_method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"POST"</span><span class="p">,</span><span class="w">
</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"processed API request /api/auth/login"</span><span class="p">,</span><span class="w">
</span><span class="nl">"organisation_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"O123"</span><span class="p">,</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="s2">"/api/auth/login"</span><span class="p">,</span><span class="w">
</span><span class="nl">"request_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"abc-123"</span><span class="p">,</span><span class="w">
</span><span class="nl">"status"</span><span class="p">:</span><span class="w"> </span><span class="mi">200</span><span class="p">,</span><span class="w">
</span><span class="nl">"user_agent"</span><span class="p">:</span><span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"</span><span class="p">,</span><span class="w">
</span><span class="nl">"user_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"U123"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>And our job logline might look like:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"@timestamp"</span><span class="p">:</span><span class="s2">"2021-10-01T11:37:29.827Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"duration"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.234</span><span class="p">,</span><span class="w">
</span><span class="nl">"enqueued_at"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2021-10-01T11:33:22.706Z"</span><span class="p">,</span><span class="w">
</span><span class="nl">"event"</span><span class="p">:</span><span class="w"> </span><span class="s2">"anchor.job_worked"</span><span class="p">,</span><span class="w">
</span><span class="nl">"job_args"</span><span class="p">:</span><span class="w"> </span><span class="s2">"[</span><span class="se">\"</span><span class="s2">U123</span><span class="se">\"</span><span class="s2">, true]"</span><span class="p">,</span><span class="w">
</span><span class="nl">"job_class"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Jobs::DoSomeWork"</span><span class="p">,</span><span class="w">
</span><span class="nl">"job_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"J123"</span><span class="p">,</span><span class="w">
</span><span class="nl">"level"</span><span class="p">:</span><span class="s2">"INFO"</span><span class="p">,</span><span class="w">
</span><span class="nl">"msg"</span><span class="p">:</span><span class="s2">"worked job DoSomeWork"</span><span class="p">,</span><span class="w">
</span><span class="nl">"organisation_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"O123"</span><span class="p">,</span><span class="w">
</span><span class="nl">"queue"</span><span class="p">:</span><span class="w"> </span><span class="s2">"default"</span><span class="p">,</span><span class="w">
</span><span class="nl">"result"</span><span class="p">:</span><span class="w"> </span><span class="s2">"success"</span><span class="p">,</span><span class="w">
</span><span class="nl">"started_at"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2021-10-01T11:35:18.162Z"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
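<p>One way to guarantee exactly one anchor log per unit of work is a wrapper at the outer layer of the handler. A sketch in Python (field names mirror the examples above; the handler and request shape are hypothetical):</p>

```python
import json
import time
from datetime import datetime, timezone

def with_anchor_log(handler):
    """Wrap a request handler so exactly one anchor log is emitted when the
    request finishes - whatever happens - including duration and status."""
    def wrapped(request):
        started = time.monotonic()
        status = 500  # assume the worst if the handler raises
        try:
            status = handler(request)
            return status
        finally:
            anchor = {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "event": "anchor.api_request",
                "duration": round(time.monotonic() - started, 3),
                "http_method": request.get("method"),
                "path": request.get("path"),
                "status": status,
            }
            print(json.dumps(anchor, sort_keys=True))
    return wrapped

@with_anchor_log
def login(request):
    return 200  # stand-in: real code would authenticate and build a response

status = login({"method": "POST", "path": "/api/auth/login"})
```

<p>The <code class="language-plaintext highlighter-rouge">finally</code> block is what makes this an anchor: the logline fires even when the handler raises, so a missing anchor log means something genuinely unusual happened.</p>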
<p>But, I hear you ask, why have you written a whole blog post about these? And given them a stupid name?</p>
<p>Well, because they’re <em>super useful</em>, and knowing you’ve got them means you can double down and get even more value from them.</p>
<h2 id="-debugging">🐛 Debugging</h2>
<p>When you’re debugging an issue, it’s really useful to have an entrypoint into ‘what happened in this time period’ or ‘what did this user do’. Anchor logs are a great answer here: if you search <code class="language-plaintext highlighter-rouge">event:anchor.*</code> in your log explorer of choice, you can filter to a time period and perhaps an organisation/user and see at a glance all the key things that happened.</p>
<h2 id="-onboarding">🛳 Onboarding</h2>
<p>When you become familiar with a codebase, you start to learn how to find the key log lines, and roughly what shape they are (i.e. should I be filtering on <code class="language-plaintext highlighter-rouge">status:500</code>, or <code class="language-plaintext highlighter-rouge">http_status:500</code>, or <code class="language-plaintext highlighter-rouge">http_code:500</code>). If you have a really small number of anchor logs, and they’re documented well, it’s much faster for a new joiner to get up to speed and become a debugging wizard. Navigating your logs stops being a dark art that has only been mastered by an elite few. If everyone is debugging in the same way, it becomes much easier to communicate during the heat of an incident and allows you to invest in tooling for these specific code paths (e.g. pre-defined links to certain useful queries and visualisations).</p>
<h2 id="-aggregation--analysis">📊 Aggregation & Analysis</h2>
<p>While metrics are the ideal tool for aggregation and analysis, often I’ve ended up analysing logs as we haven’t had the metrics in place for the time period we’re considering. This is particularly common in incidents where:</p>
<p>(a) something very unexpected has happened and</p>
<p>(b) it’s probably happened recently, so it’s well within your log retention window.</p>
<p>When analysing logs, you’re generally faced with a much more limited query capability that doesn’t allow you to easily join information from different sources. Having anchor logs that are information rich allows you to easily answer questions like ‘are all the 422s related to one organisation’ or ‘were all the failed jobs enqueued after a particular timestamp (perhaps a deploy?)’.</p>
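<p>Even without a rich query language, information-rich anchor logs make these questions cheap to answer. A toy sketch (the loglines are fabricated; in practice they’d come back from your log store):</p>

```python
import json
from collections import Counter

# Three anchor loglines as they might come back from a log search.
loglines = [
    '{"event": "anchor.api_request", "status": 422, "organisation_id": "O123"}',
    '{"event": "anchor.api_request", "status": 200, "organisation_id": "O456"}',
    '{"event": "anchor.api_request", "status": 422, "organisation_id": "O123"}',
]

# 'Are all the 422s related to one organisation?' becomes a one-liner
# once every anchor log carries organisation_id.
orgs_with_422 = Counter(
    line["organisation_id"]
    for line in map(json.loads, loglines)
    if line["event"].startswith("anchor.") and line["status"] == 422
)
print(orgs_with_422)  # Counter({'O123': 2})
```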
<h1 id="putting-it-into-practice">Putting it into practice</h1>
<p>Making anchor logs work shouldn’t be too big an up front investment, and it’s something you can iterate and expand over time. The likelihood is that you already have something similar that you can use as a baseline and expand upon.</p>
<h2 id="ℹ️-get-all-the-information">ℹ️ Get all the information</h2>
<p>To make it easier to find the loglines you want, either looking for a specific user journey or for aggregation, you need to have lots of information on each log line. This is likely to include customer identifiers like <code class="language-plaintext highlighter-rouge">organisation_id</code> and <code class="language-plaintext highlighter-rouge">user_id</code> (in our example above). These are likely not to be immediately available wherever you log the anchor log (which, to get the duration, has to be right at the outer layer of your handler) so you’ll need to write some kind of abstraction to stash that information when you get it (e.g. in your authentication middleware) and pull it out when you log. You also want all the relevant timestamps so you can filter by those (e.g. <code class="language-plaintext highlighter-rouge">enqueued_at</code>) - generally speaking the cost of adding fields should be low, as long as they’re named sensibly. Try to keep consistent conventions across the different anchor logs: use the same key for your <code class="language-plaintext highlighter-rouge">duration</code> field across all of them, so no-one has to remember which one is which. (P.S. also use the same unit!)</p>
<p>You can read more about the benefits of wide events in this <a href="https://twitter.com/mipsytipsy/status/1009541219729281024?lang=en">twitter thread</a> from Charity Majors, CTO of <a href="https://www.honeycomb.io/">Honeycomb</a>.</p>
<h2 id="-documentation">📄 Documentation</h2>
<p>Write up documentation of what anchor logs are, and what your ones look like. If this can sit alongside the code that’s emitting the logs and keep the docs in line, that’s an added bonus. This then becomes the perfect onboarding material for new joiners struggling to penetrate the large pile of logs they are faced with when they get given their Kibana login (or equivalent). You can also build useful queries and visualisations for people to piggy-back on, making debugging even easier and faster.</p>
<h2 id="-track-down-those-missing-logs">🕵 Track down those missing logs</h2>
<p>Your logging systems are unlikely to have 100% retention guarantees (often there’ll be ‘best effort’ components), but you want to make sure you’re getting the vast majority of your logs. Wondering ‘did we drop the log’ is added complexity which no-one needs when debugging an issue. I’ve seen people claim ‘logs are a bit lossy’ as an explanation before really investigating any other alternatives. If more than one log seems to be missing while you’re debugging, bad luck is rarely the real explanation. Take the time to investigate other possibilities - there’s often a component shouting at you if you look hard enough. A few possible failure modes I’ve encountered are:</p>
<ul>
<li>Application logic where logs are not issued in certain cases (e.g. only logging for jobs that explicitly call a ‘logging’ function)</li>
<li>Situations where logs are never issued (e.g. if the request is timed out by the infrastructure layer)</li>
<li>Exceeding a single logline size limit, so something is truncating and mangling logs to make them unfindable, or dropping them altogether</li>
<li>Exceeding a buffer size, so something is dropping excess logs</li>
<li>Falling foul of a schema somewhere - did someone send a duration as an int and now your log ingester discards everything where duration is a float?</li>
<li>Pods being killed before they have time to get their logs somewhere persistent</li>
</ul>
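<p>The size-limit failure modes in particular can be defended against at emit time. A sketch (the 1024-character limit is an assumption; real pipelines have their own limits):</p>

```python
import json

MAX_FIELD_CHARS = 1024  # assumed per-field limit; tune to your pipeline

def safe_fields(fields, limit=MAX_FIELD_CHARS):
    """Truncate oversized string fields up front - and say so - rather than
    letting a downstream buffer mangle or drop the whole logline."""
    out = {}
    for key, value in fields.items():
        if isinstance(value, str) and len(value) > limit:
            out[key] = value[:limit]
            out[key + "_truncated"] = True
        else:
            out[key] = value
    return out

line = safe_fields({"event": "anchor.job_worked", "job_args": "x" * 5000})
print(json.dumps(line)[:60])
```

<p>Marking the truncation explicitly means you can alert on it, rather than discovering it mid-incident.</p>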
<p>If you have a low tolerance for missing logs, you will likely find and fix these issues and you can continue to rely on the presence of your anchor logs in almost all cases.</p>
<h1 id="when-should-i-use-anchor-logs">When should I use anchor logs?</h1>
<p>Any unit of work - an incoming request or async work (i.e. a job) - should definitely have an anchor log. There are a few other places that you might want to have an anchor log, depending on your architecture. Having an anchor log when emitting an event can be really useful, or perhaps having one for every third party API call. The key here is keeping the set as small as you can, while providing maximum value. If there are too many, engineers can’t keep track of them and they cease to be useful. It’s also important to keep them consistent - if the incoming HTTP request logs look pretty similar to the outbound ones, that’s less information an engineer has to remember to be able to easily navigate the logs. You’ll also need to think about your logging infrastructure and storage costs: issuing and storing a log isn’t free, and so there may be situations where you need to apply sampling techniques or lean more heavily on metrics.</p>

Complexity and Risk - 2021-09-27 - https://paprikati.github.io/2021/09/27/complexity

<p>In software, complexity comes from multiple components interacting with each other in different ways. We can imagine that all code is constructed from boxes which take some data as an input, do something, and provide data as an output. 
The more of these boxes we have, and the more logic we have connecting them, the more complex our system becomes.</p>
<p>It’s difficult to provide value without adding complexity. Imagine that we started a company which provided an API which would accept an English string and a target language and return a translation. There’s hidden complexity in this product: we need to have infrastructure to be able to serve these requests, perhaps some validation to make sure they’ve provided a valid string with characters that we recognise. We might need some authentication so that only people who have paid for the service can have access. Quickly, our simple product becomes something much more complex.</p>
<p>The aim of a project is often to add or enhance a feature. This usually means adding complexity to the system. As you scale your engineering team, and that team has time to build more features, the complexity of the product will continue to grow.</p>
<h1 id="complexity--risk">Complexity = Risk</h1>
<p>As your systems become more complex, the risk that something will go wrong increases.</p>
<p>Let’s get back to our translation start-up. Let’s say we wanted to add a feature so our API accepted a <code class="language-plaintext highlighter-rouge">source_language</code> parameter as well as <code class="language-plaintext highlighter-rouge">target_language</code>. So we diligently add it. Ignoring the actual translation logic (let’s assume we are using Google Translate in the background), we’ve added many different ways our code can fail:</p>
<ul>
<li>What if someone provides an invalid <code class="language-plaintext highlighter-rouge">source_language</code>?</li>
<li>What if the <code class="language-plaintext highlighter-rouge">source_language</code> is the same as the <code class="language-plaintext highlighter-rouge">target_language</code>?</li>
<li>What if the text isn’t in the <code class="language-plaintext highlighter-rouge">source_language</code>?</li>
</ul>
<p>Similarly, imagine our translation start-up was getting lots of traffic, so we decided we’d like to add a load balancer to support multiple back-ends to serve all our requests. This is now a whole new service that could have an outage; even if we can keep the back-ends always alive and ready to serve requests, that won’t matter if the load balancer is dead.</p>
<p>As the systems become more complex, it becomes more difficult for engineers to reason about all the edge cases and failure modes of the various components. That means engineers make more mistakes, and we have more problems. These problems can take many forms, from service degradations or downtime through to business logic errors (charging someone the wrong amount of money) or security issues (allowing someone to see something they shouldn’t be able to see).</p>
<p>It also becomes more difficult to fix things when they do break. As our systems grow more complex, we experience outages where multiple components fail at the same time, in a new and unexpected way, which can be notoriously hard to debug. Imagine our translation start-up had decided to add both a load balancing service and also an authentication service. Someone has called our support team saying that they are receiving 504 Gateway Timeout errors from our API. We now have to look at each service in turn to identify where the problem even is, before we can start thinking about resolving it.</p>
<h1 id="not-all-complexity-is-equal">Not all complexity is equal</h1>
<p>There are certain types of complexity that we should be particularly careful about.</p>
<h2 id="️-distributed-complexity">🕸️ Distributed Complexity</h2>
<p>Back to our start-up. We’ve decided that we want to introduce a two-tier pricing system. People on the cheaper tier (tier 1) can only translate English strings, but those on the more expensive tier (tier 2) can specify the <code class="language-plaintext highlighter-rouge">source_language</code>. We also want to allow tier 2 customers to access a new thesaurus endpoint which would return multiple alternatives for the provided word.</p>
<p>This doesn’t sound too difficult for us to implement. In the most naive approach, we would (in pseudo-code):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// In the translation endpoint
if (request.customer_tier == 1 && source_language != 'en') {
  return 'Forbidden'
}
// In the thesaurus endpoint
if (request.customer_tier == 1) {
  return 'Forbidden'
}
</code></pre></div></div>
<p>We now have two different places where the code needs to understand (and check) that we have multiple pricing tiers, resulting in <strong>distributed</strong> complexity.</p>
<p>The best way to manage this kind of risk is to isolate complexity into its own component and wrap it in a single clean interface. We could have a bit of code shared across all endpoints (often called <code class="language-plaintext highlighter-rouge">middleware</code>) whose job it is to check the customer’s tier and then apply a set of rules about whether they can take the action or not. This would mean that our complexity is no longer distributed: there is a single bit of our code which knows about this business logic, and no-one else needs to. Our <a href="TODO">permissions project</a> at GoCardless is a great case study of this kind of project.</p>
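<p>A sketch of what that isolation might look like (endpoint names and rules here are illustrative, not real product logic):</p>

```python
# All knowledge of pricing tiers lives in this one table; endpoints just ask.
RULES = {
    "translate": lambda req: req["customer_tier"] >= 2
                             or req.get("source_language", "en") == "en",
    "thesaurus": lambda req: req["customer_tier"] >= 2,
}

def authorise(endpoint, request):
    """The single place in the codebase that checks the customer's tier."""
    return "OK" if RULES[endpoint](request) else "Forbidden"

print(authorise("thesaurus", {"customer_tier": 1}))  # Forbidden
print(authorise("translate", {"customer_tier": 1}))  # OK (defaults to English)
```

<p>Adding a new tier, or a new restricted endpoint, now means editing one table rather than hunting down every scattered <code class="language-plaintext highlighter-rouge">if</code> statement.</p>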
<h2 id="️-multiplicative-complexity">✖️ Multiplicative Complexity</h2>
<p>Our start-up is using a third party to provide the thesaurus endpoint, but they aren’t licensed to operate in Germany. That means that even if a customer is on tier 2, we still can’t allow them to use the thesaurus endpoint if they’re using a German IP address.</p>
<p>Again, this doesn’t sound too difficult. We can get our load balancer to tell us which country any given request comes from. Then we just need to edit our logic above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// In the thesaurus endpoint
if (request.customer_tier == 1 || request.country == 'Germany') {
  return 'Forbidden'
}
</code></pre></div></div>
<p>When we combine that with the two-tier pricing system, we now have 4 different versions of the endpoint that we are supporting: German tier 1, German tier 2, Other tier 1, Other tier 2. This is <strong>multiplicative</strong>, and as anyone who’s heard the ‘rice and chessboard’ story knows: when you start multiplying numbers together they get real big, real fast. It’s easy to incrementally add more and more branches to endpoints like this, until they become really difficult to reason about and debug.</p>
<p>Isolating complexity can also help us here: if our code to manage the customer tiers was nicely separated from the country-specific restrictions, this multiplication becomes less problematic. If none of the rest of the code is worrying about which tier the customer is on, there are fewer scenarios to test and reason about when implementing the German restriction.</p>
<h2 id="️-opaque-complexity">🕶️ Opaque Complexity</h2>
<p>Our start-up has so much traffic that we’ve decided we can no longer scale our MySQL server to handle our data, so we’re moving to Cassandra. We’ve not used Cassandra before, but it’s a popular product and we’re not doing anything too unusual, so we’re confident we can manage it.</p>
<p>As experienced by <a href="https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on-july-29th">Monzo in 2019</a>, Cassandra won’t always behave in the way that you expect. The thing that makes this so dangerous is that the system is so complicated that it becomes <strong>opaque</strong>: unless you are a Cassandra expert it is very difficult to get debugging information about why your Cassandra database isn’t behaving as expected. Using 3rd party technologies is often the right decision (as opposed to building things yourself), but for highly complex and critical systems it carries significant risk.</p>
<p>One solution to this would be to build up more Cassandra expertise (either in-house or by hiring). An alternative approach would be to procure a managed version of the software (there are many to choose from), thus hopefully entrusting your system to a more experienced group of engineers for whom the software is less opaque. The other answer is to use more ‘out-of-the-box’ style technologies, which require less careful configuration and tuning, at least for as long as you can.</p>
<h1 id="managing-risk">Managing Risk</h1>
<p>As a product grows, the complexity (and therefore the risk) grows alongside. We can think about this as taking out debt, and using our talented engineers to pay the interest.
We pay the interest in two ways:</p>
<ol>
<li>It takes us longer to build and ship new features because we need to gain confidence that our systems work as expected.</li>
<li>Things go wrong in production, and it takes us time and money to resolve the issue and compensate the affected customers.</li>
</ol>
<p>Over time, paying off this interest can slow your product organisation almost to a complete halt. It’s important for engineers (particularly those in leadership roles) to be able to communicate clearly about the risks that the product is carrying, and identify potential projects to help reduce those risks (reducing the interest we need to pay).</p>
<p>We’ve already discussed some of them: isolating complexity and building expertise are both great strategies for reducing complexity. It’s also important to constantly trim unused or unstrategic functionality, which is ubiquitous across most start-ups. The functionality doesn’t appear to have a high maintenance cost, so sits untouched in the code base and slowly rots. If you have 10,000 customers and only 12 of them use the dictionary endpoint, unless it is deeply strategic, it’s a good candidate to be removed. Alternatively, it might be a configuration option that is rarely used or a workaround for a bug that’s since been fixed.</p>
<p>The engineering effort required here is usually pretty limited as it’s often simply deleting code, but there may be additional effort required to communicate the change to customers. Having a well-trodden path that makes removing functionality easy is an important part of keeping complexity under control. This is sometimes described as ‘weeding’ - removing the functionality you don’t want, to give space for new functionality to grow up in its place.</p>
<h1 id="testing-testing-1-2">Testing, Testing, 1, 2</h1>
<p>Projects to make changes like isolating complexity or removing unnecessary features can be quite scary. That fear can result in teams leaving lots of code untouched for years on end, for fear they will break someone’s integration. Tests are our best weapon in this particular battle. If we want to be able to safely and easily refactor and isolate complexity, having robust integration / end-to-end tests which demonstrate that the behaviour of the system remains unchanged is critical.</p>
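<p>A minimal characterisation-test sketch of the idea: pin down the current behaviour before refactoring, so the refactor can be verified against it. The <code class="language-plaintext highlighter-rouge">translate()</code> stub below is a stand-in for the real endpoint, with made-up behaviour:</p>

```python
# Characterisation tests: assert what the system does *today*, then refactor
# with confidence that any change in behaviour will be caught.
def translate(text, target_language, source_language="en"):
    if source_language == target_language:
        return {"status": 422, "error": "source equals target"}
    return {"status": 200, "translation": f"<{target_language}>{text}"}

def test_rejects_same_language():
    assert translate("hello", "en")["status"] == 422

def test_translates_happy_path():
    assert translate("hello", "fr")["status"] == 200

test_rejects_same_language()
test_translates_happy_path()
print("characterisation tests passed")
```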
<h1 id="wrap-up">Wrap-Up</h1>
<p>We cannot build useful products without complexity, so avoiding complexity altogether is clearly not a useful strategy. But we should view all new features as a trade-off between the value they provide and the risk they carry. It’s often worth questioning whether the desired customer outcome can be achieved in a less complicated way: perhaps by changing some existing business logic or changing the implementation details. We can use the strategies outlined above to manage our risk, allowing us to deliver value while keeping our systems robust and resilient. Happy coding!</p>

LeadDev Live - Don’t bring a knife to a gun fight - 2021-09-14 - https://paprikati.github.io/2021/09/14/selling-investment

<p>Engineers often complain that “we don’t do enough investment”. We leave code to rot over many years; accidents waiting to happen or mud that slows down our shipping speed.</p>
<p>Engineers go into battle with product managers, arguing for investment over new features, and we lose. Over and over again. And when you look at the different backgrounds and skill sets - perhaps that’s not surprising. As an ex-consultant myself, I can confirm that persuasion is the name of the game. And most engineers don’t get taught to persuade at all.</p>
<p>This isn’t sensible - unless the engineers are hugely over-estimating the benefit of doing the investment work, the organization isn’t making good choices. And the battle mentality is clearly counter-productive: driving barriers between two groups whose entire job description is about collaboration.</p>
<p>We need to fix this, and we can. This talk will help you understand and sell the value of investment work so that you can collaborate with non-engineers to get it prioritized.</p>
<p><img src="/assets/img/leaddevlive.png" width="800px" /></p>
<p><a href="https://leaddev.com/leaddev-live/dont-bring-knife-gun-fight-selling-value-investment" target="blank" class="call-to-action">Go to LeadDev Live</a></p>Lisa Karlin CurtisEngineers often complain that “we don’t do enough investment”. We leave code to rot over many years; accidents waiting to happen or mud that slows down our shipping speed.Breaking The Monolith (Bonus)2021-08-05T00:00:00+00:002021-08-05T00:00:00+00:00https://paprikati.github.io/2021/08/05/breaking-the-monolith-bonus<p>When breaking the GoCardless monolith, we found a really interesting problem caused by our new multi-service architecture. I couldn’t find a home for it in the <em>Breaking the Monolith</em> series, so it’s here in a little bonus blog.</p>
<h1 id="the-problem">The Problem</h1>
<p>GoCardless talks to banks, and banks (specifically Direct Debit schemes) run lots of daily processes. That means that GC also has many things that happen once a day, and does lots of batch processing in what we called a pipeline. Each collection currency has its own pipeline, and each pipeline has multiple steps. For the pipelines to be successful, we needed to be sure step 1 was completed before embarking on step 2 (and so on). We kept coming up against the same problem: if the monolith emits a bunch of events, how can it be sure that they’ve all been processed by some other system before it proceeds to the next step of the pipeline?</p>
<p><img src="/assets/img/btm_events1.png" width="800px" /></p>
<h1 id="solution-1-slos">Solution 1: SLOs</h1>
<p>We have a Service Level Objective (SLO) set up so that we are alerted if more than a tiny % of events are not processed within the required time frame (say 5 minutes). Perhaps we can rely on this SLO, and tell our pipeline code to wait for 5 minutes after emitting the events before moving on to step 2? This means we have to <em>really</em> trust that our other service will meet the SLO. We also have a new and scary failure scenario, which means we have to build more tooling. If there’s a service degradation and the engineer gets alerted 4.5 mins into our 5 minute wait period, they only have 30 seconds to stop the pipeline before it moves on to step 2 (and proceeds to break). So maybe we need a new component which monitors the SLO and then stops the pipeline running if the system looks like it’s degraded. That sounds … complicated.</p>
<p><img src="/assets/img/btm_events2.png" width="800px" /></p>
<p>It also sounds a bit risky. It puts a lot of pressure on our metrics code to precisely measure this SLO, which, given that we also want that code to be incredibly performant, is a bit concerning. This solution is also quite slow - we’re now waiting 5 minutes when the work will usually be completed in 1. But the real showstopper was: what happens if one of our events was in the 0.001% not covered by our SLO, so our alerts didn’t fire?</p>
<h1 id="solution-2-look-at-the-queue">Solution 2: Look at the Queue</h1>
<p>Another way of proving that all the events had been processed would be to look at the queue (i.e. the events waiting to be processed). If the queue is empty, then all the events must have been processed. For us, there were a couple of problems with this. Firstly, this would mean we had to have one queue for each pipeline (there are 8), and all other events would need to use a different queue. This would have expanded our number of Pub/Sub topics and subscriptions from 1 to at least 9, and again significantly increased the complexity of our system. The real problem, though, is that there wasn’t an easy way for us to query the queue - it’s not really something systems like Pub/Sub are set up to do. That tooling is all aimed at a <strong>metrics</strong> use-case, which provides different guarantees to the ones we were looking for.</p>
<p><img src="/assets/img/btm_events3.png" width="800px" /></p>
<h1 id="solution-3-event-counting">Solution 3: Event Counting</h1>
<p>When we start a pipeline we assign it a unique ID called a PipelineID. When the pipeline emits events, it can stamp them with a PipelineID. It can also count the total number of events that it has emitted, and store this number. Our service can process the events, and also count the number of events that it has processed against a particular PipelineID. This requires a bit of thinking about transactional guarantees (we have multiple event processors which could be simultaneously updating the counter) but it’s very achievable in most databases. Our service then exposes a synchronous endpoint which returns the associated event count for a given PipelineID. This means that the monolith can poll the service to get the <code class="language-plaintext highlighter-rouge">processed_event_count</code> and compare it to the <code class="language-plaintext highlighter-rouge">expected_event_count</code> that it stored when emitting the events. Once these numbers are the same, the pipeline can be 100% sure that all the events have been processed, and can continue to the next step.</p>
<p><img src="/assets/img/btm_events4.png" width="800px" /></p>
<p>It’s possible (but quite fiddly) to implement this without a significant performance penalty: this kind of counting is what databases are built to do. The polling might seem clunky, but it’s incredibly cheap and pretty simple to debug when things go wrong. There’s no additional infrastructure complexity, and we’re not twisting someone else’s technology to do something it’s not intended for.</p>
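<p>As a concrete illustration, here is a minimal sketch of the counting scheme in Python. The class and method names are hypothetical, and an in-memory dictionary stands in for the service’s database counter:</p>

```python
import time

class WebhookService:
    """In-memory stand-in for the downstream service's event counter."""

    def __init__(self):
        self.counts = {}   # pipeline_id -> processed_event_count
        self.seen = set()  # processed event ids, so redelivery is harmless

    def handle_event(self, event):
        if event["id"] in self.seen:  # at-least-once delivery: skip duplicates
            return
        self.seen.add(event["id"])
        # ... real event processing would happen here ...
        pid = event["pipeline_id"]
        self.counts[pid] = self.counts.get(pid, 0) + 1

    def processed_event_count(self, pipeline_id):
        return self.counts.get(pipeline_id, 0)

def wait_for_events(service, pipeline_id, expected_event_count,
                    timeout=300, poll_interval=5):
    """Poll the service until every emitted event has been processed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if service.processed_event_count(pipeline_id) == expected_event_count:
            return True  # safe to start the next pipeline step
        time.sleep(poll_interval)
    return False  # SLO in danger: page someone instead of proceeding
```

<p>The duplicate check matters: with at-least-once delivery, counting a redelivered event twice would make the processed count permanently disagree with the expected count.</p>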
<h1 id="happily-ever-after">Happily ever after?</h1>
<p>The truth is, while we’re content in an ‘engineer who solved a problem’ way with our solution, we’re not particularly happy with the overall architecture. It feels a bit like an antipattern: we’re having to force our multi-service architecture to do something unnatural. What we’d like is to be able to rely on SLOs, have genuine eventual consistency, and then have corrective actions which we can take if our services degrade. It’s OK that this isn’t perfect: in order to break the monolith, we have to be willing to make these kinds of pragmatic choices. Code doesn’t last forever, and it shouldn’t need to. If we’re too ideological, we’ll create too many cross-project dependencies and struggle to make progress breaking our monolith. So we’ve abandoned <em>purity</em>, and favoured <em>pragmatism</em> instead.</p>Lisa Karlin CurtisWhen breaking the GoCardless monolith, we found a really interesting problem caused by our new multi-service architecture. I couldn’t find a home for it in the Breaking the Monolith series, so it’s here in a little bonus blog.Breaking The Monolith III2021-07-16T00:00:00+00:002021-07-16T00:00:00+00:00https://paprikati.github.io/2021/07/16/breaking-the-monolith-3<p>One of the biggest decisions we need to make when breaking a monolith is: how will my services talk to each other? In a monolith, communication is synchronous by default; now that we have multiple services we need to decide what to use.</p>
<div class="aside">
This is post 3 of a series; I’d recommend reading at least the introduction of <a href="/2021/07/01/breaking-the-monolith-1.html">post 1</a> before continuing, as it explains the scenario.
</div>
<p>As discussed in <a href="/2021/07/09/breaking-the-monolith-2.html">post 2</a>, our webhook service needs to expose a synchronous endpoint to enable the monolith to create and change webhook endpoint configuration (using a 2-phase commit). This means that our monolith needs to be able to send synchronous messages to the webhook service.</p>
<p>We also need async communication: our monolith needs to emit <code class="language-plaintext highlighter-rouge">employee_created</code> events which our webhook service can subscribe to so that it is eventually consistent. This allows the monolith to serve requests even if our webhook service is down.</p>
<p>We want to choose two technologies, one for sync and one for async communication, which can be used throughout the product. This will allow teams to share tooling and make integrating new services with each other more straightforward, and also spread learnings across the engineering teams. The technologies involved are very flexible, so until we come across a problem that cannot be solved by the ones we’ve already chosen, we want to stick with them.</p>
<h1 id="synchronous-communication">Synchronous Communication</h1>
<p>The main choice here is between REST and RPC. RPC has different performance characteristics which may make it more appropriate for particular use cases, but make sure to consider the development experience including testing and debugging before committing to a particular technology. For us, it’s an easy choice to use REST. We wouldn’t benefit from the performance characteristics of RPC and we’re already familiar with the fantastic REST tooling from the OpenAPI project. Additionally, using RPC in a language that isn’t strongly typed is not an experience I’d like to repeat.</p>
<h1 id="asynchronous-communication">Asynchronous Communication</h1>
<p>The big name technologies for async communication are Kafka, AWS Kinesis, and Google Pub/Sub. They’re all very robust and while there are subtle functionality differences they will all solve most use cases. We choose Google Pub/Sub, because the rest of our architecture is built on Google Cloud Platform so we can plug into existing access management and infrastructure tooling.</p>
<div class="aside">
Terminology: Our monolith will publish <b>messages</b> to a <b>topic</b>. Our webhook service can then <b>subscribe</b> to that topic, to receive a stream of messages.
</div>
<p>Alongside Pub/Sub, to make our async communication infrastructure robust and resilient, we’ll need some more capabilities:</p>
<h2 id="-a-queryable-record-of-every-event-sent-to-a-particular-topic">❓ <strong>A queryable record</strong> of every event sent to a particular topic.</h2>
<p>This is very useful when trying to debug issues and is a prerequisite for being able to recover from outages (see below). We can set up a serverless function (using Google Dataflow) which also subscribes to our topic and pipes all the messages into a big-data store. We can then build a generic tool so that everyone using Pub/Sub can easily re-use our work, and engineers know where to find the record of events if needed.</p>
<h2 id="--the-ability-to-replay-history">📜 The ability to <strong>replay history</strong></h2>
<p>Imagine that our webhook service goes down and we need to restore it from a backup. We can run some code to pipe the data from our queryable record back onto the Pub/Sub topic since [x]. This will bring our webhook service back up-to-date. We don’t want to wait until something goes wrong to build this capability, as it’ll likely be part of a wider incident response, and we will be in a hurry. Having this ready-built, tested and documented is a big win.</p>
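<p>A sketch of what the replay tool might look like, assuming each archived record carries a publish timestamp (the record shape and <code>publish</code> callback are hypothetical):</p>

```python
def replay(records, publish, since):
    """Re-publish archived messages from the queryable record back onto
    the topic, starting from `since` (e.g. the backup's timestamp)."""
    for record in sorted(records, key=lambda r: r["published_at"]):
        if record["published_at"] >= since:
            publish(record["payload"])  # consumers must be idempotent
```

<p>Because replay necessarily redelivers messages, this only works if consumers are idempotent (see the note on idempotency below).</p>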
<h2 id="--the-ability-to-handle-unprocessable-events-so-that">❌ The ability to handle <strong>unprocessable events</strong> so that:</h2>
<ol>
<li>We are alerted <em>loudly</em> when an event cannot be processed (transient errors should be retried to avoid alert fatigue)</li>
<li>We are able to reprocess the event once the webhook service’s consumer code has been fixed so that the event is no longer unprocessable</li>
<li>We are able to ‘abandon’ the event (with manual intervention) once comfortable that the event cannot be processed (provided this is acceptable in the product context).</li>
</ol>
<h2 id="--the-ability-to-handle-misordered-messages">🔢 The ability to handle <strong>misordered messages</strong>.</h2>
<p>Imagine we decided to use asynchronous communication to handle updates to the webhook configuration. We still want the ‘last wins’ behaviour for our updates which we’d expect from a database. This has some advantages (e.g. our replay history tool would now work for these updates, as well as just events), but clear disadvantages (the API can’t know if the update has been successful, and we now have to solve this pesky message ordering issue).</p>
<p>There are a number of strategies to handle misordered messages, one of which is to use sequence numbers. For this, we’d need the monolith which is serving the API request to store an incrementing sequence number against the customer’s account which changes whenever the webhook configuration changes. The monolith could then stamp a <code class="language-plaintext highlighter-rouge">last_sequence_number</code> on each event to ensure that the webhook service processes them in order. If the most recently processed <code class="language-plaintext highlighter-rouge">sequence_number</code> does not match the <code class="language-plaintext highlighter-rouge">last_sequence_number</code>, then the webhook service knows that it missed a message and can temporarily reject this message. Our service now needs to alert when it has temporarily rejected a message too many times, and an engineer needs to be able to resolve the issue (either by re-emitting the event or manually incrementing the expected <code class="language-plaintext highlighter-rouge">sequence_number</code>).</p>
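<p>A minimal sketch of the consumer side of this scheme (field names hypothetical); out-of-order messages are rejected so the broker redelivers them later:</p>

```python
class WebhookConfigConsumer:
    """Enforces per-account ordering using producer-stamped sequence numbers."""

    def __init__(self):
        self.last_seq = {}  # account_id -> last processed sequence_number
        self.configs = {}

    def handle(self, message):
        account = message["account_id"]
        # The producer stamps each message with the sequence number of the
        # update it supersedes; 0 means "first ever update".
        if message["last_sequence_number"] != self.last_seq.get(account, 0):
            return "reject"  # nack so it's redelivered; alert after N rejections
        self.configs[account] = message["webhook_config"]
        self.last_seq[account] = message["sequence_number"]
        return "ack"
```

<p>Note the operational burden this implies: every persistent rejection needs an alert and a human decision, which is exactly the complexity the next paragraph chooses to sidestep.</p>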
<p>For us, this is complexity that we could do without (and isn’t adding much value) so we are going to avoid async messages where ordering is important and sidestep this concern.</p>
<h1 id="interfaces--schemas">Interfaces & Schemas</h1>
<p>We should treat internal interfaces with the same care and attention as a publicly exposed API. To make it easy for other teams to use our service, we need to build some tooling to help us.</p>
<h2 id="-1-build-a-schema">🧱 1. Build a Schema</h2>
<p>The centre of our tooling is a schema, which is a static declaration of what we expect our interface to receive (and, for two-way communication, return). We are using JSON payloads, both for our sync REST APIs and for our async messages, so we choose to use <a href="https://json-schema.org/">JSON schema</a>. This allows us to use tooling such as <a href="https://swagger.io/docs/specification/about/">OpenAPI</a> and <a href="https://www.asyncapi.com/">AsyncAPI</a>, which both use JSON schema to define their payload structure.</p>
<h2 id="-2-validate-payloads">✅ 2. Validate Payloads</h2>
<p>Our next step is to build some lightweight tooling to allow us to validate payloads against this schema. We want the schema to be accessible by multiple services, so we create a small library with a wrapper that allows services to validate a payload. Luckily, there are multiple libraries that will validate a payload against a JSON schema for us, so we just wrap that library (plus our schema) in a nice interface. It’s safest for us to validate the payload ‘on both sides’: i.e. both in the monolith and the webhook service. This means we can run tests on a single service while still being confident that the payloads are valid.</p>
<p>It’s important to validate payloads in production, not just in tests, so that you find edge cases quickly. It also allows the rest of your code to trust that the schema has been enforced, and avoids lots of ‘if this is blank then raise’ boilerplate. This is particularly important for async communication: as we’ve discussed, cleaning up bad async messages is really messy, so anything that gives us confidence that the messages are correct (even just structurally) is a good thing. This means that you can be confident that your schema is correct: it’s not just a point-in-time document but something that is validated in production every day.</p>
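<p>The wrapper’s interface might look something like the sketch below. In reality we’d delegate to a JSON schema validation library rather than hand-rolling the checks, and the field names here are illustrative:</p>

```python
SCHEMA = {
    # Trimmed illustration of what the shared schema might require.
    "required": {"event_type": str, "employee_id": str, "version": int},
}

class PayloadError(ValueError):
    pass

def validate_payload(payload, schema=SCHEMA):
    """Validate a payload on both sides (producer and consumer) before
    trusting it. A real implementation would delegate to a JSON schema
    library; this hand-rolled check just shows the wrapper's shape."""
    for field, expected_type in schema["required"].items():
        if field not in payload:
            raise PayloadError(f"missing field: {field}")
        if not isinstance(payload[field], expected_type):
            raise PayloadError(f"wrong type for field: {field}")
    # One possible policy: strip unrecognised fields rather than erroring,
    # so that adding a field to the schema is never a breaking change.
    return {field: payload[field] for field in schema["required"]}
```
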
<h2 id="-3-document-and-share">📄 3. Document and Share</h2>
<p>Now that we have a schema that is correct, we should generate some documentation so that other engineers can easily use our new interfaces. Again, there are a multitude of tools built on JSON schema that will generate a docsite from a schema, although this will only be as good as the descriptions and examples that you’ve written. Depending on the size of your team, you may want to include some narrative guides to accompany the raw payload structure documentation. We also need to host these docs somewhere that developers can find them: in our case we are using <a href="https://backstage.io/">Backstage</a> as a service catalogue and it makes sense for the interface docs to live there.</p>
<p>Simple client libraries are another useful tool to help our engineers be productive. We can add more features to our schema library so that we create a ‘Client Library’, just like we would for our public API. There’s great tooling for auto-generating OpenAPI client libraries, so we can simply use that. It generates code snippets, which also go into our Backstage documentation. This means that engineers from other teams can self-serve to use the interfaces we provide, and have confidence in their implementations.</p>
<h2 id="-4-breaking-changes">💔 4. Breaking Changes</h2>
<p>It’s impossible to define an interface on day 1 which will last forever. We know this: that’s why we stopped doing waterfall software projects. In this situation, a breaking change is any change to our interface which breaks an assumption made by another system (see <a href="/2021/05/31/euruko.html">How to Stop Breaking Other People’s Things</a> for a more detailed definition).</p>
<p>We want to know up front how we will handle breaking changes so that we can build both the tech and the processes to manage them. This comes back to one of our key costs of microservices: in a monolith you can deploy changes to multiple components simultaneously, as they belong to the same deployable unit. Once we split into services, everything has to be nicely backwards compatible to avoid downtime.</p>
<p>Breaking changes are a pain, and so of course the easiest path is to avoid them entirely. For example, we want to make sure we can add a new field to a payload without breaking everyone’s stuff. Our validation logic is nicely centralised, so we can simply ensure that the library ignores (and perhaps strips out) any unrecognised fields, rather than erroring. However, some changes will always be considered breaking (e.g. removing a field).</p>
<p>To support breaking changes, we need to add a version number to our schema. All events will be emitted with a <code class="language-plaintext highlighter-rouge">version</code> tag so that consumers know which schema should be applied. We also need our docs (and our schema library) to support multiple concurrent versions. We then need to define a process of ‘how do we make a breaking change’, both technically and organisationally.</p>
<p>In HR-Land, we used to have a boolean flag <code class="language-plaintext highlighter-rouge">is_contractor</code> which we used to identify contractors (as opposed to full time employees). We’ve now learned that we need more granular detail and have decided to add a new enum called <code class="language-plaintext highlighter-rouge">employment_type</code> (which wasn’t a breaking change). However, we don’t want to leave the boolean flag lying around as we can’t reliably derive it from our <code class="language-plaintext highlighter-rouge">employment_type</code>, and so any consumers that are using it might make bad choices.</p>
<p>Firstly, we release a new version of the schema. This should flow through to our docs (so the latest docs don’t have the <code class="language-plaintext highlighter-rouge">is_contractor</code> flag) and our library (so we release a new version that includes both the old schema and the new one).</p>
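<p>Inside the library, supporting both versions concurrently can be as simple as keying schemas by version tag (a sketch; the field lists are abbreviated):</p>

```python
SCHEMAS = {
    1: {"required": ["event_type", "employee_id", "is_contractor"]},
    2: {"required": ["event_type", "employee_id", "employment_type"]},
}

def schema_for(event):
    """Look up the schema named by the event's version tag."""
    try:
        return SCHEMAS[event["version"]]
    except KeyError:
        # Unknown (or missing) version: fail loudly rather than guess.
        raise ValueError(f"unsupported schema version: {event.get('version')}")
```
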
<p>Then, we give other teams a heads up that the field is being removed on a particular date, and ask them to upgrade to the newest version of the library and adapt their code so it can handle payloads both with and without the <code class="language-plaintext highlighter-rouge">is_contractor</code> flag.</p>
<p>On the pre-agreed date, our team starts sending messages based on the new schema (provided we’ve checked that none of the critical functionality will be impacted). This means that if a team hasn’t made the change in time, their service will start to error as the schema has changed. If these teams are consumers of async communication such as events, this may not be a problem: perhaps that service is deprecated or it’s an edge case which isn’t a priority for them.</p>
<p>There are clearly organisational challenges here as it requires co-ordinated work across multiple teams. That co-ordination isn’t cheap or easy, and it’s where you’ll find much of the hidden cost of multiple services.</p>
<h1 id="a-final-note-on-idempotency">A Final Note on Idempotency</h1>
<p>For the love of all that is holy, <strong>make everything idempotent</strong> or you will be very sad. When a request is idempotent, it means that if the caller makes the same request multiple times, the receiver will action the request exactly once, and return a consistent response. Without idempotency, many of our transactional guarantees crumble into dust and you’ll have weird and hard-to-debug edge cases appearing all over the place. It also means things like being able to replay history just won’t work correctly as an instruction might be incorrectly actioned a second time. Some technologies are quite good at exactly-once delivery, but only guarantee at-least-once delivery. We’ve chosen to explicitly check that our async consumers are idempotent by deliberately resending instructions and ensuring the results are as expected.</p>Lisa Karlin CurtisOne of the biggest decisions we need to make when breaking a monolith is: how will my services talk to each other? In a monolith, communication is synchronous by default; now that we have multiple services we need to decide what to use.Breaking The Monolith II2021-07-09T00:00:00+00:002021-07-09T00:00:00+00:00https://paprikati.github.io/2021/07/09/breaking-the-monolith-2<p>Transactional guarantees are one of the hardest challenges when breaking a monolith. They’re the key reason I’d recommend staying in a monolith as long as it is practicable.</p>
<div class="aside">
This is post 2 of a series; I’d recommend reading at least the introduction of <a href="/2021/07/01/breaking-the-monolith-1.html">post 1</a> before continuing, as it explains the scenario.
</div>
<p>We’ve decided to extract our webhook sending component into its own service. This component will be responsible for:</p>
<ol>
<li>Storing the webhook configurations for each customer (i.e. whether they receive webhooks, and to what endpoint)</li>
<li>Receiving <em>events</em> from the core system, grouping them into webhooks, and sending them</li>
<li>Storing the webhooks that we’ve sent so that customers can view them in the UI to help debug, and resend webhooks that they failed to process</li>
</ol>
<p>When we build a monolith on a big relational database, we take transactional guarantees for granted. We can view them as a kind of guard rail; developers use them so that they can avoid thinking about failure modes within a particular bit of logic. Breaking the monolith will require us to relinquish some of these guarantees.</p>
<h1 id="spot-the-transaction">Spot the Transaction</h1>
<p>If we don’t know where the transactions are, it’s hard to reason about what to do with them. Depending on the structure of our code, and how isolated the component already is, this may be quite a challenge. If so, it might be worth isolating the component within the monolith before trying to extract it.</p>
<p>Some transactional guarantees are there for convenience, but can be easily relinquished if needed. For example, we currently have a single endpoint which allows a customer to update their system configuration. This includes things like their colour scheme and logo, but also includes their webhook endpoint URL and secrets. Currently, the <code class="language-plaintext highlighter-rouge">update</code> call to change their colour scheme is wrapped in the same transaction as the webhook endpoint configuration. Maintaining this behaviour would be expensive (as we will see below), so instead we’ve decided to relinquish the guarantees here all together by splitting the API into two separate endpoints (one to update colour and logo, another to update the webhook configuration).</p>
<p>It’s helpful for us to make these changes up front so that we can confirm the behaviour and find any rough edges before we start worrying about multiple services. The more ruthless we can be with these transactions, the easier our journey to multiple services will become. We must think critically about the guarantees we currently provide to internal and external stakeholders, and robustly question how valuable they are. Any guarantees we want to keep will incur significant costs, so they had better be value for money.</p>
<h1 id="two-phase-commit-2pc">Two Phase Commit (2pc)</h1>
<p>Two Phase Commit is a common strategy employed by many databases to give transactional guarantees across multiple servers. This is what’s happening under the hood with all those clever multi-node databases. It uses a ‘prepare’ phase (which passes all the data and validates it) and a ‘commit’ or ‘rollback’ phase, giving the ‘all or nothing’ atomicity we expect from a database. It’s very robust, but is comparatively slow and relies on lots of synchronous communication.</p>
<p><img src="/assets/img/btm_2pc.png" width="400px" /></p>
<p>Depending on how we use 2pc, we run the risk of undermining some of the benefits we are hoping to get from breaking the monolith.</p>
<p>We want to use this so that we are really sure that every time a new employee is created, we send a webhook to the customer. We want to keep our transactional guarantee here because if we don’t send the webhook, it’s possible that someone won’t get added to the payroll which would be a really bad outcome for our customer.</p>
<p>A simple implementation of our use case might look something like:</p>
<ol>
<li>Open transaction with the main HR-World database</li>
<li>Create the new employee in the HR-World database table <code class="language-plaintext highlighter-rouge">employees</code></li>
<li>Send a <code class="language-plaintext highlighter-rouge">prepare</code> message to the webhook service with an <code class="language-plaintext highlighter-rouge">employee_created</code> event, and wait for <code class="language-plaintext highlighter-rouge">ack</code></li>
<li>Commit change in the HR-World database</li>
<li>Send <code class="language-plaintext highlighter-rouge">commit</code> message to webhook service to make sure the webhook gets sent (and really really hope that this doesn’t fail)</li>
</ol>
<p>The first problem that jumps out is what happens if step 4 or step 5 fails. To give us confidence in this process, we must either:</p>
<ul>
<li>Be comfortable with one commit succeeding and the other failing, at least temporarily (see eventual consistency below)</li>
<li>Be 100% certain that one of the commits won’t fail (perhaps provided we retry it a few times)</li>
</ul>
<p>Both of these are non-trivial, and create new failure modes which engineers have to understand and reason about.</p>
<p>This implementation also means that we are holding a transaction open with the main HR-World database while sending a message to our webhook service. This is considered an antipattern as the call to another service is likely to be (comparatively) slow, and we’re now making our HR-World database hold open a transaction for all of that time. Given that one of our concerns driving this project is relieving the pressure on the HR-World database, this could make our problems worse rather than better.</p>
<p>Another challenge is that if our webhook service is unable to accept requests, our monolith now cannot successfully create an employee; it will have to return an error as our <code class="language-plaintext highlighter-rouge">prepare</code> message will fail. This undermines the ‘blast radius reduction’ benefit of splitting the services, as they are still too interdependent.</p>
<h1 id="eventual-consistency">Eventual Consistency</h1>
<p>Eventual consistency was the big brain idea of distributed systems architecture: what if we knew the event store would <em>eventually</em> catch up, even if it didn’t immediately update alongside our monolith? This is how async database replicas work, which might be used for analytics queries or periodic backups. Our webhook service is a good candidate for this. We don’t need the webhooks to be instantly created - we wait for a few events and group them together anyway, so the customer won’t notice if the event isn’t created instantly in our system provided the overall latency remains low. However, we’re still afraid of the situation where we create an employee but <em>never</em> send a webhook; this is something we need to avoid.</p>
<p>Enter the transactional outbox. For this, we need an <code class="language-plaintext highlighter-rouge">outbox</code> table in our monolith’s database, which can be accessed within a transaction alongside other tables in that database. This gives us the ‘all or nothing’ guarantee that we expect from a transaction. We then need a process that pulls data from that table, and publishes it to our message broker. Provided that process retries until it succeeds, we’ve now ensured that our message will be sent, without making a call outside our service during the transaction.</p>
<p><img src="/assets/img/btm_outbox.png" width="1000px" /></p>
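<p>A minimal sketch of the outbox pattern, using an in-memory SQLite database to stand in for the monolith’s database (table and column names are illustrative):</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_employee(conn, employee):
    # One local transaction covers both writes: all or nothing.
    with conn:
        conn.execute("INSERT INTO employees VALUES (?, ?)",
                     (employee["id"], employee["name"]))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"event": "employee_created",
                                  "employee_id": employee["id"]}),))

def relay(conn, publish):
    """Separate process: drain unpublished rows onto the message broker."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # downstream consumers must be idempotent
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                         (row_id,))
```

<p>The relay marks a row as published only after a successful publish, so a crash between the two steps can re-send a message: this gives at-least-once delivery, and is another reason consumers must be idempotent.</p>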
<p>There’s still a failure mode here. Imagine we created an employee record with a particularly long address. The <code class="language-plaintext highlighter-rouge">employee_created</code> event was inserted into the outbox, but when it was sent to the webhook service the service returned an error as it was unable to process this event. We either need to be able to roll back the action in the monolith (which would be surprising at best, and dangerous at worst) or to resolve whatever issue is preventing the event store from processing the message. Particularly if the product executes actions that cannot be reversed (e.g. moving money), this is something to consider carefully.</p>
<p>Our resolution path is likely to be either</p>
<ol>
<li>Alter the code in the webhook service so that it accepts longer addresses, or truncates them</li>
<li>Alter the monolith code so that it truncates the address when creating the event, <strong>and edit the existing incorrect event</strong></li>
</ol>
<p>Option 2 should probably be setting off alarm bells: changing history like this is pretty scary and dangerous. For example, perhaps other services are also consuming these events. Do we have to go back and change them too? What happens if we now replay the events? Does something assume that events with the same idempotency key will be identical? It’s likely that you can’t do this even if you want to: at this point the message is probably in a queue somewhere which isn’t mutable. Option 1 is almost always the right solution, which means we need to be rigorous to ensure the events that are inserted into the outbox are valid.</p>
<h2 id="how-long-is-eventually">How Long is Eventually?</h2>
<p>One of the aims of our project is to reduce the latency on webhooks. To do this, we first need to measure the latency (using a metrics system like <a href="https://prometheus.io/">prometheus</a>). We also define an SLO (service level objective) specifying that 99.9% of webhooks should be sent within 5 minutes. Once we have the metrics, we can alert (and respond) if we look to be in danger of missing our SLO. It’s useful to think in advance about what tools and options we have to manage a degradation so that when it inevitably occurs we are somewhat prepared.</p>
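<p>The check itself is simple arithmetic. In production the ratio would come from the metrics system rather than from raw samples, but the sketch below shows the calculation for a ‘99.9% within 5 minutes’ objective:</p>

```python
def slo_compliance(latencies_s, threshold_s=300):
    """Fraction of webhooks delivered within the threshold (5 minutes)."""
    if not latencies_s:
        return 1.0  # no traffic means nothing late
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s)

def breaching_slo(latencies_s, objective=0.999, threshold_s=300):
    # Alert when compliance drops below the 99.9% objective.
    return slo_compliance(latencies_s, threshold_s) < objective
```
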
<h1 id="stateless-services">Stateless Services</h1>
<p>We can sidestep many of these consistency problems if we can make a service <em>stateless</em> (i.e. without any persistent storage). These are sometimes described as ‘Serverless architectures’, which I consider a bit misleading but hey-ho. It’s unlikely we’ll be able to run the whole application in this way (unless it’s a fundamentally stateless product), but extracting chunks of business logic into these structures can be very worthwhile.</p>
<p>Looking at our requirements again:</p>
<ol>
<li>Storing the webhook configurations for each customer (i.e. whether they receive webhooks, and to what endpoint)</li>
<li>Receiving <em>events</em> from the core system, grouping them into webhooks, and sending them</li>
<li>Storing the webhooks that we’ve sent so that customers can view them in the UI to help debug, and resend webhooks that they failed to process</li>
</ol>
<p>It would be possible to make a stateless webhook service but all it could do would be to deliver requirement 2: we couldn’t store the webhook configuration or record what we sent. Requirements 1 and 3 would still need to be solved by the monolith, which would now have to expose new APIs to enable the webhook service to gather configuration and store webhooks.</p>
<p>In the long term, it might make sense to extract either the ‘grouping events into webhooks’ or the ‘actually sending a webhook’ part of our webhook service into something stateless which we could horizontally scale. However, our current focus is on removing pressure from the primary database by breaking up the monolith, so stateless services (which, having no storage of their own, do little to take pressure off the primary database) aren’t something we’d want to prioritise.</p>
<p><em>Next Up: <a href="/2021/07/16/breaking-the-monolith-3.html">Inter-Service Communication</a>.</em></p>Lisa Karlin CurtisTransactional guarantees are one of the hardest challenges when breaking a monolith. They’re the key reason I’d recommend staying in a monolith as long as it is practicable.Breaking The Monolith I2021-07-01T00:00:00+00:002021-07-01T00:00:00+00:00https://paprikati.github.io/2021/07/01/breaking-the-monolith-1<p>It is a truth universally acknowledged that a scale-up in possession of a monolith must be in want of microservices.</p>
<p>This series of blog posts will tell the story of a fictional startup on its journey to break the monolith. It’s informed by a combination of my experience at GoCardless, and talking and reading about similar journeys in many other startups. In this first post, we’ll outline an approach to breaking up a monolith, and then some key considerations to make the migration a success.</p>
<h1 id="the-scenario">The Scenario</h1>
<p>We are a high growth startup which runs an HR platform for other businesses called HR-World. It currently runs on a monolith powered by a single Postgres database, which is starting to get a bit unwieldy. We’ve got 70 engineers split into various teams working on a combination of greenfield functionality and enhancing what we already have.</p>
<h1 id="monolith-or-microservices">Monolith or Microservices</h1>
<p>Monoliths are awesome. Having one has allowed us to develop new product features really quickly and test them easily. Microservices might be shiny, but they are hard work and incur a lot of overhead for engineers. In simple terms, they are more complex than a monolith, and complexity = risk. As with all interesting engineering decisions, we recognise that there’s a trade-off.</p>
<h2 id="the-cost-of-a-monolith">The Cost of a Monolith</h2>
<p>🏘️ <strong>If everyone owns something, no-one does</strong>. We’ve started having ownership problems, resulting in important tasks like framework upgrades becoming hard to prioritise. When something goes wrong, it takes time to work out which team needs to be informed or alerted. That has led to alert fatigue as one corner of our product isn’t very well maintained at the moment and causes lots of noise. We’re starting to see signs of ‘broken windows’ where small bugs are left to rot as no-one can agree who owns them.</p>
<p>💥 The <strong>blast radius</strong> of each incident in our product is becoming a problem: one team’s error can bring down the entire product, which requires everyone to be as cautious as if they owned the most business critical component. This is slowing down feature development and new joiner onboarding.</p>
<p>📈 Depending on data shape and storage engine, there may be <strong>scaling limits</strong> on the underlying database technology. The rate of growth in the amount of RAM required to prevent us having to do disk reads is unsustainable; we’ll soon be unable to get a big enough box. Recovery from incidents has become really slow due to data size: big databases take longer to restore, and there’s not a whole lot we can do to change that.</p>
<h2 id="the-cost-of-microservices">The Cost of Microservices</h2>
<p>We recognise that splitting our monolith won’t magically solve all of our problems.</p>
<p>💬 <strong>Cross-service interfaces</strong> are hard to define, and need to be changed in backwards compatible ways, adding a lot of developer time to each interface change (e.g. 1 PR to rename a field becomes 3 PRs). This can lead to cultural pressure to avoid these changes, leaving fossilised interfaces.</p>
<p>🤝 <strong>Transactional guarantees</strong> across services can be computationally expensive and difficult to reason about and debug. Consistency (which is almost free in a monolith) becomes challenging, although this is a cost of <em>scale</em>, not just architecture: similar challenges come from partitioning a monolith database.</p>
<p>🧑‍💻 <strong>Developer experience</strong> is likely to suffer as it’s much harder to develop, test and debug code when it is split across multiple services.</p>
<h1 id="its-all-about-timing">It’s all about timing</h1>
<p>There are some red flags which have helped us decide that it’s time to start breaking apart the monolith.</p>
<ul>
<li>Restoring a database backup (which we test every week) now takes 5 hours, and the platform team don’t think they can optimise it further. So if we lose our database in an incident, we will have a total blackout for 5 hours while we recover the data.</li>
<li>We had a severe service degradation where our whole platform slowed down because a team was running a backfill to change how we store addresses. The backfill wasn’t rate limited, which caused our synchronous database replica to fall behind, slowing down all <code class="language-plaintext highlighter-rouge">write</code> requests across the system.</li>
<li>There’s an increasing number of out-of-date dependencies and we rely on a handful of long-tenured individuals to keep them up-to-date, even though we’ve invested in tooling to make it easy and safe.</li>
<li>Lots of support tickets are being left without an owner as they bounce between different teams, who are spending more time arguing about ownership than the time it would take to fix the ticket.</li>
<li>A new team that’s starting a greenfield project about rewards isn’t delivering at the speed they’d hoped. When we investigated further, it became clear that a lot of their time was being spent putting in guardrails to ensure that their experimentation didn’t risk impacting the core product.</li>
</ul>
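<p>The backfill incident above is a good example of a missing guardrail: processing in bounded batches with a pause between them gives a synchronous replica time to keep up. A hedged sketch of that pattern (the batch size, pause interval and <code class="language-plaintext highlighter-rouge">update_batch</code> helper are hypothetical):</p>

```python
import time

BATCH_SIZE = 1000
PAUSE_SECONDS = 0.5  # illustrative; tune against observed replica lag


def run_backfill(row_ids, update_batch, pause=time.sleep):
    """Apply update_batch to rows in bounded batches, pausing between
    batches so a synchronous replica can keep up instead of falling
    behind and slowing down every write in the system."""
    for start in range(0, len(row_ids), BATCH_SIZE):
        update_batch(row_ids[start:start + BATCH_SIZE])
        pause(PAUSE_SECONDS)
```

<p>A fancier version would adjust the pause dynamically by polling replication lag, but even a fixed sleep would have avoided the degradation we saw.</p>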
<h1 id="our-first-service">Our first service</h1>
<p>To break a monolith, we have to start by creating our first service. Our plan is to start with a single component and complete that extraction before starting to work on multiple components in parallel across different teams. This incremental approach will help gather the learnings and build the infrastructure before we try to repeat the process many times over.</p>
<p>Another advantage of this approach is that it allows us to protect most of the engineering team from the inevitable teething problems of microservices, as this would both burn development time and the social capital of our ‘breaking the monolith’ project. We want to create a blueprint (with some lessons learned) which can be shared across the organisation and allow multiple teams to start extracting components.</p>
<p>Choosing the first component is important: there’s a Goldilocks Zone where the component is not so large and coupled that it becomes an impossible task, but not so small and isolated that we won’t learn anything from extracting it.</p>
<p>Also important here is the team which owns the component: ideally it should be a team that is working well together, has a reasonably limited maintenance load (so they’re unlikely to get distracted), and is interested (and ideally has some experience) in the underlying infrastructure.</p>
<p>We’ve decided that our first new service will manage our webhook sending code. When an employee changes their personal information in HR-World, we send a webhook (if configured) so that their employer can update it on their other systems (e.g. payroll). We also send webhooks whenever an employee joins and leaves the company, or changes role. We believe this component is in the Goldilocks Zone in terms of size and coupling. It is also at the edge of our service: there are no downstream dependencies so it’s easy to imagine how to extract it from the core system. The team that owns the integrator-facing part of our product is well established as a high-performing team, and one of the members has been helping drive the ‘breaking the monolith’ initiative within engineering.</p>
<h1 id="business-logic-changes">Business Logic Changes</h1>
<p>The team has a number of improvements they’d like to make to the webhooks, including:</p>
<ul>
<li>Allowing integrators to choose which webhooks they receive</li>
<li>Batching the webhooks more effectively so we send fewer, bigger webhooks</li>
<li>Reducing the average latency (from when the employee takes an action, to when the webhook is sent) which currently spikes when the database is under heavy load.</li>
</ul>
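<p>The batching improvement could be sketched as grouping pending events by destination endpoint, so each customer gets one webhook per flush rather than one per event. A minimal illustration (the event shape and field names are assumptions):</p>

```python
from collections import defaultdict


def batch_events(pending_events):
    """Group pending events into one webhook payload per endpoint,
    so we send fewer, bigger webhooks."""
    batches = defaultdict(list)
    for event in pending_events:
        batches[event["endpoint"]].append(event["payload"])
    return [{"endpoint": ep, "events": evs} for ep, evs in batches.items()]
```

<p>In practice the flush would also be bounded by payload size and a maximum delay, since the batching goal trades off directly against the latency goal in the list above.</p>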
<p>It is tempting to use this as an opportunity to completely rewrite our webhook sending application code and alter the business logic. This has a number of benefits:</p>
<ul>
<li>Coupling the technical re-architecture with more tangible and immediate business value can make the project easier to prioritise, and can act as a forcing function to push the organisation to release the new service (sometimes a service can sit in ‘shadow mode’ for weeks or even months if it’s not urgently required).</li>
<li>It’s a great opportunity to reduce the complexity of the component in areas where it isn’t providing value. We currently support multiple authentication mechanisms on our webhooks, but we built that to support a particular customer who has since churned. Maybe we can remove that functionality? Perhaps we can also drop that legacy code which we haven’t used in 2 years?</li>
<li>It may be required as it may be expensive or impossible to reproduce the current logic while isolating it into its own service (e.g. if the current design relies on cross-component transactions).</li>
</ul>
<p>There are also drawbacks to this approach:</p>
<ul>
<li>We’ve got a long list of features we’d like to add; if we try to build all of them into this project, it will take multiple quarters to complete. That will delay the learnings from the microservice extraction itself, as we won’t get to it until we’re done messing with the business logic.</li>
<li>The team’s focus is now split between the business logic changes and the service deployment changes, and they may not focus enough on the deployment side to gather learnings effectively.</li>
<li>It can be hard to evaluate the performance and correctness of the microservice if it cannot be easily compared to the existing code within the monolith.</li>
</ul>
<p>Incremental extraction can help us balance these concerns.</p>
<h1 id="incremental-extraction">Incremental Extraction</h1>
<p><img src="/assets/img/btm_microservice.png" width="800px" /></p>
<p><strong>Step 1:</strong> Create an isolated component inside the monolith, making any required / desired business logic changes. In our case, we’re going to focus on the webhook batching and latency as they are closely connected to the deployment mechanism, and the highest priority items. This allows the team to focus on (and test) those changes, reasoning about them in the context of ‘how can we extract this to its own service’. It’s important that we avoid implementing anything which can’t be supported across services at this stage or it’ll make the extraction much more painful. Once this is deployed, the team will have found all the interfaces and hidden dependencies / coupling in the system, which will be invaluable when extracting the service. We can take some time here to measure the performance of the system - latencies, average payload size etc. This will act as a baseline for when we extract it into its own service.</p>
<p>If the component is already highly isolated and decoupled with clearly defined interfaces (inside the monolith), we’ve essentially done the first step at some point in the past (yay) and can skip to step 2.</p>
<p><strong>Step 2:</strong> Extract the component into its own service. This is a great example of ‘do the hard thing to make the change easy’. We’ve just loaded the entire webhook component into the proverbial RAM of the team, and given ourselves some time for ideas about how to architect the new service to mature. Everyone is now aligned with what the challenges are for this extraction, and we can focus on the most difficult problems related to the new deployment structure. We should also take this opportunity to build reusable tools to help solve common microservices challenges (e.g. inter-service communication, transactional guarantees, tracing etc.). We’ll talk more about these in the next two posts - for the purposes of this story let’s assume that everything went swimmingly.</p>
<h3 id="go-go-go">Go, Go, Go!</h3>
<p>Once we have our blueprint, we can then start to parallelise our efforts by having multiple teams working on extracting different services.</p>
<p><img src="/assets/img/btm_multiservice.png" width="800px" /></p>
<p><em>Next Up: <a href="/2021/07/09/breaking-the-monolith-2.html">Transactional Guarantees</a>.</em></p>Lisa Karlin CurtisIt is a truth universally acknowledged that a scale-up in possession of a monolith must be in want of microservices.Reflections from GoCardless2021-06-10T00:00:00+00:002021-06-10T00:00:00+00:00https://paprikati.github.io/2021/06/10/leaving-gocardless<p>This is the first time in my career that I’ve left a company I really liked, which was quite a strange experience. I wanted to share some of the things I’ve learned along the way. I hope to turn many of these into full blog posts, so consider this the bitesize version.</p>
<h1 id="context-vs-skill">Context vs Skill</h1>
<p>When you join a new company, you need two things to progress - context and skill.</p>
<p><strong>Context:</strong> how do the systems work in this company / industry, and why.</p>
<p><strong>Skill:</strong> the transferable stuff: problem solving, design and architecture, communication etc.</p>
<p>I’ve found separating these concepts very useful, both for combating imposter syndrome
(that mistake was because I was missing <em>context</em> not <em>skill</em>) and also for building effective teams
(if you don’t have enough context or skill in a team, you’ll struggle to succeed).
It’s also a good way of prioritising your own work: in the first few months you’ll be focussing on context,
while probably picking up some skill on the way.
You can then pivot towards being more skill focussed (while of course still gathering context),
at which point you can be considered ‘ramped’. </p>
<h1 id="find-great-people-and-stick-to-them-like-glue">Find great people, and stick to them like glue</h1>
<p>The best way to learn (for me anyway) is from other people who have context or skill that you don’t.
Strive to never be the smartest person in the room - find the people with skills you admire and try to
find opportunities to collaborate and learn from them.
Also ask them for frequent feedback - and try to be really open even if you think they’re being unfair.
Although it stings, there’s probably some truth in what they are saying that it’s worth you reflecting on.</p>
<h1 id="learn-from-other-peoples-mistakes">Learn from other people’s mistakes</h1>
<p>Engineers learn a lot from making mistakes, which means your progression is rate limited by the speed at which you can ship things to customers
(and therefore the speed at which you can find out about the things you’ve got wrong).
You can accelerate this process by talking to other great people, and learning from their mistakes.
Everyone loves telling war stories, so take someone out for a coffee and ask for their ‘most memorable outages’.
You can also use other sources such as post mortems (internal and external), and podcasts (e.g. <a href="https://downtimeproject.com/">The Downtime Project</a>).</p>
<h1 id="incidents-are-a-great-way-of-learning">Incidents are a great way of learning</h1>
<p>Incidents (i.e. when something goes wrong) are a great opportunity to gather both context and skill.
If it’s possible to get involved, either in real time or after the fact, then do.
At first you’ll just be an observer (write down questions so you can ask someone after the immediate danger has passed).
Then as you gather context, you’ll progress to contributing:
initially in small ways (e.g. maybe this dashboard will help) all the way up to running your own incident response.
You’ll get tonnes of context, get to know people from across the org, and learn how to build resilient and reliable systems.
It’s a key differentiator between <em>good</em> and <em>great</em> engineers.</p>
<h1 id="volunteer-for-things-to-make-your-own-opportunities">Volunteer for things to make your own opportunities</h1>
<p>It’s very rare that an organisation will push back if you offer to do something useful.
Of course there are some cases where this is hard (e.g. cultural resistance to a particular change,
or if you’re not delivering on your core role) but many people will have the leeway to try to push things they care about.
Use this as a lever to help you gather context or skill that you don’t have, and collaborate with people you want to learn from.</p>
<h1 id="harness-the-power-of-new-joiners">Harness the power of new joiners</h1>
<p>New joiners are awesome because they haven’t learned to <em>accept</em> the things that are broken or bad about your systems and processes.
They’ve also probably come from a place that did something better than you.
Use this knowledge and energy to your advantage - particularly for a fast-growing company it’s a huge potential source of good things.</p>
<h1 id="complexity--risk">Complexity = Risk</h1>
<p>I’ll definitely write a whole blog about this one day.</p>
<p><strong><em>TLDR</em></strong> complexity => risk => incidents => sad.</p>
<p>Complexity can be good if it helps your product be valuable; that doesn’t make it not risky, it just makes the risk worthwhile.
Fighting complexity is all about looking at the trade offs of the complexity you’re carrying vs. the product value it delivers.
This work often takes the form of removing tech debt or legacy features which can be unglamorous but is also really valuable.</p>
<h1 id="your-data-is-important">Your data is important</h1>
<p>If you can’t archive your data to a place where the sun doesn’t shine (which at GC, for various product reasons, we can’t),
then it matters that it stays in a good state.
We’ve been tripped up a few times due to ‘old, bad data’ that no-one ever cleared up.
Whenever we leave data around like this, we are forcing future developers to
(a) know this weird thing happened (they won’t) and
(b) know how to interpret this weird thing (they don’t).
Cleaning up data is often a faff, but it’ll pay off in the long term.</p>Lisa Karlin CurtisThis is the first time in my career that I’ve left a company I really liked, which was quite a strange experience. I wanted to share some of the things I’ve learned along the way. I hope to turn many of these into full blog posts, so consider this the bitesize version.Euruko2021-05-31T00:00:00+00:002021-05-31T00:00:00+00:00https://paprikati.github.io/2021/05/31/euruko<p>Breaking changes are sad. We’ve all been there; someone else changes their API in a way you weren’t expecting, and now you have a live-ops incident you need to fix urgently to get your software working again. Of course, many of us are on the other side too: we build APIs that other people’s software relies on. This talk hopes to help us understand when we might break someone else’s things, and how to mitigate the impact when it happens.</p>
<iframe width="800" height="460" src="https://www.youtube.com/embed/eEFcS_cmusQ?start=120" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>Lisa Karlin CurtisBreaking changes are sad. We’ve all been there; someone else changes their API in a way you weren’t expecting, and now you have a live-ops incident you need to fix urgently to get your software working again. Of course, many of us are on the other side too: we build APIs that other people’s software relies on. This talk hopes to help us understand when we might break someone else’s things, and how to mitigate the impact when it happens.Managing Cutovers2020-12-30T00:00:00+00:002020-12-30T00:00:00+00:00https://paprikati.github.io/2020/12/30/cutovers<p>As software engineers, sometimes we need to release significant code changes. And sometimes, we have to do it in a ‘big bang’. I’ve seen a few of these, both in an enterprise context (with weekend-long cutovers managed by teams of 20) and more recently at GoCardless. Here are some things I’ve learned along the way.</p>
<p>I am defining a <em>cutover</em> as releasing any software change in a ‘big bang’ fashion (i.e. it will impact all your customers at once). Cutovers generally represent points of high risk in software: stuff rarely breaks when you’re not changing something (scaling issues aside), and if you are forced to release something to everyone at once, then if stuff breaks it’s likely to have a significant impact. While this post has advice about how to de-risk such events, the best plan is to avoid them altogether. In modern multi-tenant systems, there are many ways of slowly applying a change (feature flags etc.) which can help avoid the big bang, but assuming you have no other options, here are some things to consider.</p>
<h1 id="1-design">1. Design</h1>
<p>When designing the ‘to-be state’ of your system, it’s worth taking time to think about how you can get from here to there (i.e. thinking about the cutover). This shouldn’t fundamentally alter your design (that feels like the tail wagging the dog), but there are likely to be quick wins which help the cutover without cannibalising the design. There’s just no need to make your life any more difficult than it already is. It’s also worth thinking about potential intermediate steps at this early stage, which can have other benefits in terms of incrementally releasing code and learning more about its behaviour before the ‘big bang’ day.</p>
<h1 id="2-planning">2. Planning</h1>
<p>To be honest, successful cutovers are almost all about planning (the whole blog post could be titled planning), but these are some specific things to think about.</p>
<h2 id="-pre-mortem">📄 Pre-Mortem</h2>
<blockquote>
<p>What could possibly go wrong?</p>
</blockquote>
<p>So called ‘pre-mortems’ are a really useful exercise: they help identify the high risk parts of a change and trigger great conversations about possible mitigations. They also help everyone to gain a shared understanding of what people are scared of and why, which will help make good decisions going forwards.</p>
<h2 id="-timing">🕒 Timing</h2>
<blockquote>
<p>When should we do this?</p>
</blockquote>
<p>Consider when you will have the right people in the office in case of emergencies (cutovers over Christmas can be scary) but also your customers: it might be that an issue on your side results in your customer having to take some manual action. They’ll be much happier being told that at 11am on a Tuesday than 5pm on a Friday, or on Christmas Eve after their code freeze is in place. Of course you also want to consider usage patterns to avoid peaks. The riskiest thing you can do is to combine lots of cutovers together (which happens a <em>lot</em> in enterprises due to the release management process). It’s better to make two changes on different days where one isn’t quite at the perfect time, than decide that there is a perfect cutover slot once per quarter and increase your risk by making all your changes at once. This also goes back to our discussion above: making changes incrementally is often a lot safer.</p>
<h2 id="-communication">💬 Communication</h2>
<blockquote>
<p>Who do we need to tell?</p>
</blockquote>
<p>Some cutovers might only involve a couple of people (perhaps a routine infrastructure upgrade) while others might involve a variety of teams across the company, and need lots of external comms to prepare customers (e.g. docker introducing rate limits). This is very much ‘horses for courses’ - you need to consider the likelihood of failure, the impact of those potential failure scenarios and of course whether customers will notice a behaviour change. Even if you decide not to pre-warn customers, give your support and customer success teams as much information as possible about what’s happening. It’s effort up front but if you’re in the middle of an incident and the support team can handle customers’ questions without your input, you’ll be grateful.</p>
<h2 id="-measurement">📏 Measurement</h2>
<blockquote>
<p>What does success look like?</p>
</blockquote>
<p>Having a clear understanding of the metrics for success will make the end of the cutover smoother - perhaps there are some SLAs you can monitor or tests that you can run. There’s nothing more embarrassing than sending the ‘we cutover and everything was fine’ email, just to find a couple of hours later that something was in fact broken. It’s also useful to identify issues quickly, particularly in a multi-stage cutover where it might be easier to resolve the issue if you find it straight away.</p>
<h1 id="3-pre-work">3. Pre-Work</h1>
<p>As well as communicating, there’s lots of pre-work you can do in the run up to a cutover. The aim of the game here is to de-risk the cutover, so anything you can do before the day itself is another thing ticked off the list (and another thing that can’t go wrong). Examples of this include:</p>
<ul>
<li>Creating relevant rows in a new table (provided the data won’t change).</li>
<li>Dry-running whatever you can in a test environment.</li>
<li>Developing dashboards or other tools that will help monitor your progress.</li>
<li>Running queries to identify edge cases that may need manual intervention (think about what assumptions your migration code makes, and find the outliers).</li>
</ul>
<h1 id="4-on-the-day">4. On the day</h1>
<p>Again, the key here is planning.</p>
<ul>
<li>Give yourselves <strong>plenty of time and space</strong>. Don’t rush unless you have to, and try to avoid multi-tasking where possible.</li>
<li><strong>Have a runbook</strong> that includes everything that needs to get done on the day, and try to keep it up-to-date if the plan changes.</li>
<li><strong>Focus on communicating</strong> so that everyone knows what the current status is. This will also help resolve issues if they arise.</li>
<li>Identify clear <strong>roles and responsibilities</strong>, ideally including a ‘commander’ co-ordinating the whole thing.</li>
<li>Use <strong>go/no-go checklists</strong> to help discussions about whether to proceed with a change.</li>
<li>Understand the various <strong>rollback points</strong> (and have a strategy, ideally that you’ve tested). Also call out the ‘<strong>point of no return</strong>’ if there is one.</li>
</ul>
<h1 id="5-aftercare--post-go-live-support-pgls">5. Aftercare / Post Go-Live Support (PGLS)</h1>
<p>The highest risk of a cutover is usually just after you’ve made the change - surviving <em>first contact</em> with production is hard - so after-care is important, both in the short term (hours) and the medium term (days or weeks).</p>
<ul>
<li><strong>Stay alert</strong>: use your observability stack to assess general system health, and have someone with detailed knowledge of the change available to help triage any new issues.</li>
<li><strong>Review success indicators</strong>: use the metrics you determined in your planning phase to understand if the change is working as expected.</li>
<li><strong>Post mortem</strong>: review the cutover as a team, identify lessons learned (and any actions to mitigate in future). You may also want to gather feedback from other stakeholders (e.g. support or customers) as inputs to this.</li>
</ul>Lisa Karlin CurtisAs software engineers, sometimes we need to release significant code changes. And sometimes, we have to do it in a ‘big bang’. I’ve seen a few of these, both in an enterprise context (with weekend-long cutovers managed by teams of 20) and more recently at GoCardless. Here are some things I’ve learned along the way.