By Chris Hall, Senior Consultant at Nimble Approach.

This solution is implemented using AWS, provisioned with Terraform, and configured via Ansible – but the principles broadly apply.

Many organisations rely on legacy services. These systems are often business-critical, yet because they generally just work, they receive little attention, so the teams that manage them end up unfamiliar with their inner workings.

Sooner or later, a moment arrives when everything needs to be upgraded at once: the instance type, the data volume, the operating system, and the service version itself. This creates a high-stakes scenario.

Here’s a recent example of such an upgrade that we carried out for a customer on AWS. The system in question is provisioned and configured using Terraform and Ansible.

As you read on, see if you can work out which critical system was involved.

The mysterious legacy service, running on EC2

The Problem: Live Upgrades are Hard

Performing upgrades like these on live, business-critical instances is a challenge. There are often no development or test environments to practise on. Furthermore, upgrading the service package can have unknown effects on its persistent data.

While the existing codebase, built with tools like Terraform and Ansible, may be functional for day-to-day operations, it often doesn’t support a safe way to carry out a major upgrade.

In the system we’re working with here, a client relies on JNLP to interact with the instance being upgraded. Keep this in mind – it becomes important later!

The Full Picture

Ideation: Finding a Path Forward

To reduce risk, a blue-green release strategy was chosen. This approach involves running two parallel production services: “green” for the legacy service and “blue” for the new one. It provides a method to test the new service in parallel and then switch live traffic over to it when ready.

Green to blue cutover via a toggle

One option considered was to run parallel instances of just the component being upgraded. The challenge then is deciding how to direct traffic between the two versions.

Using a custom HTTP header to route requests during the upgrade might seem like a potential solution – AWS Application Load Balancer (ALB) listener rules can match on custom header values – but it was ultimately not viable. Every client would need explicit configuration to send the header. More critically, this method would not work for the system’s JNLP traffic, which is not HTTP at all and is outside the scope of what an ALB can handle.

The better approach was to duplicate the entire stack. This leaves the current live service untouched while keeping complexity to a minimum, and it allows the existing code to be refactored to support two stacks with a toggle between them.

Parallel deployments before cutover

The Solution: A Modern Approach with Infrastructure as Code

The first step was to refactor the existing Terraform code into a reusable module. Using the count meta-argument at the module level provides the ability to dynamically create and destroy entire stacks as needed.
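
As a minimal sketch (the enable_green / enable_blue flags and the ./modules/stack path are illustrative, not the customer’s actual code), the layout looks something like this:

```hcl
# Sketch only: variable names, module path, and inputs are hypothetical.
# count on a module block requires Terraform 0.13 or later.

variable "enable_green" {
  type    = bool
  default = true
}

variable "enable_blue" {
  type    = bool
  default = false
}

module "green" {
  source = "./modules/stack"
  count  = var.enable_green ? 1 : 0

  colour = "green"
  # ...instance type, data volume, security groups, etc.
}

module "blue" {
  source = "./modules/stack"
  count  = var.enable_blue ? 1 : 0

  colour = "blue"
  # ...
}
```

Flipping either flag to false and applying tears down that entire stack, which is what makes the decommissioning step later a one-line change.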

The toggle itself is controlled by a variable within Terraform. This determines the DNS record and certificate, allowing the live state to be switched between the green and blue stacks. This toggling process takes only about four minutes to plan and apply.
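
As a sketch of the DNS side (the lb_dns_name and lb_zone_id outputs are assumed outputs of the hypothetical stack module, and live_stack is the toggle variable declared in the next snippet):

```hcl
# Select whichever stack is currently live; one() yields the single
# module instance, or null if that stack has been disabled.
locals {
  live = var.live_stack == "green" ? one(module.green[*]) : one(module.blue[*])
}

resource "aws_route53_record" "service" {
  zone_id = var.zone_id            # hosted zone ID, supplied elsewhere
  name    = "service.example.com"  # placeholder domain
  type    = "A"

  alias {
    name                   = local.live.lb_dns_name
    zone_id                = local.live.lb_zone_id
    evaluate_target_health = true
  }
}
```

The certificate attachment follows the same pattern, keyed off the same variable.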

To ensure safety, a validation condition was added to the Terraform variables, making it impossible for both stacks to be live at the same time.
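
One way to express that guarantee is a single enum-style variable with a validation block – because only one value can ever be selected, both stacks can never be live at once (a sketch, assuming the live_stack variable used above):

```hcl
variable "live_stack" {
  type        = string
  description = "Which stack serves live traffic. Exactly one of: green, blue."
  default     = "green"

  validation {
    condition     = contains(["green", "blue"], var.live_stack)
    error_message = "live_stack must be exactly \"green\" or \"blue\", never both."
  }
}
```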

Ansible was also enhanced. By adding an IsLive tag to the infrastructure, Ansible playbooks could safely target the non-live instance by default. Applying configuration to the live service became a deliberate action, rather than an accident.
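
A sketch of how that targeting can work, assuming the amazon.aws.aws_ec2 dynamic inventory plugin and an IsLive tag of "true" or "false" set by Terraform (the region and group prefix are illustrative):

```yaml
# aws_ec2.yml - dynamic inventory sketch
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-2
keyed_groups:
  # Produces groups such as islive_true and islive_false
  - key: tags.IsLive
    prefix: islive
```

Playbooks can then default to hosts: islive_false, so applying configuration to the live instance requires an explicit, deliberate override.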

Terraform Toggles with Validation

Toggle Time!

Once the new blue service was tested and ready, a pull request was merged and applied with Terraform to make the switch.
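
The merged change itself can be as small as one line (a sketch, reusing the hypothetical live_stack variable from earlier):

```hcl
# terraform.tfvars - the entire cutover
live_stack = "blue"
```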

After the toggle was complete and the new service was confirmed to be stable, the old green environment was decommissioned by changing a single variable in Terraform.
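
With the module layout sketched earlier, that single variable is the old stack’s enable flag; the subsequent apply destroys every resource in the green stack:

```hcl
# terraform.tfvars - decommission the old green stack
enable_green = false
```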

This immediately helps reduce operating costs.

Best of all, this entire process is repeatable. The next time an upgrade is needed, a new green stack can be spun up and the pattern can be run again.

This robust pattern was used to upgrade a common but often mysterious service that many organisations rely on: Jenkins.

Post-cutover with a shiny new version of Jenkins

Author’s Bio

Chris Hall is a Senior Consultant at Nimble Approach, with many years of experience as an amazing Platform Engineer. If you’ve had the pleasure of working with Chris, then you know he leaves any team, system, or business in a better place than it was before.