Traditionally, there have been many ways of releasing applications, both to the different test environments and to production itself. With serverless applications, however, traditional strategies may not transfer so well due to the characteristics of those architectures. For applications that live in containers running inside clusters, many different release strategies can apply. In this post we will focus solely on a serverless release strategy that follows a ‘promoting on success’ model.
It is worth noting that this post is focused on the AWS Cloud platform; however, feel free to adapt certain aspects to different providers where it applies. Additionally, this article will not make any assumptions about CI/CD tool choices. We will stay at a medium/high level with regard to the AWS components and the pipeline flow.
Setting the scene
Picture the scene; we’ve all been there. A demonstration you’re leading is due to start in a few minutes, and some engineers on the team, in an effort to finish their ticket, merge some last-minute changes into master before it kicks off. Luck isn’t on your side today: the pipeline notifies you that the newly deployed version of the application has just failed its post-deploy tests and is unstable. If you’re lucky, the pipeline tells you in time – giving you a few minutes to quickly run out the door and leave the demonstration to someone else. Some aren’t so lucky to have the luxury of being told by the pipeline, and will go on with the demonstration to the inevitable ‘NullPointerException’ (or any runtime exception equivalent) when running a simple scenario. Now you have to face the music, tell all of the stakeholders at the review that it won’t happen again and promise to show them next time – when in reality, it probably will happen again, because it was human error that caused it in the first place.
Comparison between deployment strategies
There have been many ways in which engineers have attempted to maintain stable builds in their environments; some of the most popular nowadays are Blue Green deployments and Canary releases. Although both are normally used to support production releases, they can also be used in non-production environments. The entire idea behind them is to route certain portions of traffic to the new version of the application and to roll back if a specific threshold of issues arises. This is what makes these strategies effective and secure: they are great at reducing and handling the risk when anything goes wrong, for container-based application deployments as well as serverless ones. That being said, although both strategies work well, depending on the stack and environment setup they can be quite complex and tiresome to set up. For serverless architectures, you could argue that the effort of a Canary deployment strategy isn’t really warranted in environments other than production.
This is where the ‘promote on success’ strategy comes in – or, as others know it, Blue Green. Instead of the traditional deploy, verify and then roll back, we instead deploy, verify and promote. With this flow we ensure that the running environment itself is never down due to a rollback. It also gives you a better idea of which versions of the code made it into the environment.
Blue Green & Canary
The Blue Green & Canary strategies typically involve the same steps: deploying the application, verifying it works as expected and then switching the traffic over to the new version. Some would argue the only real difference is that with Canary deployments you redirect a portion of real user traffic to the new version while the rest continues to hit the old version, whereas with Blue Green it is more of a big-bang effect where you switch all traffic to the new version at once, as soon as verification is completed.
For serverless deployments, it is up to engineers themselves which one they want to use for their production deployments. However, it is fairly safe to assume you wouldn’t want to use the Canary releasing strategy in your test environments, mainly because of the overhead it would create, with the additional point that there is no real user traffic to sample outside production. The Blue Green strategy does seem more sensible for our non-production environments.
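For context, if you did want a Canary in production, AWS Lambda supports it natively through weighted alias routing. The sketch below shows the mechanics using boto3-style calls; the client is injected and the function, alias and version names are placeholders, not values from this article:

```python
def shift_canary_traffic(lambda_client, function_name, alias_name, new_version, weight):
    """Send `weight` (0.0-1.0) of an alias's traffic to a canary version.

    `lambda_client` is assumed to behave like boto3's Lambda client. The
    alias keeps pointing at the current stable version as its primary
    target; the routing config diverts a fraction of invocations to the
    new version.
    """
    # Look up the stable version the alias currently points to.
    alias = lambda_client.get_alias(FunctionName=function_name, Name=alias_name)
    return lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias_name,
        FunctionVersion=alias["FunctionVersion"],  # stable version stays primary
        RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
    )
```

Once the canary looks healthy, you would repoint the alias fully (set `FunctionVersion` to the new version and clear the routing config) – which is exactly the overhead that makes this feel unwarranted in test environments.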
Promoting on Success
If you think of when a person is promoted, they are promoted because they have proven that they can do the job – in most cases. The promoting on success strategy is no different: we only promote the application version if it has proven that it has passed all of the assurances beforehand. The Blue Green deployment strategy offers the same assurances, and one of its main benefits is the separation between application versions. The reason I have called it ‘promoting on success’ is to really give it that “exactly what it says on the tin” feel, as Blue / Green doesn’t really give you much information from its name.

Upon deployment of a Lambda, AWS creates a new version (a completely immutable snapshot of the code) each time that you publish the function. To read more on Lambda versions, the AWS documentation has some great guides. By default, IaC frameworks like Serverless will automatically publish a new version of the Lambda every time you deploy it to AWS. The benefit of using versions is that there is a track history of amendments to the Lambda that you can retrospectively fall back to if things were to go wrong. You would do this by using versions and aliases.

Aliases are another piece of functionality within AWS Lambda that is extremely useful when dealing with Lambda versions, and for the ‘promote on success’ strategy they are one of the key components. They are essentially pointers to specific versions of a Lambda. For example, if an alias called ‘prod’ points to newly published version 45 of a Lambda and, for some reason, people are reporting new bugs, you can quickly ‘roll back’ by pointing the ‘prod’ alias to version 44 until all bugs have been fixed – the fix then being version 46. Although that sounds controlled and smooth, this is exactly what the ‘promote on success’ strategy is there to avoid. At no point should you need to roll back to a previous version.
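The version-and-alias mechanics above can be illustrated with a small in-memory model. To be clear, this is a toy simulation, not the AWS API – it just shows that publishing creates immutable versions and that aliases are movable pointers to them:

```python
class ToyLambda:
    """A toy, in-memory model of Lambda versions and aliases.

    Purely illustrative - NOT the AWS API. Publishing snapshots the code
    as a new immutable version; aliases are movable pointers to versions.
    """

    def __init__(self):
        self.versions = {}      # version number -> code snapshot
        self.aliases = {}       # alias name -> version number
        self._next_version = 1

    def publish(self, code):
        # Each publish creates a new immutable version of the code.
        version = self._next_version
        self.versions[version] = code
        self._next_version += 1
        return version

    def point_alias(self, alias, version):
        # Repointing an alias is the 'promotion' - or, in the failure
        # case this strategy avoids, the rollback.
        if version not in self.versions:
            raise ValueError(f"no such version: {version}")
        self.aliases[alias] = version

    def invoke(self, alias):
        # Invoking via an alias runs whatever version it points to.
        return self.versions[self.aliases[alias]]
```

With this model, the rollback scenario above is just `point_alias("prod", 44)` after a bad version 45 – a one-pointer change, which is why aliases make both rollback and promotion cheap.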
If your tests are thorough enough, the version that gets promoted into the environment should not break anything.
To implement ‘promote on success’, follow the steps below. The order in which you do them will matter to a degree due to dependencies; however, staying at a high level, this is what it includes:
- Deploy the new version of the Lambda
- Deploy an API Gateway that always points to the $LATEST version of the Lambda, e.g. API Gateway URL: http://www.api.example.com/application-name/latest
- Run test suites against the URL of the $LATEST API Gateway – this should run against the newly deployed version of the Lambda
- If the tests pass, use a Lambda alias (this can be named after the environment type, e.g. preprod) to promote the $LATEST version of the Lambda
- Deploy another API Gateway that always points to the environment alias of the Lambda, e.g. http://www.api.example.com/application-name/preprod
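The deploy–verify–promote core of the steps above might be scripted roughly like this. It is a sketch using boto3-style calls: the `lambda_client` is assumed to behave like boto3’s Lambda client, and `run_tests` is a placeholder callable standing in for whatever runs your test suites against the $LATEST endpoint:

```python
def deploy_and_promote(lambda_client, function_name, env_alias, zip_bytes, run_tests):
    """Sketch of the 'promote on success' flow.

    Deploys new code, tests it via $LATEST, and only moves the
    environment alias if the tests pass.
    """
    # 1. Deploy the new code to $LATEST and publish an immutable version.
    #    (A real pipeline should wait for the update to complete first,
    #    e.g. with boto3's 'function_updated_v2' waiter.)
    lambda_client.update_function_code(FunctionName=function_name, ZipFile=zip_bytes)
    published = lambda_client.publish_version(FunctionName=function_name)
    new_version = published["Version"]

    # 2. Verify: run the test suites against the $LATEST API Gateway URL.
    if not run_tests():
        # No rollback needed: the environment alias still points at the
        # last stable version, so the environment is untouched.
        return None

    # 3. Promote: point the environment alias at the version that passed.
    lambda_client.update_alias(
        FunctionName=function_name, Name=env_alias, FunctionVersion=new_version
    )
    return new_version
```

Note that the failure branch simply stops: there is nothing to roll back, because the alias-backed endpoint never saw the new version.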
Following the above steps, you will achieve a ‘promote on success’ deployment strategy. What we are ensuring by following this is:
- There is always an endpoint that directly exposes the latest version of the application Lambda at any given time.
- There is always an endpoint that directly exposes the latest stable version of a specific application Lambda as the API Gateway endpoint is pointing to the alias itself. For example, the preprod endpoint will always be pointing to the most stable preprod version of the application Lambda.
- By promoting the Lambda versions on test success, we ensure that whatever gets promoted through the pipeline lifecycle has passed the quality assurances we have set in place – whatever they may be. This is an important point as the entire strategy relies on the fact that there has been an investment into the thoroughness and effectiveness of the testing. After all, this is what defines your “stable” environment.
As described above, you now have a pipeline that deploys the application Lambda to specified environments only on test success, ensuring a more stable environment that will not fall over during demos. This removes the traditional rollback-on-failure approach, where you’re adding additional steps that run if things go wrong – steps which, of course, can go wrong themselves.
It’s worth mentioning that once you create and deploy both of the API Gateways, there will be no need to redeploy them – unless, of course, they change. The new version of the Lambda will be the only component that gets updated with each pipeline run. How you integrate the above steps into your pipeline is up to you; you can do it all with CloudFormation or with bash scripts – whichever suits your approved technology stack. I have found that this works pretty well in my experience, as you can even tie the promoted versions into release note automation. Additionally, with this simple approach, you get a choice of what type of testing goes on in the verification stage, increasing the stability and security of the successful promotion.
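As an illustration of why the alias-backed gateway never needs redeploying, here is a hypothetical CloudFormation fragment. The resource names (`PreprodApi`, `PreprodResource`, `ApplicationLambda`) and the `preprod` alias are placeholders; the key detail is that the integration URI appends the alias to the function ARN:

```yaml
# Sketch only - resource and alias names are placeholders.
# Because the integration URI ends in ':preprod', the gateway always
# invokes whichever version the 'preprod' alias currently points to,
# so promotion never requires touching the gateway itself.
PreprodMethod:
  Type: AWS::ApiGateway::Method
  Properties:
    RestApiId: !Ref PreprodApi
    ResourceId: !Ref PreprodResource
    HttpMethod: ANY
    AuthorizationType: NONE
    Integration:
      Type: AWS_PROXY
      IntegrationHttpMethod: POST
      Uri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ApplicationLambda.Arn}:preprod/invocations
```

A second, near-identical method pointing at the bare function ARN (i.e. $LATEST) gives you the ‘latest’ endpoint from the steps above.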
Not everything is perfect; it’s worth mentioning that there is a possible drawback to this approach, and that is race conditions. If you have an application that is constantly being contributed to, with hundreds of commits every hour – depending on the complexity of the solution – you may suffer race conditions. More specifically, if two or more pipelines are running at the same time, the latest version of the Lambda may not be the version that a specific pipeline is concerned with, because another concurrent pipeline has just deployed a new version. There are CI tools now that allow you to block parallel builds to stop these race conditions from occurring, but even so, it is wise to be wary in these situations.
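One mitigation worth sketching: have each pipeline remember the version number it published and refuse to promote if anything newer exists. This narrows the window but does not eliminate it (the check itself can race), which is why blocking parallel builds remains the safer option. Again, boto3-style calls with placeholder names:

```python
def safe_promote(lambda_client, function_name, env_alias, my_version):
    """Promote `my_version` only if no concurrent pipeline has published
    a newer version since our tests ran.

    Note: this check is itself subject to a (much smaller) race window;
    serialising builds in the CI tool is more robust.
    """
    # list_versions_by_function is paginated in the real API; a
    # production script should page through all results.
    response = lambda_client.list_versions_by_function(FunctionName=function_name)
    newest = max(
        int(v["Version"]) for v in response["Versions"] if v["Version"] != "$LATEST"
    )
    if newest != int(my_version):
        # Someone published after us: our test results described a
        # $LATEST that no longer exists, so do not promote.
        return False
    lambda_client.update_alias(
        FunctionName=function_name, Name=env_alias, FunctionVersion=str(my_version)
    )
    return True
```

The failing branch leaves the environment alias on its last stable version, which is consistent with the rest of the strategy: never promote anything your tests did not actually exercise.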
As always, if you have any questions, find me on the socials.