The Hurdles of Auto Scaling an Application to Reduce Costs

Hello everyone,

We are going to share our experience enabling auto scaling in a cloud-designed application. A series of problems surfaced along the way, and we have seen the same issues on several projects from different clients. We hope you can learn from these common mistakes and save time when deploying your next application. I’m also curious to hear what your approach is in this matter.

Important: This post focuses on a case using Azure App Service, but it applies to any environment where you can scale a cloud application or service, automatically or manually.

About the Project

First, some context about the project. It was a complex backend application written in C#, and its architecture was designed with scaling in mind. It consisted of frontend and backend App Service servers, each hosting several WebJobs. A series of queues carrying different message types was used to communicate between the WebJobs and between the frontend and the backend. Messages that failed processing were retried a number of times and then moved to poison queues.

With this architecture, the application was already running in multiple regions with multiple servers, and each region ran on App Service with a considerable number of instances.
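To make the setup more concrete, here is a minimal sketch of what one of these queue-triggered WebJobs could look like using the Azure WebJobs SDK queue bindings. The queue names (`orders-step1`, `orders-step2`) and the function itself are illustrative assumptions, not the project’s actual code.

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class OrderFunctions
{
    // Runs whenever a message appears on the "orders-step1" queue.
    // If processing throws, the WebJobs SDK retries the message; after the
    // configured maximum dequeue count it is moved to "orders-step1-poison".
    public static void ProcessOrder(
        [QueueTrigger("orders-step1")] string message,
        [Queue("orders-step2")] out string nextMessage,
        ILogger logger)
    {
        logger.LogInformation("Processing message: {Message}", message);

        // ... business logic for this step ...

        // Hand the work off to the next step of the pipeline.
        nextMessage = message;
    }
}
```

Each step in the pipeline followed this pattern: consume from one queue, do its work, and enqueue a message for the next step.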

The Proposal

The application was working as expected, and a new objective was defined: reduce costs. To achieve this, we needed to make each region capable of adapting its number of instances to demand.

App Service provides two scaling modes:

* Automatic
* Manual

Automatic mode scales instances based on CPU load. This can be useful for a frontend or for CPU-intensive tasks, but not for our case: our load was driven by the number of messages in the queues. We decided to use the manual option. Using the API that manages the number of instances, we implemented custom scaling logic: the algorithm scaled up when queue depth crossed the thresholds we defined for each instance count (see the sketch after the table below).

| Queue Depth | Number of Instances |
| ----------- | ------------------- |
| 10          | 2                   |
| 40          | 4                   |
| 80          | 8                   |
| 200         | 12                  |
| 500         | 16                  |
| 1000        | 20                  |
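For illustration, here is a minimal sketch of how the queue depth can be read and mapped to a target instance count following the table above. The thresholds come from the table; the connection string, queue name, and the use of `Azure.Storage.Queues` here are assumptions for the example, not the project’s actual code.

```csharp
using Azure.Storage.Queues;

public static class ScaleTargets
{
    // Thresholds from the table above: minimum queue depth => target instance count.
    private static readonly (int Depth, int Instances)[] Thresholds =
    {
        (1000, 20), (500, 16), (200, 12), (80, 8), (40, 4), (10, 2),
    };

    // Reads the approximate number of messages waiting in a storage queue.
    public static int GetQueueDepth(string connectionString, string queueName)
    {
        var client = new QueueClient(connectionString, queueName);
        return client.GetProperties().Value.ApproximateMessagesCount;
    }

    // Maps the current queue depth to the desired number of instances.
    public static int GetTargetInstanceCount(int queueDepth)
    {
        foreach (var (depth, instances) in Thresholds)
        {
            if (queueDepth >= depth)
            {
                return instances;
            }
        }
        return 1; // Minimum footprint when the queue is (almost) empty.
    }
}
```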

To decrease the number of instances, the algorithm waits a few minutes in case the load surges again. This is because spinning up a new server usually takes over 90 seconds; we don’t want to scale down only to scale up again a minute later.
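A rough sketch of that scale-down delay is shown below, reusing `GetTargetInstanceCount` from the previous sketch. `IInstanceScaler` stands in for whatever wraps the App Service management API (it is a hypothetical abstraction, not an Azure type), and the five-minute cooldown is an illustrative value.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical abstraction over the App Service management API.
public interface IInstanceScaler
{
    Task<int> GetCurrentInstanceCountAsync();
    Task SetInstanceCountAsync(int count);
}

public class ScalingLoop
{
    private static readonly TimeSpan ScaleDownCooldown = TimeSpan.FromMinutes(5);

    private readonly IInstanceScaler _scaler;
    private DateTimeOffset _scaleDownRequestedAt = DateTimeOffset.MinValue;

    public ScalingLoop(IInstanceScaler scaler) => _scaler = scaler;

    // Called periodically (e.g. once a minute) with the current queue depth.
    public async Task EvaluateAsync(int queueDepth)
    {
        int current = await _scaler.GetCurrentInstanceCountAsync();
        int target = ScaleTargets.GetTargetInstanceCount(queueDepth);

        if (target > current)
        {
            // Scale up immediately: the queue is growing.
            _scaleDownRequestedAt = DateTimeOffset.MinValue;
            await _scaler.SetInstanceCountAsync(target);
        }
        else if (target < current)
        {
            // Scale down only after the target has stayed below the current
            // count for the whole cooldown window, in case the load surges back.
            if (_scaleDownRequestedAt == DateTimeOffset.MinValue)
            {
                _scaleDownRequestedAt = DateTimeOffset.UtcNow;
            }
            else if (DateTimeOffset.UtcNow - _scaleDownRequestedAt >= ScaleDownCooldown)
            {
                await _scaler.SetInstanceCountAsync(target);
                _scaleDownRequestedAt = DateTimeOffset.MinValue;
            }
        }
        else
        {
            // Load is stable at the current size; reset the cooldown timer.
            _scaleDownRequestedAt = DateTimeOffset.MinValue;
        }
    }
}
```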

The Problem

A few days after these changes were deployed to production, we started to notice a couple of strange errors.

At first, while debugging some errors in the logs, we found cases where, following the chain of log messages for a request, the chain was broken: it suddenly stopped at a random place. We knew this was an issue but couldn’t find the reason.

Then, after a few weeks, we found a couple of requests whose message status was inconsistent: a message appeared to have been added to the queue for the next step but retained the status of not processed. Upon further inspection, we found the same random stop in the logging messages. This was now a severe issue, because we had to guarantee that 100% of queue messages were processed correctly, or at least send an error notification to the client.

The Solution

After a couple of days of reading and investigating what could cause these sudden interruptions, we found that App Service holds logs that were not present in the Application Insights logs. We started correlating the log event dates with the auto scaling activity and found that the interruptions occurred right after an instance was scaled down.

Once we identified the root cause, we could start attacking the issue. We found that we couldn’t choose which server instance would be shut down when scaling down, but we could implement a graceful shutdown. A graceful shutdown means that App Service sends a signal to the processes running on the instance that is about to be shut down. We added custom logic with semaphores to prevent the WebJobs from starting to process new queue messages once the shutdown signal was received. Since enqueueing a message and changing its status in the database were not atomic operations, we had to prevent any WebJob from performing them after the graceful shutdown signal. This way we ensured no data remained inconsistent.
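As an illustration of that gate, here is a minimal sketch. It assumes the WebJobs SDK’s `WebJobsShutdownWatcher`, whose token is signalled when App Service asks the WebJob to stop; the `ProcessingGate` class, the queue name, and the concurrency limit are hypothetical, not the project’s actual implementation.

```csharp
using System;
using System.Threading;
using Microsoft.Azure.WebJobs;

// Hypothetical gate shared by every WebJob function in this process: once the
// shutdown signal is seen, no new message may start processing, and in-flight
// messages can drain before the instance disappears.
public static class ProcessingGate
{
    private const int MaxConcurrentMessages = 16; // illustrative limit
    private static readonly SemaphoreSlim InFlight =
        new SemaphoreSlim(MaxConcurrentMessages, MaxConcurrentMessages);
    private static volatile bool _shuttingDown;

    public static void BeginShutdown() => _shuttingDown = true;

    // Reserve a processing slot, or refuse if shutdown has started.
    public static bool TryBegin()
    {
        if (_shuttingDown) return false;
        InFlight.Wait();
        if (!_shuttingDown) return true;
        InFlight.Release();
        return false;
    }

    public static void End() => InFlight.Release();
}

public static class Program
{
    public static void Main()
    {
        // App Service signals this token when the WebJob is being stopped,
        // for example because the instance is about to be removed by a scale-down.
        var shutdownWatcher = new WebJobsShutdownWatcher();
        shutdownWatcher.Token.Register(ProcessingGate.BeginShutdown);

        // ... build and run the JobHost as usual ...
    }
}

public class GatedFunctions
{
    public static void ProcessOrder([QueueTrigger("orders-step1")] string message)
    {
        if (!ProcessingGate.TryBegin())
        {
            // Throwing leaves the message on the queue so a surviving
            // instance can pick it up later.
            throw new OperationCanceledException("Instance is shutting down.");
        }

        try
        {
            // Enqueueing the next-step message and updating the database
            // status are not atomic, so neither is allowed to start once
            // the shutdown signal has been received.
            // ... business logic ...
        }
        finally
        {
            ProcessingGate.End();
        }
    }
}
```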

With that, we resolved the main issues from our auto scaling journey. Other problems surfaced as well, but we will cover them in another post.

We hope you learned something from one of the most common mistakes made when scaling an application.

Regards.

Daniel Azar
Lead Software Design Engineer

My research interests include computer graphics and artificial intelligence.