One of our customers, British Telecom, had faced an issue with a very high-traffic, analytics-driven web application. When one of their customers logged into the portal, they would see a large amount of analytical data on the dashboard. Since the data is customized per user, it has to be evaluated at run time. It worked pretty well during normal hours, but during busy hours a login could take more than a minute, and there were cases of timing out. We solved this problem with a message queue.
Now you might be wondering: what is a message queue?
Well, message queuing allows applications to communicate by sending messages to each other. The message queue provides temporary message storage when the destination program is busy or not connected.
The basic architecture of a message queue is simple: client applications called producers create messages and deliver them to the message queue. Another application, called the consumer, connects to the queue and gets the messages to be processed. Messages placed onto the queue are stored until the consumer retrieves them.
A message queue provides an asynchronous communication protocol: a system that puts a message onto a message queue does not require an immediate response to continue processing. Email is probably the best example of asynchronous messaging; when an email is sent, the sender can continue working on other things without an immediate response from the receiver. This way of handling messages decouples the producer from the consumer: the two do not need to interact with the message queue at the same time.
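To make the producer/consumer idea concrete, here is a minimal sketch in Python using the standard library's queue module. It is illustrative only; the message contents and the one-second processing delay are made up.

```python
import queue
import threading
import time

message_queue = queue.Queue()  # temporary storage between producer and consumer

def producer():
    # The producer puts messages onto the queue and moves on immediately;
    # it never waits for the consumer to respond (asynchronous messaging).
    for i in range(5):
        message_queue.put(f"message {i}")
        print(f"produced message {i}")

def consumer():
    # The consumer retrieves messages from the queue at its own pace.
    while True:
        msg = message_queue.get()  # blocks until a message is available
        time.sleep(1)              # simulate slow processing
        print(f"consumed {msg}")
        message_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()            # finishes almost instantly; messages wait in the queue
message_queue.join()  # block until every message has been processed
```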
So here is what happened:
We had a distributed architecture, with a load balancer passing each request to one of the nodes. That didn't work well, because these requests vary from user to user: processing can take anywhere from 1 second to 15 seconds, owing to computational complexity and bottlenecks in the underlying downstream systems.
We could have scaled up or scaled horizontally, but we noticed that on many occasions 2 out of 5 nodes were at 100% CPU usage while the others were at 50% or even less. The nodes at 100% were probably occupied with hundreds of threads performing complex data analytics, while the other nodes were probably serving requests with low computational complexity, whose downstream responses were relatively simple and fast. The distribution of work was entirely down to the decisions made by the load balancer.
At the worst times, there were cases where customers failed to log in and the page timed out, simply because the load balancer had assigned the job to a node that was already too busy to serve the request. After a significant amount of analysis, we decided to try re-architecting the solution in a different way.
Rather than having the workload pushed, we wanted it to be pulled, based on server capacity and heuristics derived from each user's historical data. We managed, to some extent, to bring a bit of machine learning into this scenario, in the form of very simple statistical classification, and that just nailed it. I will go into more detail about how we did it.
We took the brave decision of removing the load balancer and replacing it with a custom application we wrote, which queues the requests as users try to log into the system and lets the worker nodes pick them up from the queue one by one.
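A rough sketch of that pull model, using an in-process queue for illustration (the real system was a custom networked application; the node count and the handle_login function here are hypothetical placeholders):

```python
import queue
import threading

login_requests = queue.Queue()  # requests queue up here rather than being pushed out by a load balancer

def handle_login(request):
    # Placeholder for the real work: evaluating the user's analytical
    # data, which in practice took anywhere from 1 to 15 seconds.
    pass

def worker(node_id):
    # Each worker node pulls the next request only when it is free, so a
    # node busy with a heavy request simply stops taking on new work.
    while True:
        request = login_requests.get()
        handle_login(request)
        login_requests.task_done()

for node_id in range(5):  # five nodes, matching the cluster described above
    threading.Thread(target=worker, args=(node_id,), daemon=True).start()
```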
This fixed our problem to a certain extent. However, we still could not use the CPU evenly across the nodes, there were still cases of bottlenecks in downstream systems, and the response time didn't drastically improve. We knew there was room for improvement here, and that with the business growing we would hit a dead end, and not so far away.
We picked up phase two of our work and re-architected again, on top of phase one. Now, rather than blindly depending on the nodes to pick up the work, each request is queued based on historical data from the user's previous logins.
We created four different queues based on the weighted cumulative average of the time taken by the last 10 logins of that particular user. The highest weights go to the most recent logins: if the latest login is n, then the weight of n is 10, of n-1 is 9, of n-2 is 8, and so on. We found that this weighted cumulative average gave us the closest approximation. There is also a heuristic mechanism on the nodes to estimate the time remaining for all the work currently in progress on that worker node. This gives us great flexibility in picking up work from the defined queues. The diagram below shows how the strategic queues are formed and categorized by the time it generally takes to process the analytical data, based on previous login sessions.
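As a concrete illustration of the weighting and bucketing, here is a minimal sketch in Python. The four time thresholds and queue names are hypothetical; the real cut-offs were tuned from production data and are not reproduced here.

```python
def weighted_average_login_time(last_logins):
    """Weighted cumulative average of up to the last 10 login times.

    last_logins holds durations in seconds, most recent first; the most
    recent login gets weight 10, the next one 9, and so on.
    """
    recent = last_logins[:10]
    weights = range(10, 10 - len(recent), -1)  # 10, 9, 8, ...
    return sum(w * t for w, t in zip(weights, recent)) / sum(weights)

def choose_queue(last_logins):
    # Hypothetical cut-offs for the four queues; in the real system these
    # were derived from how long the analytics generally took to process.
    avg = weighted_average_login_time(last_logins)
    if avg <= 2:
        return "fast"
    elif avg <= 5:
        return "medium"
    elif avg <= 10:
        return "slow"
    return "very-slow"

# A user whose recent logins have been getting slower is steered toward a
# slower queue, because the most recent logins carry the highest weights.
print(choose_queue([12, 9, 4, 3, 2, 2, 1, 1, 1, 1]))  # -> "slow" (avg = 5.2)
```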