Scattering and Gathering the workload

Rakesh Malhotra
Apr 7, 2021 · 6 min read

We live in a world that is predominantly governed by software, and in these unprecedented times, when everyone expects to get groceries, medication, and other daily needs from the confines of their homes, software has become even more critical.

Software companies are churning out software that suits these needs, and much more, at an unmatched pace, trying to capture more and more market value with new and innovative ideas. One thing that has stood the test of time, though, is the challenge of architecting and designing software that is efficient, resilient, robust, scalable, maintainable, extensible, and what not!

There are many architecture and design patterns that software can be built upon; the one in the limelight today is none other than Microservices-based architecture.

But does merely using a Microservices architecture guarantee success? Maybe, maybe not. Let's explore a scenario that is generally overlooked and not given the importance and recognition it deserves.

The scenario

When working on a Microservices-based architecture, we generally end up with multiple microservices, all of them jointly producing a coherent user experience. There are times when some of the steps needed to process a given request can be taken out of the main flow or main thread and offloaded to separate jobs or threads. There are various advantages and benefits to be had should this approach be taken. But there's a caveat here!

Let us be specific to make this clearer: there is a Microservice A that, as part of some request processing, emits an event that is specific to a customer (the weight of each such event depends on the number of employees the customer has on its rolls). Assume the architecture uses Kafka as the broker, and that there is a Microservice B with 4 consuming Pods (a Kubernetes term). Each event lands on one of the partitions of the Topic (let's also assume the Topic has 4 partitions) and is consumed by only a single Pod of Microservice B. Refer to the diagram below for more clarity.
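If it helps to see the emitting side in code, here is a minimal sketch. The topic name, payload shape, and the choice of the customer ID as the message key are assumptions for illustration; with key-based (default) partitioning, all events for a given customer hash to the same partition and therefore reach the same Pod of Microservice B.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerEventPublisher {

    private final KafkaProducer<String, String> producer;

    public CustomerEventPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // One event per customer; the key (customerId) decides the partition, so a
    // "heavy" customer's events always land on the same Pod of Microservice B.
    public void publish(String customerId, String payloadJson) {
        producer.send(new ProducerRecord<>("customer-events", customerId, payloadJson));
    }
}
```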

Now consider that Microservice A is emitting events for customers that vary drastically in the number of employees on their rolls. For all practical purposes, any one of those 4 Pods could end up being too busy while the other Pods stay relatively free or less busy.

As can be seen in the depiction above, POD 2 happens to be receiving messages that are relatively bulkier (the ones that need more memory and CPU during processing) than those of the other three Pods. If a Pod stays that busy for long enough, it may end up crashing, and Kubernetes will have to spin up another Pod to maintain the balance in the cluster. What happens to the in-flight requests? What happens to any uncommitted data present in the heap? These are serious concerns, aren't they? The problem clearly needs to be addressed from a design standpoint and can't simply be left to the infrastructure components.

What options do we have

Let's explore together what tools and techniques we have in our armory to handle this.

Option#1 — Scaling
The very first thing one can do is scale the Pods up to the point where they can tolerate whatever traffic they may end up getting at any given time. This approach may work, but the system may one day have to process an event for a customer with thousands of employees. Moreover, if the Pods are not processing such huge events all the time (which will be the case most of the time), then we are simply wasting the overall capacity.

Option#2 — Publisher discretion
In this approach, the publisher, which in our example is Microservice A, needs to do the following:

  1. Categorize events into Large, Medium, and Small by checking all the factors that can cause additional load on the consumers. For the example we are working with, it is the number of employees on a customer's rolls, but it can certainly vary
  2. Maintain, somewhere, a brief history of the events it has emitted and the partition each one went to. Using this, the Producer can determine which of the consuming Pods are relatively busier and avoid emitting events to those partitions. The question may arise: how can the Producer know which Pod is busy and which is not? There are metadata APIs that modern brokers like Kafka expose which can be used for this purpose (a rough sketch follows this list)
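The sketch below shows what this publisher-side bookkeeping could look like. The size thresholds, the in-memory weight map, and the explicit partition selection are all illustrative assumptions; a real implementation would more likely derive "busyness" from the broker's metadata or consumer-lag APIs rather than a local counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WeightAwarePublisher {

    enum Size { SMALL, MEDIUM, LARGE }

    private final KafkaProducer<String, String> producer;
    private final int partitionCount;
    // Rough, local view of how much "weight" each partition has recently received.
    private final Map<Integer, Long> recentWeightByPartition = new ConcurrentHashMap<>();

    public WeightAwarePublisher(KafkaProducer<String, String> producer, int partitionCount) {
        this.producer = producer;
        this.partitionCount = partitionCount;
    }

    // Step 1: categorize the event by the factor that drives consumer load
    // (in our example, the number of employees on the customer's rolls).
    Size categorize(int employeeCount) {
        if (employeeCount > 5_000) return Size.LARGE;
        if (employeeCount > 500)  return Size.MEDIUM;
        return Size.SMALL;
    }

    // Step 2: pick the partition that has received the least weight recently,
    // record the send, and publish to that partition explicitly.
    public void publish(String customerId, String payloadJson, int employeeCount) {
        int target = 0;
        long best = Long.MAX_VALUE;
        for (int p = 0; p < partitionCount; p++) {
            long w = recentWeightByPartition.getOrDefault(p, 0L);
            if (w < best) { best = w; target = p; }
        }
        recentWeightByPartition.merge(target, (long) employeeCount, Long::sum);
        producer.send(new ProducerRecord<>("customer-events", target, customerId, payloadJson));
    }
}
```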

Of all the things the Producer now needs to do, most don't align with the core context for which Microservice A came into existence in the first place, and hence the team developing Microservice A has to take this additional piece onto their plate.

Option#3 — Scatter and Gather
This is one of the Enterprise Integration patterns, and the core idea behind it is to break the message into multiple parts, let those parts be processed independently, and finally consolidate the results of the independent processing.

Now let us figure out how to use this pattern in the example we have been discussing so far. In our example, Microservice A is emitting events whose weight varies from customer to customer; to apply the pattern, we have the following alternatives:

  1. We can let Microservice A additionally take on the responsibility of scattering the messages, i.e. instead of sending a single event covering all the employees, it first fetches the list of employees on a customer's rolls and then emits one message per employee (a sketch follows this list), or alternatively,
  2. Microservice A can continue sending a single event per customer, and Microservice B, before processing the message, fetches all the employees, emits one message per employee it finds, and then also consumes those messages for eventual processing
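Here is a minimal sketch of the scattering side under alternative #1. The employee-lookup service, the "employee-events" topic, and the JSON payload shape are assumptions for illustration; the important parts are that each message is small enough for any Pod to handle, and that every message carries a correlation ID plus the expected total so the gather step later knows when it is done.

```java
import java.util.List;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Scatterer {

    // Hypothetical lookup for the employees on a customer's rolls.
    interface EmployeeDirectory { List<String> employeesOf(String customerId); }

    private final KafkaProducer<String, String> producer;
    private final EmployeeDirectory employeeDirectory;

    public Scatterer(KafkaProducer<String, String> producer, EmployeeDirectory employeeDirectory) {
        this.producer = producer;
        this.employeeDirectory = employeeDirectory;
    }

    public String scatter(String customerId) {
        String correlationId = UUID.randomUUID().toString();
        List<String> employeeIds = employeeDirectory.employeesOf(customerId);

        for (String employeeId : employeeIds) {
            // Each message is small and self-contained, so any Pod of Microservice B can take it.
            String payload = String.format(
                "{\"correlationId\":\"%s\",\"customerId\":\"%s\",\"employeeId\":\"%s\",\"expected\":%d}",
                correlationId, customerId, employeeId, employeeIds.size());
            // Keying by employeeId spreads the per-employee messages across partitions.
            producer.send(new ProducerRecord<>("employee-events", employeeId, payload));
        }
        return correlationId; // Microservice A tracks the transaction under this id
    }
}
```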

The problem with Microservice B donning this additional hat is that, design-wise, it doesn't look clean and also violates the single responsibility principle, whereas if Microservice A does it, the result is neater and less coupled.

So with #1 above, the problem of scattering the messages is tackled. Now let's figure out how to gather back everything we have broken apart and scattered.

Since Microservice A initiated the transaction, which spans Microservice B, it needs to know when to mark the transaction as either a Success or a Failure. For this, Microservice B has to keep track of how many messages it expects versus how many it has received and how many of them it has processed (either successfully or unsuccessfully). Since all the Pods of Microservice B share a common persistence store, this information can be maintained in that store and checked to see whether everything is complete, something has failed, or things are still in progress.
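A minimal sketch of that bookkeeping follows, under the assumption that the shared store is keyed by the correlation ID carried on every scattered message; the ProgressStore interface and counter names are illustrative, not a prescribed schema.

```java
public class GatherTracker {

    // Illustrative abstraction over the shared persistence store that all Pods of
    // Microservice B update (e.g. a table keyed by correlationId).
    public interface ProgressStore {
        // Atomically bumps the succeeded or failed counter for this correlation id
        // and returns the counts recorded so far together with the expected total.
        Progress record(String correlationId, boolean success, int expected);
    }

    public record Progress(int expected, int succeeded, int failed) {
        public boolean complete() { return succeeded + failed >= expected; }
        public boolean allSucceeded() { return complete() && failed == 0; }
    }

    private final ProgressStore store;

    public GatherTracker(ProgressStore store) {
        this.store = store;
    }

    // Called by whichever Pod of Microservice B just finished one employee-level message.
    // Returns true once all expected messages have been accounted for, i.e. the moment
    // to emit the Ack back to Microservice A, as described next.
    public boolean onMessageProcessed(String correlationId, boolean success, int expected) {
        return store.record(correlationId, success, expected).complete();
    }
}
```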

After this, Microservice B can emit an Ack message for Microservice A's consumption, and upon receiving this message, Microservice A can mark the transaction complete (Success/Failure based on whatever outcome Microservice B reported).
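On the other side, Microservice A only needs a small listener for that Ack. The "scatter-acks" topic, the group id, and the markTransaction helper below are assumptions for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AckListener {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "microservice-a-ack-listener");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("scatter-acks"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // key = correlationId, value = SUCCESS or FAILURE as reported by Microservice B
                    markTransaction(rec.key(), rec.value());
                }
            }
        }
    }

    // Hypothetical helper: closes out the transaction Microservice A opened when it scattered.
    static void markTransaction(String correlationId, String outcome) {
        System.out.printf("Transaction %s finished with %s%n", correlationId, outcome);
    }
}
```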

Conclusion

As we saw in the example above, this is not purely a message distribution problem; the problem really rests on the weight of the messages emitted by the publisher. As explained above, there are multiple ways to handle it, and whichever option we end up using has to be deliberated based upon the use case at hand.

At the same time, the Scatter-Gather pattern should not be used only to speed up overall request/task processing by parallelizing it, but also to optimally utilize the overall resources the cluster has.
