Let’s walk through Pipes and Filters

Rakesh Malhotra
7 min read · Nov 11, 2021

We often come across situations in software development that are common in nature and that others have experienced before, in one way or another. These situations can be handled either by building a very custom, one-off solution or by using an existing, widely accepted, and time-tested Design Pattern. Today’s topic is one such Design Pattern: it has been around for quite some time now and has been the basis of quite a few well-known implementations.

The Problem

Applications perform a variety of tasks on the data they receive from a variety of sources. When all of those tasks run within the same deployable, it becomes difficult to optimize the overall processing. As an example, let’s assume an application performs 4 tasks T1, T2, T3, and T4 in sequence on the data it receives, and that of these tasks, T2 is the most complicated, CPU-intensive operation. Since all of these tasks are part of the same deployable, the only way to optimize T2 is to enhance the underlying hardware for the entire application.
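To make the coupling concrete, here is a hypothetical sketch of such a single-deployable pipeline; the task names T1–T4 are from the example above, but their bodies are placeholders. Because everything runs in one process, giving T2 more CPU means scaling the hardware for the whole application:

```java
// All four tasks live in one deployable; they share one process and one
// hardware profile, so the CPU-hungry T2 cannot be scaled on its own.
public class MonolithicProcessor {

    public String process(String data) {
        data = t1(data);  // cheap parsing
        data = t2(data);  // CPU-intensive transformation (the bottleneck)
        data = t3(data);  // enrichment
        return t4(data);  // formatting / publishing
    }

    private String t1(String d) { return d.trim(); }
    private String t2(String d) { return d.toUpperCase(); } // stand-in for heavy work
    private String t3(String d) { return d + "-enriched"; }
    private String t4(String d) { return "[" + d + "]"; }
}
```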

Apart from this, the ordering of these tasks may need to change, or a new task may need to be brought into the overall processing. To accommodate either a new sequence or a new task altogether, we may need to re-shuffle or refactor the existing tasks. In either case, changes to the existing tasks are unavoidable, and that leads to regression cycles on already-tested functionality.

These constraints make it difficult for the application to perform at its peak. The picture below depicts these tasks working as part of a single deployable, tightly coupled in a way that limits changing them to accommodate dynamic business needs.

With that said, what are the options for tackling this? Let’s dive in…

The Proposal

The best thing one can do here is to decompose the application so that each of these tasks can be individually controlled, managed, and deployed, i.e. each has its own life cycle and does not depend on the others. As mentioned in the example from ‘The Problem’ section, task T2 is the most complicated and CPU-intensive of the 4 tasks the application runs to process the incoming data. Once the tasks are split, we can scale T2 out and provide more CPU power to it alone, which reduces the overall processing time.

The order in which these tasks execute does not change with this split. What changes in the overall processing is that the tasks are deployed independently, the resources on which they operate can be scaled up or down independently, and each task can be refactored without impacting the others. This also means that all of the tasks need to define their contracts upfront, so that each of them can honor those contracts while processing real-time data.
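One way to pin those contracts down up front is a shared interface that every task implements. A minimal sketch (the interface name and the string message shape are assumptions for illustration):

```java
// A contract shared by every task: each consumes and produces the same
// message shape, so tasks can be reordered, replaced, or redeployed
// without touching the others' internals.
public interface Task {
    String apply(String message);
}
```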

Splitting them also brings another advantage: the tasks become more manageable and reusable while also being more efficient. There is a small caveat attached to all these benefits, though: the processing is now split across distributed machines, which brings in additional complexity.

The Pattern

The tasks that we have been referring to are nothing but ‘Filters’, the conduits by which they are connected to each other are called ‘Pipes’, and the design pattern that we have been after is well known as the Pipes and Filters Pattern. We can have as many Filters as we need for handling any business use case, while each such Filter remains ignorant of any other Filter present in the overall chain.

The following diagram depicts the overall picture. As we have learned, the chain can have as many Filters as needed. One thing to note in this design is that the throughput of the overall processing boils down to the slowest-running Filter in the chain. Hence, in order to optimize the overall processing time, we just need to optimize that slowest-running Filter, which is a huge advantage over the conventional design where all the tasks were part of the same deployment.

Filter — A component that is tied to a Pipe and is mainly responsible for transforming or filtering the data it receives from the Pipe it is connected with. There is no limit on the number of input and/or output Pipes associated with any Filter.

Pipe — A conduit that connects one Filter with another and is mainly responsible for carrying the data. It normally has buffering capabilities, so that if the subsequent Filter becomes unavailable it can retain messages and pass them on as and when the Filter comes back online.
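Putting these two definitions together, a minimal in-process sketch might look like the following. The class names, the bounded in-memory queue, and the single-input/single-output shape are all simplifying assumptions; in a distributed deployment the Pipe would typically be a message broker rather than a local queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// A Pipe is a buffered conduit: if the downstream Filter is slow or briefly
// unavailable, messages wait in the bounded queue instead of being lost.
class Pipe {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
    void send(String msg) throws InterruptedException { buffer.put(msg); }
    String receive() throws InterruptedException { return buffer.take(); }
}

// A Filter knows only its own input and output Pipes; it is ignorant of any
// other Filter in the chain.
class Filter implements Runnable {
    private final Pipe in, out;
    private final Function<String, String> transform;

    Filter(Pipe in, Pipe out, Function<String, String> transform) {
        this.in = in; this.out = out; this.transform = transform;
    }

    @Override public void run() {
        try {
            while (true) {
                out.send(transform.apply(in.receive()));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut down cleanly
        }
    }
}
```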

In the modern world, where everything is on the cloud and distributed across different Regions/AZs, we can use our discretion to deploy each Filter to whichever Region/AZ combination allows it to run in close proximity to the data it needs.

When data is being streamed from the Source, the various Filters participating in the chain can work in parallel: a Filter does not need to finish processing the whole stream before the subsequent Filter can pick up its output as input.
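With queue-backed Pipes like the sketch above, this pipeline parallelism falls out naturally: each Filter runs on its own thread and can start on message n+1 while its downstream neighbor is still working on message n. A hypothetical wiring, reusing the illustrative Pipe and Filter classes from the previous snippet:

```java
public class StreamingDemo {
    public static void main(String[] args) throws InterruptedException {
        Pipe source = new Pipe(), ab = new Pipe(), bc = new Pipe(), sink = new Pipe();

        // Each Filter gets its own thread, so all three stages run concurrently:
        // stage 2 starts on message 1 while stage 1 is already on message 2.
        new Thread(new Filter(source, ab, String::trim)).start();
        new Thread(new Filter(ab, bc, String::toUpperCase)).start();
        new Thread(new Filter(bc, sink, s -> "[" + s + "]")).start();

        source.send("  first message ");
        source.send("  second message ");
        System.out.println(sink.receive()); // [FIRST MESSAGE]
        System.out.println(sink.receive()); // [SECOND MESSAGE]
    }
}
```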

Another aspect worth mentioning in the context of this Pattern is how fault-tolerant it can be. Assume we have 3 Filters, T1, T2, and T3, running in succession on the data emitted by a Source. If T2 happens to misbehave or fail at random, what happens to the changes that T1 has already committed to its datastore? We are not worried about T3, since nothing will have reached it; the processing stopped at T2 itself. This is an important aspect that can’t be left aside for future handling. In such cases, we need to build something similar to the ‘SAGA’ design pattern, which brings in a couple of aspects:

  1. Compensation
  2. Idempotency

When we opt for Compensation, each Filter that executed successfully before something broke later in the chain needs a defined path it can follow to bring its state back to where it was before the change.
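A minimal sketch of such a compensation path, assuming each Filter exposes an explicit undo operation (the interface and runner names here are hypothetical, and a real SAGA implementation would track per-step state durably):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Each Filter that commits side effects also exposes an undo path.
interface CompensatableFilter {
    String apply(String message) throws Exception;
    void compensate(String message); // roll back whatever apply() committed
}

class SagaRunner {
    // Run the chain in order; if a Filter fails, walk back through the ones
    // that succeeded, in reverse order, compensating each of them.
    static void run(List<CompensatableFilter> chain, String message) {
        Deque<CompensatableFilter> completed = new ArrayDeque<>();
        try {
            for (CompensatableFilter filter : chain) {
                message = filter.apply(message);
                completed.push(filter);
            }
        } catch (Exception failure) {
            while (!completed.isEmpty()) {
                completed.pop().compensate(message);
            }
        }
    }
}
```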

And should we pursue the Idempotency route, we need to ensure that the Filters can handle repeated invocations with the same data. In other words, they need a mechanism that prevents them from entering an inconsistent state even across multiple identical invocations.
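One common way to achieve this is to key every message with an ID and have each Filter skip IDs it has already seen. A minimal sketch, with an in-memory set standing in for what would normally be a durable store:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// An idempotent Filter: re-delivery of the same message ID is a no-op,
// so retries after a failure cannot push the Filter into an inconsistent state.
class IdempotentFilter {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final UnaryOperator<String> transform;

    IdempotentFilter(UnaryOperator<String> transform) { this.transform = transform; }

    String process(String messageId, String payload) {
        if (!seen.add(messageId)) {
            return null; // already handled this message; skip the side effects
        }
        return transform.apply(payload);
    }
}
```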

An Example

So far we have been working with a theoretical example; now let’s dive into this Pattern with a real-life one. Let’s say there is a Source that emits messages. Each message contains a string that is a totally jumbled-up mixture of digits, letters, and special characters, and what we as business users need from the system is a message consisting of just the numbers, sorted in ascending order.

Alright, so we have a task at hand. In summary, if we put the requirements down in bullet points, they look like this:

  1. Remove Alphabets
  2. Remove Special characters
  3. Sort the list of numbers in ascending order

At a high level, we can define 3 Filters; let’s call them AlphabetFilter, SpecialCharacterFilter, and AscendingSortFilter.

The jumbled-up string from the Source lands on AlphabetFilter, which produces a string containing just a mix of special characters and numbers.

The SpecialCharacterFilter picks this up as its input and produces a string of just unsorted numbers, filtering out all the special characters.

Lastly, the AscendingSortFilter picks up the unsorted number string and produces a string of numbers sorted as desired. The result so obtained can be sent to other components of the system for further processing.
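The following is a minimal, single-process sketch of the three Filters chained together; plain functions stand in for the independently deployed Filters, and the sample input is made up:

```java
import java.util.function.UnaryOperator;

public class JumbledStringPipeline {

    // Filter 1: drop letters, keeping digits and special characters.
    static final UnaryOperator<String> ALPHABET_FILTER =
            s -> s.replaceAll("[a-zA-Z]", "");

    // Filter 2: drop everything that is not a digit.
    static final UnaryOperator<String> SPECIAL_CHARACTER_FILTER =
            s -> s.replaceAll("[^0-9]", "");

    // Filter 3: sort the remaining digits in ascending order.
    static final UnaryOperator<String> ASCENDING_SORT_FILTER =
            s -> s.chars().sorted()
                  .collect(StringBuilder::new,
                           StringBuilder::appendCodePoint,
                           StringBuilder::append)
                  .toString();

    public static void main(String[] args) {
        String jumbled = "b3@x9#1a$7%2";  // made-up sample input
        String sorted = ASCENDING_SORT_FILTER.apply(
                SPECIAL_CHARACTER_FILTER.apply(
                        ALPHABET_FILTER.apply(jumbled)));
        System.out.println(sorted); // prints 12379
    }
}
```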

Conclusion

As we discussed above, this Pattern is great when we have to perform multiple operations on the data emitted by a Source. There are considerations to be watchful of when opting for this Pattern; among the ones we talked about were complexity, fault tolerance, distributed deployment, and idempotency.
