
Scalable serverless workflows using BPMN, Zeebe and Fn Project

Artun Subasi

Imagine that you have a serverless architecture with a Function as a Service (FaaS) platform, and your functions scale well horizontally. There are still many ways to orchestrate FaaS functions into a workflow: some don’t scale well, and some are hard to understand. In my previous article, I compared multiple approaches and shared my small workflow experiment. I’m convinced that Business Process Model and Notation (BPMN) is a great way to create scalable, self-explanatory serverless workflows. Since horizontal scalability is always a big issue, I’d like to demonstrate how BPMN-based workflows can scale horizontally by using Zeebe as the workflow engine and Fn Project as the FaaS platform.

BPMN and horizontal scalability

One huge limiting factor for horizontal scalability is the dependency on external databases. To demonstrate the problem, let us start with the following simple scenario using a typical BPM engine:

Scalable serverless workflows - Step 1

In the above scenario, we have a couple of managed FaaS functions. We have a BPM (workflow) engine to orchestrate the functions using BPMN-based workflows. And finally, we have a workflow database in which the BPM engine stores all workflow-related data such as workflow definitions, running workflows and input/output data.

Let’s scale everything up step by step.

Horizontally scaling the FaaS functions

When using FaaS functions, the FaaS platform is responsible for scaling and running our functions as needed.

Scalable serverless workflows - Step 2

If the load is very high, the platform can run a lot of FaaS functions. However, they all depend on the BPM engine which orchestrates them. The BPM engine becomes a bottleneck and a single point of failure because it orchestrates every function in the FaaS platform.

Horizontally scaling the BPM engine

Adding more BPM engines can distribute the load during workflow processing and increase the overall robustness of the system.

Scalable serverless workflows - Step 3

However, the BPM engines still share the same database. We just moved the bottleneck from the BPM engine to the database. The database is still a potential single point of failure.

Horizontally scaling the shared database

The next improvement could be introducing database clustering.

Scalable serverless workflows - Step 4

Such a master-slave database cluster provides higher availability, but performance will remain an issue. Adding horizontal database partitioning may be another option if the BPM engine supports it.

I personally think that an architecture as demonstrated above will be sufficient for the majority of cases. In many cases, using a BPM engine with a relational database would be more than enough.

But what if you really have to scale up? The scalability of your functions is given by the FaaS platform. You can scale the BPM engines by deploying hundreds of them on demand. But when it comes to the external database dependency, you are limited by the clustering and partitioning mechanisms of the database provider.

How Zeebe tackles the problem

Zeebe, a BPMN-based workflow engine for microservices orchestration, addresses the problem by removing the dependency on the external database and letting the workflow engine distribute workflow-related data in a peer-to-peer network.

Scalable serverless workflows with Zeebe

The above diagram shows, in a simplified way, what a Zeebe cluster looks like. Instead of depending on an external database, the workflow engine is distributed across Zeebe brokers which store their own data. The Zeebe brokers form a peer-to-peer network with no single point of failure. The brokers distribute the workflow data in partitions to enable horizontal scalability. The data may also be replicated across the brokers to increase fault tolerance. The details about the protocols can be found in the clustering section of the Zeebe documentation.
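To make partitioning and replication a little more concrete, here is a minimal sketch. It is not Zeebe's actual algorithm: it simply assumes that a workflow instance is mapped to a partition by its key, and that each partition is copied to consecutive brokers. The class name, method names and numbers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {

    // Maps a workflow instance key to one of `partitionCount` partitions.
    static int partitionFor(long workflowInstanceKey, int partitionCount) {
        return (int) Math.floorMod(workflowInstanceKey, (long) partitionCount);
    }

    // Returns the brokers holding a copy of the given partition.
    // Assumes replicationFactor <= brokerCount.
    static List<Integer> brokersFor(int partition, int brokerCount, int replicationFactor) {
        List<Integer> brokers = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            brokers.add((partition + i) % brokerCount);
        }
        return brokers;
    }

    public static void main(String[] args) {
        int brokerCount = 3;
        int partitionCount = 3;
        int replicationFactor = 2;
        for (long key = 100; key < 104; key++) {
            int partition = partitionFor(key, partitionCount);
            System.out.println("instance " + key + " -> partition " + partition
                + " on brokers " + brokersFor(partition, brokerCount, replicationFactor));
        }
    }
}
```

In this sketch, adding brokers and partitions spreads the instances over more machines, and losing one broker still leaves a copy of each partition on another one.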

From Push to Pull with the publish-subscribe interaction model

In contrast to BPM engines which typically call services synchronously, Zeebe uses a publish-subscribe interaction model to ensure loose coupling between tasks and their workers.

Assuming we are using the Fn Project as a FaaS platform, a BPMN workflow with a typical BPM engine could look as follows:

Scalable serverless workflows - Push Method

The workflow contains three tasks. Each task calls a function within the Fn Project platform synchronously. If the function is not available at execution time, the workflow instance ends up in an error state.
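The failure mode of the push method can be sketched in a few lines. This is only an in-memory model under my own assumptions, not the API of any BPM engine: functions are plain Java lambdas, only two of the three tasks are modelled, and a `null` entry stands for a function that is unavailable when the engine tries to call it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PushWorkflowSketch {

    enum State { COMPLETED, ERROR }

    // Calls each task's function synchronously, in workflow order.
    // A null value models a function that is unavailable at execution time.
    static State run(Map<String, UnaryOperator<String>> tasks, String input) {
        String data = input;
        for (Map.Entry<String, UnaryOperator<String>> task : tasks.entrySet()) {
            UnaryOperator<String> function = task.getValue();
            if (function == null) {
                // The engine has nobody to call, so the instance fails here.
                return State.ERROR;
            }
            data = function.apply(data);
        }
        return State.COMPLETED;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> tasks = new LinkedHashMap<>();
        tasks.put("collect-money", order -> order + " paid");
        tasks.put("fetch-items", null); // function not deployed or not reachable
        System.out.println(run(tasks, "order-1")); // ERROR

        tasks.put("fetch-items", order -> order + " fetched");
        System.out.println(run(tasks, "order-1")); // COMPLETED
    }
}
```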

Zeebe’s publish-subscribe pattern inverts the call direction. Each task provides a job worker registration interface:

Scalable serverless workflows - Pull Method

In this case, the jobs in Zeebe, as well as the functions in the Fn Project, only provide interfaces. The jobs in Zeebe are waiting to be polled by workers. The Fn functions are waiting to be invoked externally. The publish-subscribe paradigm has many advantages, but it also increases the complexity because we now need workers as glue between the available jobs in Zeebe and the Fn functions.

Connecting Zeebe jobs and Fn functions with workers

Zeebe currently provides clients for Java and Go. Workers can programmatically register themselves to pull and execute tasks. The following code demonstrates the usage of the Java client to start a new worker to work on a job type:

The above code would start a worker to poll for the job type “collect-money”. As soon as a job is available, the defined handler would be executed, which does “something” and then completes the job. Completing the job lets the workflow instance continue with the next job, “Fetch Items”. You can find a full example including library dependencies, imports, etc. in the Zeebe documentation.
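The pull behaviour itself can also be illustrated without the Zeebe client library. The following is a minimal in-memory sketch and deliberately not Zeebe's API: `JobQueue` and `drain` are hypothetical names, the queue stands in for the broker, and a real worker would keep polling instead of returning once the queue is empty.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

public class PullWorkerSketch {

    // Stand-in for the broker: holds open jobs per job type until a worker polls.
    static class JobQueue {
        private final Map<String, Queue<String>> jobsByType = new HashMap<>();

        void publish(String jobType, String payload) {
            jobsByType.computeIfAbsent(jobType, t -> new ArrayDeque<>()).add(payload);
        }

        // Returns the next job of the given type, or null if none is waiting.
        String poll(String jobType) {
            Queue<String> jobs = jobsByType.get(jobType);
            return jobs == null ? null : jobs.poll();
        }
    }

    // Pulls every waiting job of `jobType`, runs the handler, and publishes a
    // job of `nextJobType` to model the workflow instance moving on.
    static int drain(JobQueue queue, String jobType, String nextJobType,
                     Consumer<String> handler) {
        int completed = 0;
        String job;
        while ((job = queue.poll(jobType)) != null) {
            handler.accept(job);
            queue.publish(nextJobType, job);
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        JobQueue queue = new JobQueue();
        // The job is published before any worker exists; it simply waits.
        queue.publish("collect-money", "order-1");

        int done = drain(queue, "collect-money", "fetch-items",
            job -> System.out.println("collecting money for " + job));
        System.out.println(done + " job(s) completed, next job waiting: "
            + queue.poll("fetch-items"));
    }
}
```

The point of the sketch is the contrast with the push method: a job that has no worker yet is not an error, it just stays in the queue until somebody asks for it.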

In our case, we just want the job workers to call the existing FaaS functions as follows:

Scalable serverless workflows - Zeebe Workers
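What such a worker does for each job is essentially one HTTP call. Here is a rough sketch of that call using the HTTP client that ships with Java 11 and later. The URL is a placeholder of mine, and the exact invoke path depends on how the function is exposed by your Fn installation, so treat both as assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FnInvokeSketch {

    // Builds the HTTP call a worker would make for one job.
    static HttpRequest buildInvokeRequest(String functionUrl, String jobPayloadJson) {
        return HttpRequest.newBuilder()
            .uri(URI.create(functionUrl))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jobPayloadJson))
            .build();
    }

    // Sends the request and returns the function's response body, which the
    // worker could hand back to the workflow engine when completing the job.
    static String invoke(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 300) {
            throw new IllegalStateException("function call failed: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Placeholder URL, an assumption for illustration only.
        HttpRequest request = buildInvokeRequest(
            "http://localhost:8080/t/shop/collect-money", "{\"orderId\": 1}");
        System.out.println(request.method() + " " + request.uri());
        // invoke(HttpClient.newHttpClient(), request) would perform the call
        // against a running Fn server.
    }
}
```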

Automating the worker implementations with Fn extensions

The above picture with the workers looks complicated to me. It also sounds like a lot of work to implement Java/Go clients to poll tasks and call the functions, because in the end we just want to connect Fn functions to the job types, as shown in the next image:

Scalable serverless workflows - Simplified

There are different ways to achieve this goal without having to implement every worker yourself. A simple standalone application that starts workers automatically could connect the job types to the functions. However, that would introduce yet another application and more complexity into the system.

Luckily, the Fn Project has an extension mechanism for the Fn Server. Among other things, an Fn extension can listen to events such as “new function deployed” and “function updated”. Using such events, it is possible to connect Fn functions to Zeebe jobs via configuration as soon as they are deployed.
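The idea of connecting functions to job types via configuration can be modelled as follows. This is only a conceptual sketch in Java, not an Fn extension: real extensions plug into the Fn Server itself, and the config key `zeebe_job_type` as well as all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ExtensionSketch {

    // Hypothetical config key a function would set to name its Zeebe job type.
    static final String JOB_TYPE_KEY = "zeebe_job_type";

    // job type -> name of the function that should handle it
    private final Map<String, String> functionByJobType = new HashMap<>();

    // Called for "new function deployed" and "function updated" events.
    void onFunctionChanged(String functionName, Map<String, String> config) {
        // Drop any previous mapping of this function before re-reading its config.
        functionByJobType.values().removeIf(functionName::equals);
        String jobType = config.get(JOB_TYPE_KEY);
        if (jobType != null) {
            // A real extension would start a job worker for this type here.
            functionByJobType.put(jobType, functionName);
        }
    }

    // Returns the function connected to the job type, or null if there is none.
    String functionFor(String jobType) {
        return functionByJobType.get(jobType);
    }

    public static void main(String[] args) {
        ExtensionSketch extension = new ExtensionSketch();
        Map<String, String> config = new HashMap<>();
        config.put(JOB_TYPE_KEY, "collect-money");
        extension.onFunctionChanged("collect-money-fn", config);
        System.out.println(extension.functionFor("collect-money")); // collect-money-fn
    }
}
```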

BPMN, Zeebe and Fn Project

The above diagram shows a simplified version of the architecture of a possible extension. Stay tuned for my next post in which I will demonstrate the concept of such a Zeebe extension for the Fn Project. 
