Fault tolerance mechanisms
Learn which mechanisms allow the platform to be fault tolerant.
Global fault tolerance strategy
Several kinds of incidents can happen in a given system. The database can stop responding because of a crash or a network issue, the JVM on which Bonita Platform runs can fail, or external services can become unavailable. Each category of incident has a mechanism to recover from it.
- For a failure on the host machine, such as a JVM or hardware crash, Bonita Platform can be installed in a clustered architecture to ensure high availability of the service. If one node of the cluster fails, another one is available to continue executing works.
- For outages of external services called by connectors, the connectors can be replayed.
- For incidents such as an unresponsive database, two mechanisms handle the errors: the Retry mechanism and the Recovery mechanism.
Retry mechanism
This is a reactive system: when an incident happens during the execution of a work, an exception is thrown. The mechanism analyzes the exception and determines whether the work should be retried automatically.
When the exception is retryable, the platform queues the work again, with a delay. By default, this can be done up to 10 times, with the delay increasing gradually from 1 second to more than 1 hour.
The exact delay is randomized to avoid repeated congestion. A typical case is a database deadlock, where retrying all the failed works at the same time would lead to more deadlocks.
See Work execution and Work execution audit for more details.
Configuration
The retry configuration is available in the bonita-tenant-community-custom.properties file and can be updated using the setup tool.
# Retry mechanism: retry the works when they fail because of an error that is transient
# maximum number of times a work will be retried before setting it as failed
bonita.tenant.work.maxRetry=10
# delay in millis before retrying the work
bonita.tenant.work.retry.delay=1000
# factor to multiply the delay with, between two subsequent retries
bonita.tenant.work.retry.factor=2
Above is the default configuration. With it, each work can be retried up to 10 times, starting with a delay of 1 second that is multiplied by 2 at each retry.
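The three retry properties can be illustrated with a small sketch. It assumes that the nominal delay before retry n is `delay * factor^(n-1)` and that the randomization spreads each delay uniformly around its nominal value. Both the growth formula and the `spread` parameter are assumptions made for illustration; the platform's actual formula and jitter range are not described here and may differ. The function names are hypothetical.

```python
import random


def nominal_delays_ms(max_retry=10, delay_ms=1000, factor=2):
    """Nominal delay before each retry, assuming delay_ms * factor**n growth.

    The defaults mirror bonita.tenant.work.maxRetry, .retry.delay and
    .retry.factor; the growth formula itself is an assumption.
    """
    return [delay_ms * factor ** attempt for attempt in range(max_retry)]


def jittered_delay_ms(nominal_ms, rng=random, spread=0.5):
    """Randomize a nominal delay uniformly within +/- spread.

    The spread value is hypothetical: it only illustrates why failed works
    do not all come back at the same instant.
    """
    low = nominal_ms * (1 - spread)
    high = nominal_ms * (1 + spread)
    return rng.uniform(low, high)


if __name__ == "__main__":
    for attempt, nominal in enumerate(nominal_delays_ms(), start=1):
        jittered = jittered_delay_ms(nominal)
        print(f"retry {attempt}: nominal {nominal} ms, jittered {jittered:.0f} ms")
```

Because each work draws its own random delay, works that failed together on the same deadlock are unlikely to be retried at the same moment.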
Monitoring
The Work execution audit reports when a work is retried too many times.
Recovery mechanism
Starting from version 2021.1, a dedicated mechanism is responsible for recovering from incidents such as database or network outages.
At startup, the platform restarts all elements that were being executed. The recovery mechanism then scans the database every 2 hours and re-executes elements that should have been executed and were not updated during the last hour.
In a cluster environment, only one node is responsible for running the recovery at any given time.
Configuration
The recovery configuration is available in the bonita-tenant-community-custom.properties file and can be updated using the setup tool.
The default values of those properties should work for everyone. If the recovery task takes more than a few minutes, you might want to change these values to run the recovery less often. Take a look at the metrics section to understand how to measure that.
# Recovery Mechanism: recreate works when they are lost due to incidents
# All following configuration should work for everyone, it can be changed only to do performance tuning in limit-cases
# Avoid verifying elements recently modified, by default no elements updated during the last hour is considered (ISO-8601 duration format).
bonita.tenant.recover.consider_elements_older_than=PT1H
# Duration after the end of the previous execution before a new one is started. By default recovery runs every 2 hours (ISO-8601 duration format)
bonita.tenant.recover.delay_between_recovery=PT2H
bonita.tenant.recover.delay_between_recovery is the time between two scans of the database and also the time before the first scan after startup.
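The effect of bonita.tenant.recover.consider_elements_older_than can be sketched as an age filter. The example below is a minimal illustration, not the engine's implementation: `parse_duration` handles only simple time-only ISO-8601 durations such as PT1H, and `recovery_candidates` and the `last_updates` mapping are hypothetical names. The real recovery also checks that an element should have been executed, which this sketch leaves out.

```python
import re
from datetime import datetime, timedelta

# Supports only the time part of ISO-8601 durations: PT1H, PT30M, PT0.452S, PT1H30M
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def parse_duration(text):
    """Parse a simple ISO-8601 duration such as 'PT1H' into a timedelta."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


def recovery_candidates(last_updates, now, older_than="PT1H"):
    """Return the ids of elements last updated before now - older_than.

    last_updates is a hypothetical mapping of element id to last update time.
    """
    cutoff = now - parse_duration(older_than)
    return [element_id for element_id, updated in last_updates.items() if updated < cutoff]


if __name__ == "__main__":
    now = datetime(2021, 1, 1, 12, 0)
    last_updates = {
        1: datetime(2021, 1, 1, 10, 30),  # older than one hour: candidate
        2: datetime(2021, 1, 1, 11, 45),  # updated recently: skipped
    }
    print(recovery_candidates(last_updates, now))
```

Skipping recently updated elements avoids picking up elements that are still being processed normally.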
Monitoring
There are two ways to monitor the recovery mechanism:
- bonita.xxx.log file
- Metrics
Log File
The recovery mechanism produces INFO and DEBUG logs each time the recovery is triggered. They look like this:
INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor Start detecting flow nodes to restart...
INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor Recovery of elements executed, 12006 elements recovered.
INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor Restarting elements...Handled 1000 of 12006 elements candidates to be recovered in PT0.025S
[...]
INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor Restarting elements...Handled 12000 of 12006 elements candidates to be recovered in PT0.452S
INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor Recovery of elements executed, 12006 elements recovered.
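Log lines of this kind can be summarized with a short script. The sketch below is an illustration only: the regular expressions are derived solely from the sample messages shown here, so they are an assumption about the message format, which may vary between versions. The `summarize` function is a hypothetical helper.

```python
import re

# Patterns inferred from the sample RecoveryMonitor messages (assumed format)
RECOVERED = re.compile(r"Recovery of elements executed, (\d+) elements recovered\.")
PROGRESS = re.compile(r"Handled (\d+) of (\d+) elements candidates to be recovered in (\S+)")


def summarize(lines):
    """Collect recovered counts and progress records from log lines.

    Returns (recovered, progress), where recovered is a list of ints and
    progress is a list of (handled, total, elapsed) tuples. The elapsed
    value is kept as the raw ISO-8601 duration string.
    """
    recovered = []
    progress = []
    for line in lines:
        match = RECOVERED.search(line)
        if match:
            recovered.append(int(match.group(1)))
            continue
        match = PROGRESS.search(line)
        if match:
            progress.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return recovered, progress


if __name__ == "__main__":
    prefix = "INFO (internalTasksScheduler-1) org.bonitasoft.engine.tenant.restart.RecoveryMonitor "
    sample = [
        prefix + "Start detecting flow nodes to restart...",
        prefix + "Restarting elements...Handled 1000 of 12006 elements candidates to be recovered in PT0.025S",
        prefix + "Recovery of elements executed, 12006 elements recovered.",
    ]
    print(summarize(sample))
```

A sudden rise in the recovered count between two runs can point to an incident during that period.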
Metrics
New metrics are available to monitor when the recovery runs and how many elements it recovers. They can help to identify periods of time when there are incidents such as a database outage.
There are four metrics related to the recovery:
bonita.bpmengine.recovery.duration
bonita.bpmengine.recovery.execution
bonita.bpmengine.recovery.recovered.total
bonita.bpmengine.recovery.recovered.last
Here is an example of metrics published using the Prometheus publisher. For more information on how to activate this publisher, see Bonita Runtime Monitoring.
# HELP bonita_bpmengine_recovery_duration_seconds_max duration of recovery task
# TYPE bonita_bpmengine_recovery_duration_seconds_max gauge
bonita_bpmengine_recovery_duration_seconds_max{tenant="1",} 0.0
# HELP bonita_bpmengine_recovery_duration_seconds duration of recovery task
# TYPE bonita_bpmengine_recovery_duration_seconds summary
bonita_bpmengine_recovery_duration_seconds_active_count{tenant="1",} 0.0
bonita_bpmengine_recovery_duration_seconds_duration_sum{tenant="1",} 0.0
# HELP bonita_bpmengine_recovery_recovered_last_elements number of elements recovered
# TYPE bonita_bpmengine_recovery_recovered_last_elements gauge
bonita_bpmengine_recovery_recovered_last_elements{tenant="1",} 0.0
# HELP bonita_bpmengine_recovery_recovered_total_elements_total Total number of elements recovered
# TYPE bonita_bpmengine_recovery_recovered_total_elements_total counter
bonita_bpmengine_recovery_recovered_total_elements_total{tenant="1",} 39768.0
# HELP bonita_bpmengine_recovery_execution_executions_total Number of recovery executed
# TYPE bonita_bpmengine_recovery_execution_executions_total counter
bonita_bpmengine_recovery_execution_executions_total{tenant="1",} 818.0
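Published values like these can be combined into simple indicators, for example the average number of elements recovered per recovery run. The sketch below is a minimal illustration with hypothetical helper names. It assumes a single tenant, so it ignores the label set, and it handles only plain `name{labels} value` sample lines rather than the full Prometheus exposition format.

```python
import re

# Matches "metric_name{labels} value" or "metric_name value"; labels are ignored
SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)$")


def parse_samples(text):
    """Return a dict of metric name to float value, skipping comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


def average_recovered_per_run(values):
    """Total recovered elements divided by the number of recovery executions."""
    runs = values.get("bonita_bpmengine_recovery_execution_executions_total", 0.0)
    total = values.get("bonita_bpmengine_recovery_recovered_total_elements_total", 0.0)
    return total / runs if runs else 0.0


if __name__ == "__main__":
    scrape = """
# TYPE bonita_bpmengine_recovery_recovered_total_elements_total counter
bonita_bpmengine_recovery_recovered_total_elements_total{tenant="1",} 39768.0
# TYPE bonita_bpmengine_recovery_execution_executions_total counter
bonita_bpmengine_recovery_execution_executions_total{tenant="1",} 818.0
"""
    print(f"{average_recovered_per_run(parse_samples(scrape)):.1f} elements per run")
```

With the sample values above, this gives roughly 48.6 elements per run. In practice, a monitoring system would usually compute such ratios over a time window rather than since startup.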