2017-12-27 15:15:04 -04:00
|
|
|
---
|
|
|
|
layout: pattern
|
|
|
|
title: Retry
|
|
|
|
folder: retry
|
|
|
|
permalink: /patterns/retry/
|
2019-12-13 21:09:28 +02:00
|
|
|
categories: Behavioral
|
2021-05-19 10:49:05 -06:00
|
|
|
language: en
|
2017-12-27 15:15:04 -04:00
|
|
|
tags:
|
2017-12-30 10:46:14 +02:00
|
|
|
- Performance
|
2020-07-26 11:30:42 +03:00
|
|
|
- Cloud distributed
|
2017-12-27 15:15:04 -04:00
|
|
|
---
|
|
|
|
|
|
|
|
## Intent
|
2020-09-01 20:25:39 +03:00
|
|
|
|
|
|
|
Transparently retry certain operations that involve communication with external resources,
|
|
|
|
particularly over the network, isolating calling code from the retry implementation details.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
## Explanation
|
2020-09-01 20:25:39 +03:00
|
|
|
|
|
|
|
Retry pattern consists retrying operations on remote resources over the network a set number of
|
|
|
|
times. It closely depends on both business and technical requirements: How much time will the
|
|
|
|
business allow the end user to wait while the operation finishes? What are the performance
|
|
|
|
characteristics of the remote resource during peak loads as well as our application as more threads
|
|
|
|
are waiting for the remote resource's availability? Among the errors returned by the remote service,
|
|
|
|
which can be safely ignored in order to retry? Is the operation
|
|
|
|
[idempotent](https://en.wikipedia.org/wiki/Idempotence)?
|
|
|
|
|
|
|
|
Another concern is the impact on the calling code by implementing the retry mechanism. The retry
|
|
|
|
mechanics should ideally be completely transparent to the calling code (service interface remains
|
|
|
|
unaltered). There are two general approaches to this problem: From an enterprise architecture
|
|
|
|
standpoint (strategic), and a shared library standpoint (tactical).
|
|
|
|
|
|
|
|
From a strategic point of view, this would be solved by having requests redirected to a separate
|
|
|
|
intermediary system, traditionally an [ESB](https://en.wikipedia.org/wiki/Enterprise_service_bus),
|
|
|
|
but more recently a [Service Mesh](https://medium.com/microservices-in-practice/service-mesh-for-microservices-2953109a3c9a).
|
|
|
|
|
|
|
|
From a tactical point of view, this would be solved by reusing shared libraries like
|
|
|
|
[Hystrix](https://github.com/Netflix/Hystrix) (please note that Hystrix is a complete implementation
|
|
|
|
of the [Circuit Breaker](https://java-design-patterns.com/patterns/circuit-breaker/) pattern, of
|
|
|
|
which the Retry pattern can be considered a subset of). This is the type of solution showcased in
|
|
|
|
the simple example that accompanies this `README.md`.
|
2020-07-23 17:50:20 +03:00
|
|
|
|
|
|
|
Real world example
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
> Our application uses a service providing customer information. Once in a while the service seems
|
|
|
|
> to be flaky and can return errors or sometimes it just times out. To circumvent these problems we
|
|
|
|
> apply the retry pattern.
|
2020-07-23 17:50:20 +03:00
|
|
|
|
|
|
|
In plain words
|
|
|
|
|
|
|
|
> Retry pattern transparently retries failed operations over network.
|
|
|
|
|
|
|
|
[Microsoft documentation](https://docs.microsoft.com/en-us/azure/architecture/patterns/retry) says
|
2017-12-27 15:15:04 -04:00
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
> Enable an application to handle transient failures when it tries to connect to a service or
|
|
|
|
> network resource, by transparently retrying a failed operation. This can improve the stability of
|
|
|
|
> the application.
|
2020-07-23 17:50:20 +03:00
|
|
|
|
|
|
|
**Programmatic Example**
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
In our hypothetical application, we have a generic interface for all operations on remote
|
|
|
|
interfaces.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
```java
|
|
|
|
public interface BusinessOperation<T> {
|
|
|
|
T perform() throws BusinessException;
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
2020-07-23 17:50:20 +03:00
|
|
|
And we have an implementation of this interface that finds our customers by looking up a database.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
```java
|
|
|
|
public final class FindCustomer implements BusinessOperation<String> {
|
|
|
|
@Override
|
|
|
|
public String perform() throws BusinessException {
|
|
|
|
...
|
|
|
|
}
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
Our `FindCustomer` implementation can be configured to throw `BusinessException`s before returning
|
|
|
|
the customer's ID, thereby simulating a flaky service that intermittently fails. Some exceptions,
|
|
|
|
like the `CustomerNotFoundException`, are deemed to be recoverable after some hypothetical analysis
|
|
|
|
because the root cause of the error stems from "some database locking issue". However, the
|
|
|
|
`DatabaseNotAvailableException` is considered to be a definite showstopper - the application should
|
|
|
|
not attempt to recover from this error.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
We can model a recoverable scenario by instantiating `FindCustomer` like this:
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
```java
|
2020-01-04 22:13:12 +05:30
|
|
|
final var op = new FindCustomer(
|
2017-12-27 15:15:04 -04:00
|
|
|
"12345",
|
|
|
|
new CustomerNotFoundException("not found"),
|
|
|
|
new CustomerNotFoundException("still not found"),
|
|
|
|
new CustomerNotFoundException("don't give up yet!")
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
In this configuration, `FindCustomer` will throw `CustomerNotFoundException` three times, after
|
|
|
|
which it will consistently return the customer's ID (`12345`).
|
2017-12-27 15:15:04 -04:00
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
In our hypothetical scenario, our analysts indicate that this operation typically fails 2-4 times
|
|
|
|
for a given input during peak hours, and that each worker thread in the database subsystem typically
|
|
|
|
needs 50ms to "recover from an error". Applying these policies would yield something like this:
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
```java
|
2020-01-04 22:13:12 +05:30
|
|
|
final var op = new Retry<>(
|
2017-12-27 15:15:04 -04:00
|
|
|
new FindCustomer(
|
|
|
|
"1235",
|
|
|
|
new CustomerNotFoundException("not found"),
|
|
|
|
new CustomerNotFoundException("still not found"),
|
|
|
|
new CustomerNotFoundException("don't give up yet!")
|
|
|
|
),
|
|
|
|
5,
|
|
|
|
100,
|
|
|
|
e -> CustomerNotFoundException.class.isAssignableFrom(e.getClass())
|
|
|
|
);
|
|
|
|
```
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
Executing `op` once would automatically trigger at most 5 retry attempts, with a 100 millisecond
|
|
|
|
delay between attempts, ignoring any `CustomerNotFoundException` thrown while trying. In this
|
|
|
|
particular scenario, due to the configuration for `FindCustomer`, there will be 1 initial attempt
|
2017-12-27 15:15:04 -04:00
|
|
|
and 3 additional retries before finally returning the desired result `12345`.
|
|
|
|
|
2020-09-01 20:25:39 +03:00
|
|
|
If our `FindCustomer` operation were instead to throw a fatal `DatabaseNotFoundException`, which we
|
|
|
|
were instructed not to ignore, but more importantly we did not instruct our `Retry` to ignore, then
|
|
|
|
the operation would have failed immediately upon receiving the error, not matter how many attempts
|
|
|
|
were left.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
2019-12-07 20:01:13 +02:00
|
|
|
## Class diagram
|
2020-09-01 20:25:39 +03:00
|
|
|
|
2019-12-07 20:01:13 +02:00
|
|
|

|
|
|
|
|
2017-12-27 15:15:04 -04:00
|
|
|
## Applicability
|
2020-09-01 20:25:39 +03:00
|
|
|
|
|
|
|
Whenever an application needs to communicate with an external resource, particularly in a cloud
|
|
|
|
environment, and if the business requirements allow it.
|
2017-12-27 15:15:04 -04:00
|
|
|
|
|
|
|
## Consequences
|
2020-09-01 20:25:39 +03:00
|
|
|
|
2017-12-27 15:15:04 -04:00
|
|
|
**Pros:**
|
|
|
|
|
|
|
|
* Resiliency
|
|
|
|
* Provides hard data on external failures
|
|
|
|
|
|
|
|
**Cons:**
|
|
|
|
|
|
|
|
* Complexity
|
|
|
|
* Operations maintenance
|
|
|
|
|
|
|
|
## Related Patterns
|
2019-12-13 21:09:28 +02:00
|
|
|
|
2020-07-23 17:50:20 +03:00
|
|
|
* [Circuit Breaker](https://java-design-patterns.com/patterns/circuit-breaker/)
|
|
|
|
|
|
|
|
## Credits
|
|
|
|
|
|
|
|
* [Retry pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/retry)
|
|
|
|
* [Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications](https://www.amazon.com/gp/product/1621140369/ref=as_li_tl?ie=UTF8&tag=javadesignpat-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=1621140369&linkId=3e3f686af5e60a7a453b48adb286797b)
|