---
layout: pattern
title: Circuit Breaker
folder: circuit-breaker
permalink: /patterns/circuit-breaker/
categories: Behavioral
language: en
tags:
- Performance
- Decoupling
- Cloud distributed
---
## Intent
Handle costly remote service calls in such a way that the failure of a single service/component
cannot bring the whole application down, and we can reconnect to the service as soon as possible.
## Explanation
Real world example
> Imagine a web application that has both local files/images and remote services that are used for
> fetching data. These remote services may be healthy and responsive at times, or may become slow
> and unresponsive at some point due to a variety of reasons. If one of the remote services is slow
> or not responding successfully, our application keeps trying to fetch a response using multiple
> threads/processes, and soon all of them hang (also called
> [thread starvation](https://en.wikipedia.org/wiki/Starvation_(computer_science))), causing the entire web application to crash. We should be able to detect
> this situation and show the user an appropriate message so that they can explore the parts of
> the app unaffected by the remote service failure. Meanwhile, the services that are working
> normally should keep functioning, unaffected by this failure.
In plain words
> Circuit Breaker allows graceful handling of failed remote services. It's especially useful when
> all parts of our application are highly decoupled from each other, and failure of one component
> doesn't mean the other parts will stop working.
Wikipedia says
> Circuit breaker is a design pattern used in modern software development. It is used to detect
> failures and encapsulates the logic of preventing a failure from constantly recurring, during
> maintenance, temporary external system failure or unexpected system difficulties.
## Programmatic Example
So, how does this all come together? With the above example in mind, let's imitate the
functionality in a simple example. A monitoring service mimics the web app and makes both local and
remote calls.
The service architecture is as follows:
![Service Diagram](./etc/ServiceDiagram.png "Service Diagram")
In terms of code, the end user application is:
```java
public class App {

  private static final Logger LOGGER = LoggerFactory.getLogger(App.class);

  /**
   * Program entry point.
   *
   * @param args command line args
   */
  public static void main(String[] args) {
    var serverStartTime = System.nanoTime();

    var delayedService = new DelayedRemoteService(serverStartTime, 5);
    var delayedServiceCircuitBreaker = new DefaultCircuitBreaker(delayedService, 3000, 2,
        2000 * 1000 * 1000);

    var quickService = new QuickRemoteService();
    var quickServiceCircuitBreaker = new DefaultCircuitBreaker(quickService, 3000, 2,
        2000 * 1000 * 1000);

    // Create an object of monitoring service which makes both local and remote calls
    var monitoringService = new MonitoringService(delayedServiceCircuitBreaker,
        quickServiceCircuitBreaker);

    // Fetch response from local resource
    LOGGER.info(monitoringService.localResourceResponse());

    // Fetch response from delayed service 2 times, to meet the failure threshold
    LOGGER.info(monitoringService.delayedServiceResponse());
    LOGGER.info(monitoringService.delayedServiceResponse());

    // Fetch current state of delayed service circuit breaker after crossing the failure
    // threshold limit, which is OPEN now
    LOGGER.info(delayedServiceCircuitBreaker.getState());

    // Meanwhile, the delayed service is down, fetch response from the healthy quick service
    LOGGER.info(monitoringService.quickServiceResponse());
    LOGGER.info(quickServiceCircuitBreaker.getState());

    // Wait for the delayed service to become responsive
    try {
      LOGGER.info("Waiting for delayed service to become responsive");
      Thread.sleep(5000);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }

    // Check the state of delayed circuit breaker, should be HALF_OPEN
    LOGGER.info(delayedServiceCircuitBreaker.getState());

    // Fetch response from delayed service, which should be healthy by now
    LOGGER.info(monitoringService.delayedServiceResponse());

    // As successful response is fetched, it should be CLOSED again.
    LOGGER.info(delayedServiceCircuitBreaker.getState());
  }
}
```
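
The `RemoteService` interface and the two remote service implementations used by `App` are not listed in this README. Below is a minimal sketch of what they might look like, assuming `DelayedRemoteService` simply fails until a configured number of seconds has passed since the given start time and `QuickRemoteService` always responds; the exact implementations in the project may differ:

```java
// Hypothetical sketches of the remote service types referenced by App above.
// Each type would live in its own file; details may differ from the real project.
public interface RemoteService {
  String call() throws RemoteServiceException;
}

public class RemoteServiceException extends Exception {
  public RemoteServiceException(String message) {
    super(message);
  }
}

public class QuickRemoteService implements RemoteService {
  @Override
  public String call() {
    // Always healthy and fast.
    return "Quick Service is working";
  }
}

public class DelayedRemoteService implements RemoteService {

  private final long serverStartTime;
  private final int delay;

  // serverStartTime in nanoseconds (System.nanoTime()), delay in seconds.
  public DelayedRemoteService(long serverStartTime, int delay) {
    this.serverStartTime = serverStartTime;
    this.delay = delay;
  }

  @Override
  public String call() throws RemoteServiceException {
    var elapsedNanos = System.nanoTime() - serverStartTime;
    // Simulate a slow startup: fail until 'delay' seconds have passed since server start.
    if (elapsedNanos < delay * 1_000_000_000L) {
      throw new RemoteServiceException("Delayed service is down");
    }
    return "Delayed service is working";
  }
}
```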
The monitoring service:
```java
public class MonitoringService {

  private final CircuitBreaker delayedService;
  private final CircuitBreaker quickService;

  public MonitoringService(CircuitBreaker delayedService, CircuitBreaker quickService) {
    this.delayedService = delayedService;
    this.quickService = quickService;
  }

  // Assumption: Local service won't fail, no need to wrap it in circuit breaker logic
  public String localResourceResponse() {
    return "Local Service is working";
  }

  /**
   * Fetch response from the delayed service (with some simulated startup time).
   *
   * @return response string
   */
  public String delayedServiceResponse() {
    try {
      return this.delayedService.attemptRequest();
    } catch (RemoteServiceException e) {
      return e.getMessage();
    }
  }

  /**
   * Fetches response from a healthy service without any failure.
   *
   * @return response string
   */
  public String quickServiceResponse() {
    try {
      return this.quickService.attemptRequest();
    } catch (RemoteServiceException e) {
      return e.getMessage();
    }
  }
}
```
As can be seen, the monitoring service calls the local resource directly, but wraps each call to a
remote (costly) service in a circuit breaker object, which prevents faults as follows:
```java
public class DefaultCircuitBreaker implements CircuitBreaker {

  private final long timeout;
  private final long retryTimePeriod;
  private final RemoteService service;
  long lastFailureTime;
  private String lastFailureResponse;
  int failureCount;
  private final int failureThreshold;
  private State state;
  // Note the long literal: an int expression of this size would overflow
  private final long futureTime = 1000L * 1000 * 1000 * 1000;

  /**
   * Constructor to create an instance of Circuit Breaker.
   *
   * @param timeout          Timeout for the API request. Not necessary for this simple example
   * @param failureThreshold Number of failures we receive from the depended-on service before
   *                         changing state to 'OPEN'
   * @param retryTimePeriod  Time period after which a new request is made to the remote service
   *                         for a status check.
   */
  DefaultCircuitBreaker(RemoteService serviceToCall, long timeout, int failureThreshold,
      long retryTimePeriod) {
    this.service = serviceToCall;
    // We start in a closed state hoping that everything is fine
    this.state = State.CLOSED;
    this.failureThreshold = failureThreshold;
    // Timeout for the API request.
    // Used to break the calls made to remote resource if it exceeds the limit
    this.timeout = timeout;
    this.retryTimePeriod = retryTimePeriod;
    // An absurd amount of time in the future which basically indicates the last failure never happened
    this.lastFailureTime = System.nanoTime() + futureTime;
    this.failureCount = 0;
  }

  // Reset everything to defaults
  @Override
  public void recordSuccess() {
    this.failureCount = 0;
    this.lastFailureTime = System.nanoTime() + futureTime;
    this.state = State.CLOSED;
  }

  @Override
  public void recordFailure(String response) {
    failureCount = failureCount + 1;
    this.lastFailureTime = System.nanoTime();
    // Cache the failure response for returning on open state
    this.lastFailureResponse = response;
  }

  // Evaluate the current state based on failureThreshold, failureCount and lastFailureTime.
  protected void evaluateState() {
    if (failureCount >= failureThreshold) { // Then something is wrong with remote service
      if ((System.nanoTime() - lastFailureTime) > retryTimePeriod) {
        // We have waited long enough and should try checking if service is up
        state = State.HALF_OPEN;
      } else {
        // Service would still probably be down
        state = State.OPEN;
      }
    } else {
      // Everything is working fine
      state = State.CLOSED;
    }
  }

  @Override
  public String getState() {
    evaluateState();
    return state.name();
  }

  /**
   * Break the circuit beforehand if it is known the service is down, or connect the circuit
   * manually if the service comes online before expected.
   *
   * @param state State at which circuit is in
   */
  @Override
  public void setState(State state) {
    this.state = state;
    switch (state) {
      case OPEN:
        this.failureCount = failureThreshold;
        this.lastFailureTime = System.nanoTime();
        break;
      case HALF_OPEN:
        this.failureCount = failureThreshold;
        this.lastFailureTime = System.nanoTime() - retryTimePeriod;
        break;
      default:
        this.failureCount = 0;
    }
  }

  /**
   * Executes service call.
   *
   * @return Value from the remote resource, stale response or a custom exception
   */
  @Override
  public String attemptRequest() throws RemoteServiceException {
    evaluateState();
    if (state == State.OPEN) {
      // Return cached response if the circuit is in OPEN state
      return this.lastFailureResponse;
    } else {
      // Make the API request if the circuit is not OPEN
      try {
        // In a real application, this would be run in a thread and the timeout
        // parameter of the circuit breaker would be utilized to know if service
        // is working. Here, we simulate that based on server response itself
        var response = service.call();
        // Yay!! The API responded fine. Let's reset everything.
        recordSuccess();
        return response;
      } catch (RemoteServiceException ex) {
        recordFailure(ex.getMessage());
        throw ex;
      }
    }
  }
}
```
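
The `CircuitBreaker` interface and the `State` enum that `DefaultCircuitBreaker` relies on are not shown above. Here is a minimal sketch of those supporting types, inferred from how they are used in the implementation; the project's actual declarations may differ slightly:

```java
// Sketch of the supporting types assumed by DefaultCircuitBreaker above.
public enum State {
  CLOSED,
  OPEN,
  HALF_OPEN
}

public interface CircuitBreaker {

  // Success and failure bookkeeping used by attemptRequest().
  void recordSuccess();

  void recordFailure(String response);

  // Current state as a string, re-evaluated on every call.
  String getState();

  // Manually trip or reset the circuit.
  void setState(State state);

  // Make the wrapped remote call, or short-circuit it when the breaker is OPEN.
  String attemptRequest() throws RemoteServiceException;
}
```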
How does the above pattern prevent failures? Let's walk through the finite state machine it
implements.
![State Diagram](./etc/StateDiagram.png "State Diagram")
- We initialize the Circuit Breaker object with certain parameters: `timeout`, `failureThreshold` and `retryTimePeriod`, which determine how resilient the API is.
- Initially, we are in the `closed` state and no remote calls to the API have occurred.
- Every time a call succeeds, we reset the state to as it was in the beginning.
- If the number of failures crosses a certain threshold, we move to the `open` state, which acts just like an open circuit and prevents remote service calls from being made, thus saving resources. (Here, the cached last failure response is returned as a stale response instead of calling the API.)
- Once the retry timeout period has elapsed, we move to the `half-open` state and let a call through to the remote service again to check whether it is working, so that we can serve fresh content. A failure sets the breaker back to the `open` state and another attempt is made after the retry timeout period, while a success sets it to the `closed` state so that everything starts working normally again. The sketch below shows these transitions being driven manually via `setState`.
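
The `setState` method also lets client code trip or reset the circuit by hand, for example when the remote service is already known to be down. Below is a minimal sketch of those transitions, reusing the `DefaultCircuitBreaker`, `QuickRemoteService`, and `State` types shown above; the `ManualTripDemo` class is hypothetical and not part of the original example:

```java
// Hypothetical demo, placed in the same package as DefaultCircuitBreaker.
public class ManualTripDemo {

  public static void main(String[] args) throws Exception {
    // Wrap an always-healthy service; retryTimePeriod is 2 seconds (in nanoseconds).
    var breaker = new DefaultCircuitBreaker(new QuickRemoteService(), 3000, 2,
        2000 * 1000 * 1000);

    System.out.println(breaker.getState()); // CLOSED: no failures recorded yet

    // Trip the circuit manually, e.g. when we already know the service is down.
    breaker.setState(State.OPEN);
    System.out.println(breaker.getState()); // OPEN: failure count bumped to the threshold

    // Once the retry time period (2 seconds here) has elapsed, the breaker re-evaluates
    // to HALF_OPEN and lets the next request through as a probe.
    Thread.sleep(2100);
    System.out.println(breaker.getState()); // HALF_OPEN

    // The probe succeeds against the healthy quick service, so the breaker closes again.
    System.out.println(breaker.attemptRequest());
    System.out.println(breaker.getState()); // CLOSED
  }
}
```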
## Class diagram
![Circuit Breaker class diagram](./etc/circuit-breaker.urm.png "Circuit Breaker class diagram")
## Applicability
Use the Circuit Breaker pattern when
- Building a fault-tolerant application where failure of some services shouldn't bring the entire application down.
- Building a continuously running (always-on) application, so that its components can be upgraded without shutting it down entirely.
## Related Patterns
- [Retry Pattern](https://github.com/iluwatar/java-design-patterns/tree/master/retry)
## Real world examples
* [Spring Circuit Breaker module](https://spring.io/guides/gs/circuit-breaker)
* [Netflix Hystrix API](https://github.com/Netflix/Hystrix)
## Credits
* [Understanding Circuit Breaker Pattern](https://itnext.io/understand-circuitbreaker-design-pattern-with-simple-practical-example-92a752615b42)
* [Martin Fowler on Circuit Breaker](https://martinfowler.com/bliki/CircuitBreaker.html)
* [Fault tolerance in a high volume, distributed system](https://medium.com/netflix-techblog/fault-tolerance-in-a-high-volume-distributed-system-91ab4faae74a)
* [Circuit Breaker pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker)