Feign, Hystrix, Ribbon and Eureka are great tools, all nicely packed in Spring Cloud, that let us achieve great resilience in our massively distributed applications with such great ease!!! This is true, at least up to the "ease" part... To be honest, it is easier to get all the great resilience patterns working together with those tools than without, but making everything work as intended takes some studying, time and testing.
Unfortunately (or not), I'm not going to explain how to set all this up here; I'll just point out some tricks for error management with those tools. I chose this topic because I've struggled a lot with it (really)!!!
If you are looking for a getting-started tutorial on those tools, I recommend the following articles:
- Feign, encore un client HTTP ? (French)
- The Spring Cloud documentation
- The source code because we always end up there...
There will be code in this article, but not that much; you can find the missing parts in this repository.
Dependencies
Let's say, after some trouble, you ended up with a dependency set looking like this one:
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-eureka</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-hystrix</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-feign</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-ribbon</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
</dependency>
Ok, so you are aiming at the full set:
- Of course, you are going to use Eureka client to get your service instances from your Eureka server
- So Ribbon can provide a proper client-side load-balancer using service names and not URLs (and decorate RestTemplate to use names and load-balancing)
- Then comes Hystrix with lots of built-in anti-fragile patterns, another awesome tool but you need to keep an eye on it (not part of this article...)
- Finally, everything is packed up by Feign for really easy-to-write REST clients
This article uses the following versions of Spring Boot and Spring Cloud:
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.13.RELEASE</version>
</parent>
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>Edgware.SR3</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
Configuration
These tools need configuration. Let's assume you have set up something similar in your application.yml:
spring:
  application:
    name: my-awesome-app
eureka:
  client:
    serviceUrl:
      defaultZone: http://my-eureka-instance:port/eureka/
feign:
  hystrix:
    enabled: true
hystrix:
  threadpool:
    default:
      coreSize: 15
  command:
    default:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 2000
ribbon:
  ReadTimeout: 400
  ConnectTimeout: 100
  OkToRetryOnAllOperations: true
  MaxAutoRetries: 1
  MaxAutoRetriesNextServer: 1
This configuration will work if your application can register to Eureka using its hostname and application port. For production / cloud / any environment with proxies, you need additional properties:
- eureka.instance.hostname with the real hostname to use to reach your service
- eureka.instance.nonSecurePort with the non-secure port to use, or eureka.instance.securePort with eureka.instance.securePortEnabled=true
Also, this configuration isn't authenticated; it can be a good idea to add authentication to Eureka, depending on your network.
From the Ribbon configuration I see you have confidence in your web services: 400ms for a ReadTimeout is quite short, and the shorter the better!
We can also notice that all your services are idempotent, because you accept having 4 calls instead of 1 if your network / servers start to get messy. Yes, this Ribbon configuration will make up to 4 requests if the response times out, because it is actually doing: (1 + MaxAutoRetries) x (1 + MaxAutoRetriesNextServer) = 4. So if you set 2 and 3 respectively, you will have up to 12 requests from Ribbon alone.
This gets us to the 2000ms Hystrix timeout: a shorter value would let Hystrix time out while Ribbon is still sending retries, so requests would be made without the application waiting for their result. 2000ms therefore seems legit given the Ribbon configuration: (400 + 100) x 4 = 2000.
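If you want to double-check this arithmetic for your own values, here is a minimal sketch in plain Java. RibbonBudget and its method names are hypothetical helpers made up for this article, not part of Ribbon; it only reproduces the two formulas above.

```java
// Sketch only: RibbonBudget is a hypothetical helper, not a Ribbon class.
final class RibbonBudget {

    // Worst-case number of HTTP requests Ribbon can send for one logical call:
    // (1 + MaxAutoRetries) x (1 + MaxAutoRetriesNextServer).
    static int maxAttempts(int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (1 + maxAutoRetries) * (1 + maxAutoRetriesNextServer);
    }

    // Worst-case time spent if every attempt uses its full connect and read timeout.
    static long worstCaseMillis(long connectTimeout, long readTimeout,
                                int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (connectTimeout + readTimeout)
                * maxAttempts(maxAutoRetries, maxAutoRetriesNextServer);
    }

    public static void main(String[] args) {
        System.out.println(maxAttempts(1, 1));               // prints 4
        System.out.println(worstCaseMillis(100, 400, 1, 1)); // prints 2000
        System.out.println(maxAttempts(2, 3));               // prints 12
    }
}
```

The Hystrix timeout should be at least the value returned by worstCaseMillis, otherwise Hystrix gives up before Ribbon has finished retrying.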
Customization
Everything goes well, and you quickly understand that, for all FeignClients without a fallback, you only get a HystrixRuntimeException for any error. This exception is mainly saying that something went wrong and you don't have a fallback, but its cause can tell you a little bit more. You quickly build an ExceptionHandler to display nicer messages to users (because you don't want to put fallbacks on all FeignClients).
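To give an idea of what "the cause can tell you a little bit more" means in practice, here is a minimal sketch of the unwrapping part such a handler could do. RootCause is a hypothetical helper, and a plain RuntimeException stands in for HystrixRuntimeException so that the snippet runs without Hystrix on the classpath.

```java
// Sketch only: RootCause is a hypothetical helper, not part of Hystrix or Spring.
final class RootCause {

    // Walks the cause chain and returns the deepest Throwable.
    static Throwable of(Throwable error) {
        Throwable current = error;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Builds a short message from the root cause instead of the generic wrapper.
    static String userMessage(Throwable error) {
        Throwable root = of(error);
        return root.getClass().getSimpleName() + ": " + root.getMessage();
    }

    public static void main(String[] args) {
        // The RuntimeException plays the role of the wrapping exception here.
        Throwable wrapped = new RuntimeException("no fallback available",
                new java.net.SocketTimeoutException("Read timed out"));
        System.out.println(userMessage(wrapped)); // prints SocketTimeoutException: Read timed out
    }
}
```

In a real ExceptionHandler you would probably map the root cause type to a user-facing message rather than expose the class name, but the unwrapping step stays the same.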
One day you call a new external service, and this service can return normal responses with HTTP 404 for some resources, so you add decode404 = true to your @FeignClient to get a response and avoid circuit breaking on those (if this option is not set, a 404 will be counted for circuit breaking). But you don't get responses; what you get is:
...
Caused by: feign.codec.DecodeException: Could not extract response: no suitable HttpMessageConverter found for response type [class ...
...
This is because the 404 from this service has a different form than "normal" responses (it can be a simple String saying that the resource wasn't found). A cool idea here would be to allow Optional<?> and ResponseEntity<?> types in FeignClient to get an empty body for those 404s.
Auto-configured Spring Cloud Feign can map to ResponseEntity<?> but will fail to deserialize incompatible objects. It cannot, by default, put results in Optional<?>, so it is still a cool feature to implement.
One way to achieve this is to define a Decoder similar to this:
package fr.ippon.feign;

import java.io.IOException;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;

import org.springframework.http.ResponseEntity;
import org.springframework.util.Assert;

import feign.FeignException;
import feign.Response;
import feign.Util;
import feign.codec.DecodeException;
import feign.codec.Decoder;

public class NotFoundAwareDecoder implements Decoder {

    private final Decoder delegate;

    public NotFoundAwareDecoder(Decoder delegate) {
        Assert.notNull(delegate, "Can't build this decoder with a null delegated decoder");
        this.delegate = delegate;
    }

    @Override
    public Object decode(Response response, Type type) throws IOException, DecodeException, FeignException {
        // Only Optional<?> and ResponseEntity<?> get special handling, everything else is delegated
        if (!(type instanceof ParameterizedType)) {
            return delegate.decode(response, type);
        }

        if (isParameterizedTypeOf(type, Optional.class)) {
            return decodeOptional(response, type);
        }

        if (isParameterizedTypeOf(type, ResponseEntity.class)) {
            return decodeResponseEntity(response, type);
        }

        return delegate.decode(response, type);
    }

    private boolean isParameterizedTypeOf(Type type, Class<?> clazz) {
        ParameterizedType parameterizedType = (ParameterizedType) type;
        return parameterizedType.getRawType().equals(clazz);
    }

    private Object decodeOptional(Response response, Type type) throws IOException {
        // A 404 becomes an empty Optional, the body is not read
        if (response.status() == 404) {
            return Optional.empty();
        }

        // Decode the type enclosed in the Optional, then wrap the result
        Type enclosedType = Util.resolveLastTypeParameter(type, Optional.class);
        Object decodedValue = delegate.decode(response, enclosedType);

        if (decodedValue == null) {
            return Optional.empty();
        }

        return Optional.of(decodedValue);
    }

    private Object decodeResponseEntity(Response response, Type type) throws IOException {
        // A 404 becomes a body-less ResponseEntity with the 404 status
        if (response.status() == 404) {
            return ResponseEntity.notFound().build();
        }

        return delegate.decode(response, type);
    }
}
Then, a @Configuration class:
package fr.ippon.feign;

import org.springframework.beans.factory.ObjectFactory;
import org.springframework.boot.autoconfigure.web.HttpMessageConverters;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.netflix.feign.EnableFeignClients;
import org.springframework.cloud.netflix.feign.support.ResponseEntityDecoder;
import org.springframework.cloud.netflix.feign.support.SpringDecoder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;

import feign.codec.Decoder;

@Configuration
@EnableCircuitBreaker
@EnableDiscoveryClient
public class FeignConfiguration {

    @Bean
    public Decoder notFoundAwareDecoder(ObjectFactory<HttpMessageConverters> messageConverters) {
        return new NotFoundAwareDecoder(new ResponseEntityDecoder(new SpringDecoder(messageConverters)));
    }
}
Of course it is up to you to fit it to your exact needs, but this way you will be able to get proper responses.
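If the reflection part of the decoder looks a bit magical, here is a small standalone sketch of the same detection using only the JDK. ReturnTypeInspection and SampleClient are hypothetical names made up for this illustration, and getActualTypeArguments() stands in for Feign's Util.resolveLastTypeParameter so that the snippet runs without Feign.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;

// Sketch only: shows how a decoder can recognise Optional<T> return types.
final class ReturnTypeInspection {

    // Hypothetical client interface, used only to obtain generic return types.
    interface SampleClient {
        Optional<String> findName(long id);

        String getName(long id);
    }

    // Same check as in the decoder, with the instanceof test folded in.
    static boolean isParameterizedTypeOf(Type type, Class<?> clazz) {
        return type instanceof ParameterizedType
                && ((ParameterizedType) type).getRawType().equals(clazz);
    }

    // Returns the T of a single-parameter generic type such as Optional<T>.
    static Type firstTypeArgument(Type type) {
        return ((ParameterizedType) type).getActualTypeArguments()[0];
    }

    // Looks up the generic return type of a SampleClient method taking one long.
    static Type returnTypeOf(String methodName) {
        try {
            return SampleClient.class.getMethod(methodName, long.class).getGenericReturnType();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Unknown method " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Type optionalReturn = returnTypeOf("findName");
        System.out.println(isParameterizedTypeOf(optionalReturn, Optional.class));         // prints true
        System.out.println(firstTypeArgument(optionalReturn).getTypeName());               // prints java.lang.String
        System.out.println(isParameterizedTypeOf(returnTypeOf("getName"), Optional.class)); // prints false
    }
}
```

The enclosed type is what gets handed to the delegate decoder, which is why the Optional wrapper never reaches the message converters.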
Integration testing
All this really cool stuff can change from one Spring Cloud minor version to another (e.g. Hystrix enabled by default to Hystrix disabled by default), so unless you never miss an update (I don't think that is possible), I strongly recommend adding good integration tests for your usage of this stack (unit tests will not be of any help here).
But integration testing this stack can be quite complicated. If we want to be as close as possible to reality we need:
- A running Eureka instance.
- A running service registered on Eureka.
- A running client using this service.
One way to do this is to set up a dynamic test environment with Eureka and some applications but, depending on your organization, this can be really hard to achieve. Another way is to start all this in a single JVM managed by JUnit, so integration with any build tool and CI platform will be really easy.
The drawback of this can be strange behaviors due to the Spring auto-configuration mechanism. It's up to you to choose between containers and this approach, depending on what you can do.
To achieve this we will need to solve:
- The fact that we cannot use the native Spring Test classes because they can only manage one application by default. We can work around this by using SpringApplication.run(...) and playing with the resulting ConfigurableApplicationContext.
- The need to start on available ports. Simply add --server.port in SpringApplication.run(...) with SocketUtils.findAvailableTcpPort(), not even a problem.
- The impossibility to use any kind of default configuration path unless we want all our apps to get this configuration. This one is also easy: just add --spring.config.location with a specific configuration in our SpringApplication.run(...) and we can have separate configurations.
- The need for our applications to have configurations depending on the Eureka server port. For this one we will need to ensure that Eureka is the first one to start (not needed for production, our client can handle this very well, but it would be annoying for tests) and then give the Eureka port one way or another to the other applications.
- The fact that we can't, by default, start multiple Spring Boot applications on the same JVM instance because of the JMX MBean name. Let's disable it using --spring.jmx.enabled=false (or change the default domain using --spring.jmx.default-domain with a different name) and we are OK.
- Finally, a strange one: you know that Spring Cloud tools use Archaius to manage their configuration, not the default Spring configuration system. Archaius takes Spring Boot configuration into account when the first application starts on the JVM; for the next ones it isn't taken into account at the moment I'm writing this (check ArchaiusAutoConfiguration.configureArchaius(...): there is a static AtomicBoolean used to ensure that the configuration isn't loaded twice, and in the "else" there is a TODO and a warn log). For our tests we will go for an ugly fix: reloading this configuration in an ApplicationListener<ApplicationReadyEvent> will do the trick.
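As a rough illustration of the command-line part of this list, here is a minimal sketch that builds the arguments for one application. TestAppArguments is a hypothetical helper; it uses a plain ServerSocket on port 0 as a JDK stand-in for SocketUtils.findAvailableTcpPort(), and passing the Eureka URL through --eureka.client.serviceUrl.defaultZone is only one assumed way of handing the Eureka port to the other applications.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.util.ArrayList;
import java.util.List;

// Sketch only: TestAppArguments is a hypothetical helper for the test setup.
final class TestAppArguments {

    // Asks the OS for a free port by binding to port 0, then releases it.
    static int findAvailablePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the arguments to hand to SpringApplication.run(...) for one application.
    static String[] forApplication(int serverPort, String configLocation, int eurekaPort) {
        List<String> args = new ArrayList<>();
        args.add("--server.port=" + serverPort);
        args.add("--spring.config.location=" + configLocation);
        // Avoids the JMX MBean name clash between applications in the same JVM
        args.add("--spring.jmx.enabled=false");
        // Assumption: the Eureka port is passed by overriding the defaultZone property
        args.add("--eureka.client.serviceUrl.defaultZone=http://localhost:" + eurekaPort + "/eureka/");
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        int eurekaPort = findAvailablePort();
        for (String arg : forApplication(findAvailablePort(), "classpath:/client.yml", eurekaPort)) {
            System.out.println(arg);
        }
    }
}
```

The resulting array is what you would pass as the second parameter of SpringApplication.run(...), starting Eureka first so that its port is known. The classpath:/client.yml location is just a placeholder.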
I have done this here, mainly using JUnit Rules to handle the application parts. Feel free to take it if you like it and adapt those tests to your needs.
At the time of this writing, the project takes ~45sec to build, which is very slow considering that most of this time is spent on integration tests of already battle-tested code... but I really don't want to miss a breaking change in my usage of this great stack, so I consider this time to be fair enough.
If you don't need it, remove the part testing circuit breaking on all HTTP error codes, since those tests are very slow due to the sleeping phase…
Once again, really take the time to make strong integration tests on your usage of this stack to avoid really bad surprises after some months!!!
Going further
Depending on what you want to build, what we have here can be more than enough on the application side, but if you are planning to use this in the real world, you really need some good metrics and alerts (at least to keep an eye on your fallbacks and circuit breaker openings).
For this, you can check Hystrix Dashboard and Turbine, which provide lots of useful metrics and dashboards.
You will then need to bind it to your alerting system. This will need some work, and you are going to need to handle LOTS of data since those tools are really verbose (if you want to persist that data, pay attention to your eviction strategy and choose a solid enough time-series infrastructure). Depending on your needs and your organization's tools, a simple metrics Counter on your fallbacks can do a good job. Once set up in your applications, this only needs a @Counted(...) on your fallback methods.
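To show what such a counter boils down to, here is a minimal JDK-only sketch. FallbackCounter is a hypothetical class written for this article, not the metrics library behind @Counted(...); a real setup would publish the value to your metrics system instead of keeping it in memory.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Sketch only: FallbackCounter is a hypothetical stand-in for a metrics Counter.
final class FallbackCounter {

    private final LongAdder count = new LongAdder();

    // Counts the call, then returns whatever the fallback produces.
    <T> T record(Supplier<T> fallback) {
        count.increment();
        return fallback.get();
    }

    // Number of fallbacks recorded so far.
    long total() {
        return count.sum();
    }

    public static void main(String[] args) {
        FallbackCounter counter = new FallbackCounter();
        String value = counter.record(() -> "default value");
        System.out.println(value + " / " + counter.total()); // prints default value / 1
    }
}
```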
It is also possible that the few tools discussed here are not antifragile enough for your needs. In that case, you can start by checking:
- Hystrix configurations: you will see that there are plenty of things you can do (playing with circuit breaker configuration can really help in some cases). Don't forget to add integration tests to ensure that the configuration you are adding is really behaving as expected.
- Feign retries: I totally skipped this part, but there is a built-in retry mechanism in Spring Cloud Feign coming on top of the Hystrix and Ribbon mechanisms. You can check Retryer.Default to see the default retry strategy, but this is kind of misleading in two ways:
  - First: if you have Hystrix Feign enabled, the default retryer is Retryer.NEVER_RETRY (check FeignClientsConfiguration.feignRetryer())
  - Second: even if you define a Retryer bean as Retryer.Default, you won't get Feign-level retries by default, because it is also important to check ErrorDecoder.Default to see that we have a RetryableException only when there is a well-formatted date in the Retry-After HTTP header.
  So if you want to play with this you will need to:
  - define an ErrorDecoder that ends up in RetryableException in the cases you want (or add the Retry-After header in your services).
  - change the Retryer to the one actually retrying.
  - probably redefine the Feign.Builder bean (be careful to keep the @Scope("prototype")) to suit your needs.
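The "cases you want" part deserves some thought before you write the ErrorDecoder. Here is a minimal sketch of one possible decision rule, kept free of Feign types so it runs on its own. RetryPolicy, the list of idempotent methods and the list of transient statuses are all assumptions for this example; your own ErrorDecoder would call something like shouldRetry(...) before building a RetryableException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Sketch only: RetryPolicy is a hypothetical rule, not part of Feign.
final class RetryPolicy {

    // Assumption: only these HTTP methods are safe to replay.
    private static final Set<String> IDEMPOTENT_METHODS =
            new HashSet<>(Arrays.asList("GET", "HEAD", "PUT", "DELETE", "OPTIONS"));

    // Assumption: only these statuses are considered transient.
    private static final Set<Integer> TRANSIENT_STATUSES =
            new HashSet<>(Arrays.asList(502, 503, 504));

    // True only for an idempotent method that failed with a transient status.
    static boolean shouldRetry(String httpMethod, int status) {
        return IDEMPOTENT_METHODS.contains(httpMethod.toUpperCase(Locale.ROOT))
                && TRANSIENT_STATUSES.contains(status);
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry("GET", 503));  // prints true
        System.out.println(shouldRetry("POST", 503)); // prints false
        System.out.println(shouldRetry("GET", 404));  // prints false
    }
}
```

Keeping this rule in one small class also makes it easy to cover with a plain unit test, next to the integration tests recommended above.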
So, do we go live?
This stack really is great and every developer using it daily will enjoy it, at least after someone on the team spends days setting all this up and checking some “vital” points:
- avoid retries on POST, PATCH and any non-idempotent services (should follow this)
- ensure that fallback calls and opened circuit breakers are tracked and explained
- ensure that Eureka is secured and not a SPOF (even without Eureka up and running the apps can talk to each other, at least for a fair amount of time)
- ensure, with strong integration tests, that some minor version change will not silently break all this anti-fragile stuff.
In my opinion, this is a really great stack that needs a lot of work and understanding. So make sure to use it only if you need it, and otherwise stick to RestTemplate until you have time to give it a good try!