Nugget Friday - Building Resilient Microservices with MicroProfile Fault Tolerance
Published on 08 Nov 2024
by Luqman Saeed
While we all want maximum uptime for our software systems, failures and downtimes are inevitable. However, these can be minimized and quickly resolved through comprehensive, robust and well-designed fault tolerance mechanisms. This Nugget Friday looks at how to leverage MicroProfile Fault Tolerance. We look at how it operates and explore how its features can help developers address common failure scenarios. Let's dig in!
The Problem
Network glitches, service outages, and resource exhaustion can cause cascading failures that bring down entire systems in the most unexpected of ways. Without proper fault tolerance mechanisms, a single failing component can create a domino effect of failures across your application. Traditional error handling mechanisms, such as try-catch blocks, aren't sufficient for building truly resilient, fault-tolerant systems, especially complex, distributed ones. Keeping such applications running smoothly calls for a more advanced and systematic approach to fault tolerance.
The Solution: MicroProfile Fault Tolerance
MicroProfile Fault Tolerance provides an annotation-driven approach to building resilient software applications. It offers a suite of powerful annotations that implement common fault tolerance patterns, making it easy to handle failures gracefully and prevent system-wide catastrophic outages.
Let's look at the core features of MicroProfile Fault Tolerance.
1. Circuit Breaker Pattern (@CircuitBreaker)
Think of a circuit breaker like its electrical counterpart - it prevents system overload by "tripping" when too many failures occur. Using it is as simple as:
@CircuitBreaker(requestVolumeThreshold = 4,
failureRatio = 0.5,
delay = 1000,
successThreshold = 2)
public Connection getDatabaseConnection() {
return connectionPool.getConnection();
}
This circuit breaker will:
- Monitor the last 4 requests (requestVolumeThreshold)
- Open if 50% or more requests fail (failureRatio)
- Stay open for 1 second (delay)
- Require 2 successful test requests to close again (successThreshold)
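To make the state transitions above concrete, here is a minimal plain-Java sketch of the same idea. It is not how a MicroProfile runtime implements @CircuitBreaker; the class SimpleCircuitBreaker and its methods are hypothetical names used only for illustration, and the sketch is single-threaded and only counts unchecked exceptions as failures.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical, single-threaded circuit breaker used only to illustrate the pattern. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final double failureRatio;
    private final long delayMillis;
    private final int successThreshold;

    // Rolling window of recent outcomes; true means the call failed.
    private final Deque<Boolean> window = new ArrayDeque<>();
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenSuccesses;

    SimpleCircuitBreaker(int requestVolumeThreshold, double failureRatio,
                         long delayMillis, int successThreshold) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.failureRatio = failureRatio;
        this.delayMillis = delayMillis;
        this.successThreshold = successThreshold;
    }

    State state() {
        return state;
    }

    <T> T call(Supplier<T> action) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < delayMillis) {
                // Still inside the delay: fail fast without invoking the action.
                throw new IllegalStateException("circuit is open");
            }
            // Delay elapsed: let trial calls through.
            state = State.HALF_OPEN;
            halfOpenSuccesses = 0;
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) {
                open(); // any failed trial call re-opens the circuit
            } else if (++halfOpenSuccesses >= successThreshold) {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > requestVolumeThreshold) {
            window.removeFirst();
        }
        // Only evaluate the ratio once the window is full.
        if (window.size() == requestVolumeThreshold) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / requestVolumeThreshold >= failureRatio) {
                open();
            }
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = System.currentTimeMillis();
        window.clear();
    }
}
```

With the values from the annotation above (4, 0.5, 1000, 2), two failures within the last four calls would open this sketch's circuit, and two successful trial calls after the delay would close it again.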
2. Retry Pattern (@Retry)
When dealing with transient failures, sometimes all you need is to try again:
@Retry(maxRetries = 3,
delay = 200,
jitter = 100,
retryOn = {SQLException.class, TimeoutException.class})
public List<Customer> getCustomers() {
return customerService.fetchCustomers();
}
This configuration will:
- Attempt the operation up to 3 additional times
- Wait 200ms between retries
- Add random jitter of ±100ms to prevent thundering herd problems
- Only retry on specific exceptions
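The behaviour in the list above can be sketched by hand in plain Java. This is an illustrative approximation, not the MicroProfile implementation: SimpleRetry and withRetry are hypothetical names, and because the sketch uses Supplier it only handles unchecked exceptions (a checked exception such as SQLException would need a Callable-based variant).

```java
import java.util.Collection;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical helper that retries an action with a fixed delay plus random jitter. */
class SimpleRetry {

    static <T> T withRetry(Supplier<T> action, int maxRetries, long delayMillis, long jitterMillis,
                           Collection<? extends Class<? extends RuntimeException>> retryOn) {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                boolean retryable = retryOn.stream().anyMatch(type -> type.isInstance(e));
                // Give up on non-retryable exceptions or when retries are exhausted.
                if (!retryable || attempt >= maxRetries) {
                    throw e;
                }
                // Jitter is drawn uniformly from [-jitterMillis, +jitterMillis].
                long jitter = ThreadLocalRandom.current().nextLong(-jitterMillis, jitterMillis + 1);
                sleep(Math.max(0, delayMillis + jitter));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

Note that maxRetries counts retries, not attempts: with maxRetries = 3 the action runs at most four times in total.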
3. Timeout Pattern (@Timeout)
Never let your operations hang indefinitely:
@Timeout(value = 500, unit = ChronoUnit.MILLIS)
public Weather getWeatherData() {
return weatherService.getCurrentConditions();
}
This ensures the operation will fail fast if it takes longer than 500 milliseconds.
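For comparison, a hand-written version of the same fail-fast idea might look like the sketch below. It is only an approximation under stated assumptions: SimpleTimeout and OperationTimedOutException are hypothetical names, the work runs on the common ForkJoinPool, and the caller stops waiting after the deadline but the underlying task is not interrupted.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical unchecked exception signalling that the deadline was exceeded. */
class OperationTimedOutException extends RuntimeException {
    OperationTimedOutException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper that bounds how long the caller waits for an action. */
class SimpleTimeout {

    static <T> T withTimeout(Supplier<T> action, long timeoutMillis) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(action);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; the worker thread itself keeps running.
            future.cancel(true);
            throw new OperationTimedOutException("timed out after " + timeoutMillis + " ms", e);
        } catch (ExecutionException e) {
            // Unwrap failures thrown by the action itself.
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new IllegalStateException(cause);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```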
4. Bulkhead Pattern (@Bulkhead)
Isolate failures by limiting concurrent executions:
@Bulkhead(value = 5, waitingTaskQueue = 8)
@Asynchronous
public Future<Response> serviceCall() {
// Service implementation
}
This configuration:
- Limits concurrent executions to 5
- Maintains a waiting queue of up to 8 tasks
- When used with @Asynchronous, implements thread pool isolation
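The thread pool style of isolation described above can be approximated with a bounded java.util.concurrent.ThreadPoolExecutor. The sketch below is an assumption about how one might model it by hand, not the MicroProfile implementation; ThreadPoolBulkhead is a hypothetical name, and overflow surfaces here as a RejectedExecutionException rather than the exception a MicroProfile runtime would throw.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical bulkhead: a fixed number of workers plus a bounded waiting queue. */
class ThreadPoolBulkhead {
    private final ThreadPoolExecutor pool;

    ThreadPoolBulkhead(int maxConcurrent, int waitingTaskQueue) {
        this.pool = new ThreadPoolExecutor(
                maxConcurrent, maxConcurrent,               // fixed pool size
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(waitingTaskQueue), // bounded waiting queue
                new ThreadPoolExecutor.AbortPolicy());      // reject when both are full
    }

    /** Throws RejectedExecutionException when all workers are busy and the queue is full. */
    <T> Future<T> submit(Callable<T> task) {
        return pool.submit(task);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

With new ThreadPoolBulkhead(5, 8), up to 5 tasks run concurrently, up to 8 more wait in the queue, and any further submission is rejected immediately, so a slow dependency cannot tie up more than 5 threads.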
5. Fallback Pattern (@Fallback)
Always have a Plan B:
@Fallback(fallbackMethod = "getCachedCustomers")
public List<Customer> getCustomers() {
return customerService.fetchCustomers();
}
private List<Customer> getCachedCustomers() {
return customerCache.getCustomers();
}
When the primary method fails, the fallback method provides an alternative solution.
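Stripped of the annotation, the pattern amounts to catching the failure and delegating to an alternative. The sketch below shows that shape in plain Java; SimpleFallback and withFallback are hypothetical names, and only unchecked exceptions are treated as failures here.

```java
import java.util.function.Supplier;

/** Hypothetical helper: try the primary action, use the fallback if it throws. */
class SimpleFallback {

    static <T> T withFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // The primary failed, so answer from the alternative source instead.
            return fallback.get();
        }
    }
}
```

The annotation form keeps this try-catch out of the business method, which is the main readability gain of the declarative approach.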
Why You Should Care
Building resilient applications is essential for modern distributed systems, and MicroProfile Fault Tolerance provides a powerful, simplified way to achieve this. The features discussed above address common failure points in distributed architectures, helping developers create reliable, failure-resistant applications with very little boilerplate. More precisely, they offer:
- Simplified Resilience: These patterns are battle-tested solutions to common distributed system problems. Having them available as simple annotations makes it easy to build reliable, resilient applications.
- Declarative Approach: The annotation-based approach separates fault tolerance concerns from business logic, making code cleaner and more maintainable.
- Configurability: All aspects can be configured via MicroProfile Config, allowing for environment-specific tuning without code changes.
- Metrics Integration: When used with MicroProfile Metrics, you get automatic monitoring of your fault tolerance mechanisms, helping you understand system behaviour.
Advanced Features of MicroProfile Fault Tolerance
Combining Annotations
The real power of MicroProfile Fault Tolerance comes from combining these patterns. For example:
@Retry(maxRetries = 2)
@Timeout(500)
@CircuitBreaker(requestVolumeThreshold = 4, failureRatio = 0.5)
@Fallback(fallbackMethod = "getBackupData")
public Data getServiceData() {
return service.getData();
}
This creates a robust operation that:
- Times out each attempt after 500ms
- Retries up to 2 times
- Opens a circuit breaker if too many failures occur
- Falls back to a backup method if all else fails
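As a rough mental model of how these layers nest, the plain-Java sketch below wraps a call in a per-attempt timeout, then a retry loop, then a fallback. It assumes the nesting described in the MicroProfile Fault Tolerance specification (fallback outermost, timeout applied to each attempt) and leaves the circuit breaker out for brevity; ResilientCall, runOnce and call are hypothetical names, not part of any API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical composition of timeout, retry and fallback around one call. */
class ResilientCall {

    /** Innermost layer: one attempt, bounded by a timeout. */
    static <T> T runOnce(Supplier<T> action, long timeoutMillis) {
        try {
            return CompletableFuture.supplyAsync(action).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            throw new IllegalStateException("attempt failed", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
    }

    /** Outer layers: retry around the timed attempt, fallback around the retries. */
    static <T> T call(Supplier<T> action, int maxRetries, long timeoutMillis, Supplier<T> fallback) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return runOnce(action, timeoutMillis);
            } catch (IllegalStateException e) {
                // This attempt failed or timed out; loop to retry if any retries remain.
            }
        }
        // Every attempt failed: fall back.
        return fallback.get();
    }
}
```

With maxRetries = 2 and a 500ms timeout, the action is attempted up to three times, each attempt is limited to 500ms, and the fallback runs only after the last attempt fails.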
Configuration Override
One of the most powerful features of MicroProfile Fault Tolerance is its flexible configuration system. All fault tolerance parameters can be overridden without changing code, following a well-defined precedence order. How so? Let's take a look.
1. Method-Level Override
The most specific configuration targets a particular method in a specific class:
# Format: <classname>/<methodname>/<annotation>/<parameter>
com.example.MyService/getCustomers/Timeout/value=2000
This would override the timeout value specifically for the getCustomers method in the MyService class.
2. Class-Level Override
You can configure all occurrences of an annotation within a specific class:
# Format: <classname>/<annotation>/<parameter>
com.example.MyService/Retry/maxRetries=3
com.example.MyService/CircuitBreaker/delay=2000
This applies to the @Retry and @CircuitBreaker annotations declared at class level on MyService, and therefore to every method that inherits them.
3. Global Override
For application-wide configuration, you can set parameters globally:
# Format: <annotation>/<parameter>
Retry/maxRetries=2
Timeout/value=1000
CircuitBreaker/failureRatio=0.75
These settings apply to all uses of these annotations across your application unless overridden by more specific configurations.
Configuration Precedence
When multiple configurations exist, they follow this precedence order (highest to lowest):
- Method-level configuration
- Class-level configuration
- Global configuration
- Annotation values in code
For example:
@Retry(maxRetries = 3)
public class MyService {
@Retry(maxRetries = 5)
public Customer getCustomer(long id) {
// Implementation
}
}
# Configuration properties (class names must be fully qualified; MyService is assumed to be in the default package here)
# Class level
MyService/Retry/maxRetries=10
# Method level
MyService/getCustomer/Retry/maxRetries=7
# Global level
Retry/maxRetries=1
In this scenario:
- The getCustomer method will use maxRetries=7 (method level wins)
- Other methods in MyService that rely on the class-level @Retry will use maxRetries=10 (class level wins)
- @Retry methods in other classes will use maxRetries=1 (global setting)
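The lookup order described in this section can be sketched as a small resolver over a map of properties. This is a simplified illustration, not the code a MicroProfile runtime uses: FtConfigResolver and resolve are hypothetical names, the map stands in for MicroProfile Config, and a real implementation may apply additional matching rules, for example whether the annotation was declared on the class or on the method.

```java
import java.util.Map;

/** Hypothetical resolver that applies method > class > global > annotation precedence. */
class FtConfigResolver {

    static String resolve(Map<String, String> config, String className, String methodName,
                          String annotation, String parameter, String annotationValue) {
        String suffix = annotation + "/" + parameter;
        String[] keysByPrecedence = {
            className + "/" + methodName + "/" + suffix, // method level
            className + "/" + suffix,                    // class level
            suffix                                       // global level
        };
        for (String key : keysByPrecedence) {
            String value = config.get(key);
            if (value != null) {
                return value;
            }
        }
        // Nothing configured: keep the value written in the annotation.
        return annotationValue;
    }
}
```

Fed with the three properties from the example above, resolve returns "7" for MyService/getCustomer, "10" for any other method of MyService, and "1" for methods of other classes.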
Disabling Fault Tolerance Features
You can also disable fault tolerance features at various levels:
# Disable all fault tolerance except @Fallback
MP_Fault_Tolerance_NonFallback_Enabled=false
# Disable specific annotations globally
CircuitBreaker/enabled=false
# Disable for specific class
com.example.MyService/Retry/enabled=false
# Disable for specific method
com.example.MyService/getCustomer/Timeout/enabled=false
Important Considerations
- Config Source Priority: Remember that MicroProfile Config's standard priority rules apply to these properties. For example, system properties override environment variables.
- Runtime Changes: Most configuration changes require application restart to take effect. Properties like MP_Fault_Tolerance_NonFallback_Enabled are only read at application startup.
- Valid Values: When setting boolean properties (like enabled), only true or false are valid values. Using other values results in non-portable behavior.
- Property Validation: Invalid configuration properties (referencing non-existent methods or invalid parameter values) are ignored rather than causing errors.
This flexible configuration system allows you to:
- Fine-tune fault tolerance behavior for different environments (dev, test, prod)
- Adjust timeouts and retry policies based on real-world performance
- Disable fault tolerance features when running behind service meshes that provide their own resilience features
- Respond quickly to incidents by adjusting parameters without code changes
Caveats
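- Order Matters: When combining annotations, be aware that they're applied in a fixed nesting order defined by the specification. From outermost to innermost: Fallback → Retry → Circuit Breaker → Timeout → Bulkhead. In practice this means the timeout applies to each retry attempt individually, and the fallback runs only after everything else has failed.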
- Resource Consumption: While these patterns provide resilience, they can consume additional resources (threads, memory). Monitor your application to ensure you're not creating new problems.
- Testing: Fault tolerance scenarios can be tricky to test. Be sure to carry out comprehensive tests to ensure the annotations work as expected.
Conclusions
MicroProfile Fault Tolerance provides a powerful, standardised way to build resilient, cloud native applications. Its annotation-based approach makes it easy to implement sophisticated fault tolerance patterns while keeping your code clean and maintainable. Whether you're building new microservices or hardening existing ones, these patterns should be part of your toolkit to help maximize uptime and resolve downtime quickly.
Happy Coding!