Nugget Friday - Building Resilient Microservices with MicroProfile Fault Tolerance

Published on 08 Nov 2024

While we all want maximum uptime for our software systems, failures and downtimes are inevitable. However, these can be minimized and quickly resolved through comprehensive, robust and well-designed fault tolerance mechanisms. This Nugget Friday looks at how to leverage MicroProfile Fault Tolerance. We look at how it operates and explore how its features can help developers address common failure scenarios. Let's dig in!

The Problem

Network glitches, service outages, and resource exhaustion can cause cascading failures that bring down entire systems in the most unexpected of ways. Without proper fault tolerance mechanisms, a single failing component can create a domino effect of failures across your application. Traditional error handling mechanisms, such as try-catch blocks, aren't sufficient for building truly resilient, fault-tolerant systems, especially when dealing with complex, distributed systems. A more advanced and systematic approach to fault tolerance is essential to keep applications running smoothly.

The Solution: MicroProfile Fault Tolerance

MicroProfile Fault Tolerance provides an annotation-driven approach to building resilient software applications. It offers a suite of powerful annotations that implement common fault tolerance patterns, making it easy to handle failures gracefully and prevent system-wide catastrophic outages.

Let's look at the core features of MicroProfile Fault Tolerance.

1. Circuit Breaker Pattern (@CircuitBreaker)

Think of a circuit breaker like its electrical counterpart - it prevents system overload by "tripping" when too many failures occur. Using it is as simple as:

@CircuitBreaker(requestVolumeThreshold = 4,
                failureRatio = 0.5,
                delay = 1000,
                successThreshold = 2)
public Connection getDatabaseConnection() {
    return connectionPool.getConnection();
}

This circuit breaker will:

Monitor the last 4 requests (requestVolumeThreshold)
Open if 50% or more requests fail (failureRatio)
Stay open for 1 second (delay)
Require 2 successful test requests to close again (successThreshold)
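
To make these state transitions concrete, here is a minimal plain-Java sketch of the closed → open → half-open cycle. It is not how any MicroProfile implementation is written: the class SimpleCircuitBreaker and its methods are hypothetical, and the caller passes in the current time so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of the behaviour @CircuitBreaker provides. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final double failureRatio;
    private final long delayMillis;
    private final int successThreshold;

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;
    private int halfOpenSuccesses;

    SimpleCircuitBreaker(int requestVolumeThreshold, double failureRatio,
                         long delayMillis, int successThreshold) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.failureRatio = failureRatio;
        this.delayMillis = delayMillis;
        this.successThreshold = successThreshold;
    }

    /** Returns true if a call may proceed at the given time. */
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= delayMillis) {
            state = State.HALF_OPEN; // delay elapsed: let trial calls through
            halfOpenSuccesses = 0;
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed to proceed. */
    void record(boolean failed, long nowMillis) {
        if (state == State.OPEN) {
            return; // calls are rejected while open, nothing to record
        }
        if (state == State.HALF_OPEN) {
            if (failed) {
                open(nowMillis); // a failed trial call re-opens the circuit
            } else if (++halfOpenSuccesses >= successThreshold) {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > requestVolumeThreshold) {
            window.removeFirst(); // keep only the most recent calls
        }
        if (window.size() == requestVolumeThreshold) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / requestVolumeThreshold >= failureRatio) {
                open(nowMillis);
            }
        }
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
        window.clear();
    }

    State state() {
        return state;
    }
}
```

With the values from the annotation above (4, 0.5, 1000, 2), two failures within four recorded calls open the circuit, and two successful trial calls after the one-second delay close it again.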

2. Retry Pattern (@Retry)

When dealing with transient failures, sometimes all you need is to try again:

@Retry(maxRetries = 3,
       delay = 200,
       jitter = 100,
       retryOn = {SQLException.class, TimeoutException.class})
public List<Customer> getCustomers() {
    return customerService.fetchCustomers();
}

This configuration will:

Attempt the operation up to 3 additional times
Wait 200ms between retries
Add random jitter of ±100ms, so the actual wait falls between 100ms and 300ms, to prevent thundering herd problems
Only retry on specific exceptions
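
If you are curious what this amounts to, the loop below is a rough plain-Java equivalent. It is only a sketch: RetrySketch, callWithRetry and nextDelayMillis are hypothetical names, and it handles unchecked exceptions only to keep the example short.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical helper that mimics what @Retry does; not part of MicroProfile. */
final class RetrySketch {

    /** Base delay plus a random offset in [-jitter, +jitter], never below zero. */
    static long nextDelayMillis(long delay, long jitter) {
        long offset = ThreadLocalRandom.current().nextLong(-jitter, jitter + 1);
        return Math.max(0, delay + offset);
    }

    /** Runs the action once, then up to maxRetries more times on retryable failures. */
    static <T> T callWithRetry(Supplier<T> action, int maxRetries, long delay, long jitter,
                               Set<Class<? extends RuntimeException>> retryOn) {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                boolean retryable = retryOn.stream().anyMatch(type -> type.isInstance(e));
                if (!retryable || attempt >= maxRetries) {
                    throw e; // not retryable, or retries exhausted
                }
                sleep(nextDelayMillis(delay, jitter));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

Note that maxRetries = 3 means up to four invocations in total: the original call plus three retries.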

3. Timeout Pattern (@Timeout)

Never let your operations hang indefinitely:

@Timeout(value = 500, unit = ChronoUnit.MILLIS)
public Weather getWeatherData() {
    return weatherService.getCurrentConditions();
}

This ensures the operation will fail fast if it takes longer than 500 milliseconds.
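
For comparison, this is roughly what you would write by hand without the annotation. Treat it as a sketch: TimeoutSketch and callWithTimeout are hypothetical names, and it signals the timeout with java.util.concurrent.TimeoutException rather than the exception type MicroProfile Fault Tolerance uses.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical hand-rolled timeout; not part of MicroProfile. */
final class TimeoutSketch {

    static <T> T callWithTimeout(Supplier<T> action, long timeoutMillis) throws TimeoutException {
        // Run the action on another thread so the caller can stop waiting.
        CompletableFuture<T> future = CompletableFuture.supplyAsync(action);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; the underlying task is not interrupted.
            future.cancel(true);
            throw e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("action failed", e.getCause());
        }
    }
}
```

The boilerplate around threads and exception handling is exactly what the single @Timeout line saves you from writing.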

4. Bulkhead Pattern (@Bulkhead)

Isolate failures by limiting concurrent executions:

@Bulkhead(value = 5, waitingTaskQueue = 8)
@Asynchronous
public Future<Response> serviceCall() {
    // Service implementation
}

This configuration:

Limits concurrent executions to 5
Maintains a waiting queue of up to 8 tasks
When used with @Asynchronous, implements thread pool isolation
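
Without @Asynchronous, the bulkhead works in semaphore style instead: callers beyond the limit are rejected immediately and waitingTaskQueue has no effect. A minimal sketch of that idea in plain Java follows; SemaphoreBulkhead is a hypothetical class, and it throws IllegalStateException where a real implementation would throw its own bulkhead exception.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore-style bulkhead; not part of MicroProfile. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action) {
        // Reject immediately instead of queueing when all permits are taken.
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always free the slot, even on failure
        }
    }
}
```

Because the permit is released in a finally block, a failing call cannot permanently shrink the bulkhead's capacity.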

5. Fallback Pattern (@Fallback)

Always have a Plan B:

@Fallback(fallbackMethod = "getCachedCustomers")
public List<Customer> getCustomers() {
    return customerService.fetchCustomers();
}

private List<Customer> getCachedCustomers() {
    return customerCache.getCustomers();
}

When the primary method fails, the fallback method provides an alternative solution.

Why You Should Care

Building resilient applications is essential for modern distributed systems, and MicroProfile Fault Tolerance provides a powerful, simplified solution to achieve this. The features discussed above address common failure points in distributed architectures, helping developers create reliable, failure-resistant applications with little extra code. More precisely, they offer:

Simplified Resilience: These patterns are battle-tested solutions to common distributed system problems. Having them available as simple annotations makes it easy to build reliable, resilient applications.
Declarative Approach: The annotation-based approach separates fault tolerance concerns from business logic, making code cleaner and more maintainable.
Configurability: All aspects can be configured via MicroProfile Config, allowing for environment-specific tuning without code changes.
Metrics Integration: When used with MicroProfile Metrics, you get automatic monitoring of your fault tolerance mechanisms, helping you understand system behaviour.

Advanced Features of MicroProfile Fault Tolerance

Combining Annotations

The real power of MicroProfile Fault Tolerance comes from combining these patterns. For example:

@Retry(maxRetries = 2)
@Timeout(500)
@CircuitBreaker(requestVolumeThreshold = 4, failureRatio = 0.5)
@Fallback(fallbackMethod = "getBackupData")
public Data getServiceData() {
    return service.getData();
}

This creates a robust operation that:

Times out each attempt after 500ms
Retries up to 2 times
Opens a circuit breaker if too many failures occur
Falls back to a backup method if all else fails
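
One way to picture the combination is as nested wrappers, with the fallback on the outside so that it only runs once the retries are used up. The sketch below shows just the retry and fallback layers to stay short. CompositionSketch and its methods are hypothetical names, and it assumes maxRetries is zero or greater.

```java
import java.util.function.Supplier;

/** Hypothetical wrappers that show how retry and fallback nest. */
final class CompositionSketch {

    /** Wraps the inner call so that it is attempted up to maxRetries + 1 times. */
    static <T> Supplier<T> withRetry(Supplier<T> inner, int maxRetries) {
        return () -> {
            RuntimeException last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last;
        };
    }

    /** Wraps the inner call so that any failure is replaced by the fallback value. */
    static <T> Supplier<T> withFallback(Supplier<T> inner, Supplier<T> fallback) {
        return () -> {
            try {
                return inner.get();
            } catch (RuntimeException e) {
                return fallback.get();
            }
        };
    }
}
```

Calling withFallback(withRetry(call, 2), backup).get() on a call that always fails invokes it three times and then returns the backup value.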

Configuration Override

One of the most powerful features of MicroProfile Fault Tolerance is its flexible configuration system. All fault tolerance parameters can be overridden without changing code, following a well-defined precedence order. How so? Let's take a look.

1. Method-Level Override

The most specific configuration targets a particular method in a specific class:

# Format: <classname>/<methodname>/<annotation>/<parameter>
com.example.MyService/getCustomers/Timeout/value=2000

This overrides the timeout value specifically for the getCustomers method in the MyService class.

2. Class-Level Override

You can configure all occurrences of an annotation within a specific class:

# Format: <classname>/<annotation>/<parameter>
com.example.MyService/Retry/maxRetries=3
com.example.MyService/CircuitBreaker/delay=2000

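This applies to the @Retry and @CircuitBreaker annotations declared on the MyService class itself, and therefore to every method that inherits them. A method that declares its own copy of the annotation is configured through the method-level property instead.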

3. Global Override

For application-wide configuration, you can set parameters globally:

# Format: <annotation>/<parameter>
Retry/maxRetries=2
Timeout/value=1000
CircuitBreaker/failureRatio=0.75

These settings apply to all uses of these annotations across your application unless overridden by more specific configurations.

Configuration Precedence

When multiple configurations exist, they follow this precedence order (highest to lowest):

Method-level configuration
Class-level configuration
Global configuration
Annotation values in code

For example:

@Retry(maxRetries = 3)
public class MyService {
    @Retry(maxRetries = 5)
    public Customer getCustomer(long id) {
        // Implementation
    }
}

# Configuration properties:
# Class level
MyService/Retry/maxRetries=10
# Method level
MyService/getCustomer/Retry/maxRetries=7
# Global level
Retry/maxRetries=1

In this scenario:

The getCustomer method will use maxRetries=7 (method level wins)
Other methods in MyService that inherit the class-level @Retry will use maxRetries=10 (class level wins)
@Retry methods in other classes will use maxRetries=1 (global setting)
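
The lookup behind this can be sketched in a few lines of plain Java. This is our reading of the rules, not code from any implementation: FtConfigLookup and resolve are hypothetical names, a plain Map stands in for MicroProfile Config, and the declaredOnMethod flag models the rule that a method-level property only applies to an annotation placed on the method, while a class-level property only applies to an annotation placed on the class.

```java
import java.util.Map;

/** Hypothetical model of how a fault tolerance parameter is resolved. */
final class FtConfigLookup {

    static String resolve(Map<String, String> config, String className, String methodName,
                          boolean declaredOnMethod, String annotation, String parameter,
                          String annotationValue) {
        // The specific key depends on where the annotation is declared.
        String specificKey = declaredOnMethod
                ? className + "/" + methodName + "/" + annotation + "/" + parameter
                : className + "/" + annotation + "/" + parameter;
        String globalKey = annotation + "/" + parameter;

        String specific = config.get(specificKey);
        if (specific != null) {
            return specific;
        }
        // Fall back to the global property, then to the value written in code.
        return config.getOrDefault(globalKey, annotationValue);
    }
}
```

Fed with the three properties from the example, resolve returns 7 for getCustomer, 10 for a method covered by the class-level @Retry, and 1 for a @Retry method in another class.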

Disabling Fault Tolerance Features

You can also disable fault tolerance features at various levels:

# Disable all fault tolerance except @Fallback
MP_Fault_Tolerance_NonFallback_Enabled=false

# Disable specific annotations globally
CircuitBreaker/enabled=false

# Disable for specific class
com.example.MyService/Retry/enabled=false

# Disable for specific method
com.example.MyService/getCustomer/Timeout/enabled=false

Important Considerations

Config Source Priority: Remember that MicroProfile Config's standard priority rules apply to these properties. For example, system properties override environment variables.
Runtime Changes: Most configuration changes require application restart to take effect. Properties like MP_Fault_Tolerance_NonFallback_Enabled are only read at application startup.
Valid Values: When setting boolean properties (like enabled), only true or false are valid values. Using other values results in non-portable behavior.
Property Validation: Configuration properties that do not match an existing annotation (for example, ones referencing a non-existent method) are ignored rather than causing errors.
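
To illustrate the config source priority point, the sketch below picks the value from the source with the highest ordinal. The ordinals in the usage note are the MicroProfile Config defaults (400 for system properties, 300 for environment variables, 100 for microprofile-config.properties). ConfigSourceSketch and Source are hypothetical names, and the sketch ignores how property names are mapped to environment variable names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of ordinal-based config source priority. */
final class ConfigSourceSketch {

    static final class Source {
        final String name;
        final int ordinal;
        final Map<String, String> properties;

        Source(String name, int ordinal, Map<String, String> properties) {
            this.name = name;
            this.ordinal = ordinal;
            this.properties = properties;
        }
    }

    /** Returns the value from the highest-ordinal source that defines the key. */
    static Optional<String> lookup(List<Source> sources, String key) {
        return sources.stream()
                .filter(s -> s.properties.containsKey(key))
                .max(Comparator.comparingInt((Source s) -> s.ordinal))
                .map(s -> s.properties.get(key));
    }
}
```

If Retry/maxRetries is set to 2 in microprofile-config.properties (ordinal 100) and to 6 as a system property (ordinal 400), the lookup returns 6.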

This flexible configuration system allows you to:

Fine-tune fault tolerance behavior for different environments (dev, test, prod)
Adjust timeouts and retry policies based on real-world performance
Disable fault tolerance features when running behind service meshes that provide their own resilience features
Respond quickly to incidents by adjusting parameters without code changes

Caveats

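Order Matters: When combining annotations, be aware that they nest in a fixed order, regardless of the order in which you write them: Fallback wraps Retry, which wraps Circuit Breaker, which wraps Timeout, which wraps Bulkhead. In practice, the timeout applies to each attempt, every retry passes through the circuit breaker, and the fallback runs only once the retries are exhausted.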
Resource Consumption: While these patterns provide resilience, they can consume additional resources (threads, memory). Monitor your application to ensure you're not creating new problems.
Testing: Fault tolerance scenarios can be tricky to test. Be sure to carry out comprehensive tests to ensure the annotations work as expected.
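
One simple aid for such tests is a stub dependency that fails a fixed number of times before succeeding, so you can check how many attempts your retry or fallback logic really makes. The FlakySupplier class below is a hypothetical helper written for illustration, not something MicroProfile provides.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical test stub: fails a set number of times, then returns a value. */
final class FlakySupplier<T> implements Supplier<T> {
    private final int failuresBeforeSuccess;
    private final T value;
    private final AtomicInteger calls = new AtomicInteger();

    FlakySupplier(int failuresBeforeSuccess, T value) {
        this.failuresBeforeSuccess = failuresBeforeSuccess;
        this.value = value;
    }

    @Override
    public T get() {
        int call = calls.incrementAndGet();
        if (call <= failuresBeforeSuccess) {
            throw new IllegalStateException("simulated failure #" + call);
        }
        return value;
    }

    /** Number of times get() has been invoked so far. */
    int calls() {
        return calls.get();
    }
}
```

Injecting such a stub behind a @Retry method lets a test assert both the final result and the exact number of invocations.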

Conclusions

MicroProfile Fault Tolerance provides a powerful, standardised way to build resilient, cloud native applications. Its annotation-based approach makes it easy to implement sophisticated fault tolerance patterns while keeping your code clean and maintainable. Whether you're building new microservices or hardening existing ones, these patterns should be part of your toolkit to help maximize uptime and resolve downtime quickly.

Happy Coding!