Recent versions of Payara Server provide the HealthCheck Service for automatic self-monitoring, designed to detect potential problems as early as possible. When enabled, the HealthCheck Service periodically checks a number of low-level metrics and logs warnings whenever a configured threshold is exceeded. All of these automatic checks are very lightweight and run with negligible impact on performance.
The HealthCheck Service was introduced in Payara Server 161, and some new metrics were added in Payara Server 162. It is also available in the Payara Micro edition, and we will cover how to make use of it there in an upcoming post.
The HealthCheck Service periodically checks several metrics, such as CPU and memory usage. If any of these metrics exceeds a configurable threshold, a message is logged to the server's log file. This helps to rapidly detect problems, or to work out what happened after problems have occurred. For most of the checks, thresholds can be configured at 3 different levels: GOOD, WARNING and CRITICAL. When a threshold is exceeded, the HealthCheck Service logs a message at the corresponding severity and continues to do so periodically until the metric drops back below the threshold.
This is how the alerts might look in a log viewer:
Let's get started
The HealthCheck Service is not enabled by default, but it can easily be enabled with the asadmin healthcheck-configure command:
asadmin> healthcheck-configure --enabled=true --dynamic=true
--dynamic=true is necessary to turn the service on in a running server; otherwise the change would only take effect after a server restart.
We can check the current configuration of the HealthCheck Service and all its metrics using the asadmin get-healthcheck-configuration command:
This is an example output from the above command, after some HealthCheck services have been configured:
And this is an example of log messages produced by the above configuration:
How to configure monitoring of individual metrics
In order to configure the HealthCheck Service to monitor a specific metric, a service for that metric must be configured first. This is done using the asadmin healthcheck-configure-service and healthcheck-configure-service-threshold commands for each metric service:
How do we find the names of each individual metric service? Simple: run the command asadmin healthcheck-list-services, and it will present the following results:
Available HealthCheck Services:
So, to configure the frequency of executing the health checks for the CPU usage to be every 5 hours, we would run the following asadmin command:
asadmin > healthcheck-configure-service --serviceName=healthcheck-cpu --enabled=true --dynamic=true --time=5 --unit=HOURS
Monitoring of CPU and Memory Metrics
The HealthCheck Service offers 3 specific metric services that periodically check CPU, physical memory and Java heap usage (for the corresponding domain):
- CPU Usage
The service calculates how much the Payara server JVM process has utilized the CPU within the monitoring interval and prints the average percentage and the total amount of time in milliseconds that the CPU was used by the Payara server.
- Memory Usage
The service queries the total amount of physical memory available to the machine, the current amount of memory being used and calculates the percentage of memory use.
- Java Heap Usage
The service queries the initial amount of memory requested for the heap space, the current amount being used, the current amount committed (available for future use) and the maximum amount of memory available for the heap space, and calculates the current percentage of heap memory in use.
For all three of these metrics, the service runs periodically, compares the calculated usage percentage to the configured threshold values (GOOD, WARNING and CRITICAL), and issues alerts through the standard logging mechanism configured for the domain.
The following table summarizes the boundaries for each alert severity:

| Lower Boundary (inclusive) | Upper Boundary (exclusive) | Result |
|----------------------------|----------------------------|--------|
| 0                          | GOOD threshold             | No event is logged |
| GOOD threshold             | WARNING threshold          | GOOD event logged |
| WARNING threshold          | CRITICAL threshold         | WARNING event logged |
| CRITICAL threshold         | 100                        | CRITICAL event logged |
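The severity rules in this table can be sketched as a small shell function (purely illustrative; the `classify` helper and the NONE result for values below the GOOD threshold are hypothetical names, not part of Payara):

```shell
#!/bin/sh
# Hypothetical helper: map a usage percentage to an alert severity
# using GOOD/WARNING/CRITICAL thresholds (lower bound inclusive).
classify() {
  value=$1; good=$2; warning=$3; critical=$4
  if   [ "$value" -ge "$critical" ]; then echo "CRITICAL"
  elif [ "$value" -ge "$warning" ];  then echo "WARNING"
  elif [ "$value" -ge "$good" ];     then echo "GOOD"
  else echo "NONE"   # below the GOOD threshold: no event is logged
  fi
}

classify 75 20 45 90   # prints WARNING (75 falls between 45 and 90)
```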
For example, if the CPU metric checker calculates a value of 75.6% with these thresholds configured:
- GOOD: 20
- WARNING: 45
- CRITICAL: 90
then a message similar to this one would be written to the logging mechanism of the Payara Server domain:
[2016-05-24T03:52:28.690+0000] [Payara 4.1] [INFO] [fish.payara.nucleus.healthcheck.HealthCheckService] [tid: _ThreadID=72 _ThreadName=healthcheck-service-3] [timeMillis: 1464061948690] [levelValue: 800] [[CPUC:Health Check Result:[[status=WARNING, message='CPU%: 75.6, Time CPU used: 267 milliseconds'']']]]
In order to configure the threshold range for each metric, the following asadmin command must be run after enabling the metric service:
asadmin > healthcheck-configure-service-threshold --serviceName=<service.name> --dynamic=[true|false] --thresholdCritical=<CRITICAL> --thresholdWarning=<WARNING> --thresholdGood=<GOOD>
For example, if we would like to configure the CPU usage threshold values, we would run the following command:
asadmin > healthcheck-configure-service-threshold --serviceName=healthcheck-cpu --dynamic=true --thresholdCritical=95 --thresholdWarning=75 --thresholdGood=60
The domain would respond with the following output:
Critical threshold for healthcheck-cpu service is set with value 95.
Warning threshold for healthcheck-cpu service is set with value 75.
Good threshold for healthcheck-cpu service is set with value 60.
And the domain.xml file would get modified with the following configuration (considering it was configured previously with a frequency of 1 minute):
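As a sketch, the resulting domain.xml fragment might resemble the following (the element and attribute names here are assumptions based on typical Payara checker configuration, not copied from an actual domain.xml; verify against your own file):

```xml
<!-- Hypothetical sketch: a CPU checker scheduled every 1 minute
     with the thresholds set by the command above -->
<health-check-service-configuration enabled="true">
  <cpu-checker enabled="true" time="1" unit="MINUTES"
               threshold-critical="95" threshold-warning="75" threshold-good="60"/>
</health-check-service-configuration>
```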
Remember that in order for this configuration change to take effect immediately, the dynamic option must be set to true. Also, if a previous threshold configuration exists for the specified metric service, the command will override the already configured values. If no threshold values are passed as arguments to the command, the following defaults are used:
- GOOD : 0
- WARNING: 50
- CRITICAL: 90
Monitoring Hogging Threads
The HealthCheck Service periodically checks for running threads that are "hogging" the CPU in a Payara Server domain. The checker computes the percentage of CPU time used by each active thread as the ratio of elapsed CPU time to the checker execution interval, and compares this percentage to a preset threshold value. If the percentage exceeds that value, a CRITICAL message similar to the following is logged for each detected thread:
[2016-05-24T21:11:36.579+0000] [Payara 4.1] [SEVERE]  [fish.payara.nucleus.healthcheck.HealthCheckService] [tid: _ThreadID=71 _ThreadName=healthcheck-service-3] [timeMillis: 1464124296579] [levelValue: 1000] [[HOGT:Health Check Result:[[status=CRITICAL, message='Thread with <id-name>: 145-testing-thread-1 is a hogging thread for the last 59 seconds 999 milliseconds'']']]]
To configure how this metric is evaluated, the service offers the following 2 options:
threshold-percentage: Defines the minimum percentage needed for a thread to count as hogging the CPU. The percentage is calculated as the ratio of elapsed CPU time to the checker execution interval. Its default value is 95.
retry-count: The number of consecutive checks in which a thread must be detected as hogging before a health check message is reported to the user. Its default value is 3.
For the moment, the only way to configure these 2 options is to define them as attributes of the hogging-threads-checker configuration element in the domain.xml file:
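A sketch of what this might look like (the hogging-threads-checker element name comes from the text above; the time and unit values shown are hypothetical placeholders):

```xml
<!-- Sketch: attribute values shown are illustrative, not defaults -->
<hogging-threads-checker enabled="true" time="5" unit="SECONDS"
                         threshold-percentage="95" retry-count="3"/>
```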
and restart the domain for these changes to take effect. You can verify them with the asadmin command get-healthcheck-configuration:
In future versions of Payara Server, it will be possible to configure these 2 options using asadmin commands as well.
Monitoring GC activity
The garbage collection HealthCheck Service periodically checks for garbage collections. It calculates and logs how many garbage collections were executed within the time elapsed since the last check. The log messages include the following information for both the young and old generation garbage collection types:
- number of GC executions since last check
- total time spent in GC cycles
- name of the garbage collector algorithm
- severity (status) of the message
The following is how the log message might look:
Info: GC:Health Check Result:[[status=GOOD, message='4 times Young GC (PS Scavenge) after 16 milliseconds'']', [status=GOOD, message='4 times Old GC (PS MarkSweep) after 814 milliseconds'']']
The above informs us that since the last check there were 4 GC executions on the young generation which, together, took 16 milliseconds. There were also 4 GC executions on the old generation which, together, took 814 milliseconds.
We can also see the status of the alert, which is GOOD in both cases. The status changes to more critical values when the GC activity takes a bigger portion of the CPU time.
The GC health check service can be enabled using the asadmin healthcheck-configure-service command in the same way as the other metric checkers, with the healthcheck-gc service name:
asadmin> healthcheck-configure-service --serviceName=healthcheck-gc --enabled=true --dynamic=true --time=5 --unit=MINUTES
The above will schedule the check to be executed every 5 minutes, and will write the following into the configuration in the domain.xml file:
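As a sketch, the resulting entry might resemble the following (the garbage-collector-checker element name is an assumption; verify against your own domain.xml):

```xml
<!-- Sketch: GC checker scheduled every 5 minutes, matching the command above -->
<garbage-collector-checker enabled="true" time="5" unit="MINUTES"/>
```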
Monitoring of JDBC Connection pools
The connection pool health check service checks how many connections in a JDBC connection pool are used and how many remain free. If the ratio of used connections hits the configured thresholds, it will print an appropriate alert message to the log.
This is an example of an alert message when 87.5% of the connections in the pool SamplePool are in use:
Warning: CONP:Health Check Result:[[status=WARNING, message='SamplePool Usage (%): 87.50'']']
If there are multiple connection pools being used, the alert will include information about all of them:
Warning: CONP:Health Check Result:[[status=GOOD, message='DerbyPool Usage (%): 25.00'']', [status=WARNING, message='SamplePool Usage (%): 87.50'']']
Monitoring of connection pools can be enabled using healthcheck-cpool as the service name:
asadmin> healthcheck-configure-service --serviceName=healthcheck-cpool --enabled=true --dynamic=true --time=5 --unit=SECONDS
And the thresholds can also be configured as for other health checks:
asadmin> healthcheck-configure-service-threshold --serviceName=healthcheck-cpool --thresholdCritical=95 --thresholdWarning=75 --thresholdGood=0
The above commands will create the following configuration in domain.xml:
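As a sketch, the fragment might resemble the following (the connection-pool-checker element name is an assumption; the attribute values mirror the commands above):

```xml
<!-- Sketch: connection pool checker every 5 seconds with the thresholds above -->
<connection-pool-checker enabled="true" time="5" unit="SECONDS"
                         threshold-critical="95" threshold-warning="75" threshold-good="0"/>
```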
We have aimed to give you a thorough overview of what the HealthCheck Service provides and how all the monitoring metrics can be configured. For more information, you can always refer to the HealthCheck Service documentation page, which provides up-to-date information and will continue to be updated as new metrics are added in future versions of Payara Server.