The Health Check Service In-Depth - Payara Server

26 May 2016

Recent versions of Payara Server provide the Health Check Service for automatic self-monitoring, in order to detect potential problems as early as possible. When enabled, the Health Check Service periodically checks some low-level metrics and logs a warning whenever one of them crosses a configured threshold. All of these automatic checks are very lightweight and run with a negligible impact on performance.

 

The Health Check Service was introduced in Payara Server 161, and some new metrics were added in Payara Server 162. It is also available in the Payara Micro edition, and we will cover the details of how to use it there in another post soon.

 

 

The Health Check Service periodically checks several metrics, such as CPU and memory usage. If any of these metrics exceeds a configurable threshold, a message is logged to the server’s log file. This helps to detect problems quickly, or to work out what happened after a problem has occurred. For most of the checks, thresholds can be configured at 3 levels: good, warning and critical. When a metric crosses one of these thresholds, the Health Check Service logs a message at the corresponding severity (Info, Warning or Severe in the sample output further down) and repeats it on subsequent checks for as long as the metric stays in that range.

 

This is how the alerts might look in a log viewer:

 

logviewer_demo.png


Let's get started

The Health Check Service is not enabled by default, but can be easily enabled by the asadmin healthcheck-configure command:

asadmin> healthcheck-configure --enabled=true --dynamic=true

The argument --dynamic=true is necessary to turn on the service for a running server, otherwise it would only be applied after a server restart.

We can check the actual configuration of the Health Check Service and all its metrics using asadmin get-healthcheck-configuration:

asadmin> get-healthcheck-configuration

This is an example output from the above command, after some Health Check services have been configured:

 
Health Check Service Configuration is enabled?: true
Below are the list of configuration details of each checker listed by its name.
 
Name  Enabled  Time  Unit
GC    true     10    SECONDS

Name  Enabled  Time  Unit     thresholdPercentage  retryCount
HT    false    10    SECONDS  95                   3

Name  Enabled  Time  Unit     Critical Threshold  Warning Threshold  Good Threshold
CPU   true     10    SECONDS  40                  20                 2
HP    false    8     SECONDS  -                   -                  -
MM    true     7     SECONDS  -                   -                  -

 

 

And this is an example of log messages produced by the above configuration:

 

Info:   CPU:Health Check Result:[[status=GOOD, message='CPU%: 23.28, Time CPU used: 1 seconds 163 milliseconds'']']
Info:   GC:Health Check Result:[[status=GOOD, message='1 times Young GC (PS Scavenge) after 72 milliseconds'']']
Severe: MMEM:Health Check Result:[[status=CRITICAL, message='Physical Memory Used: 7 Gb - Total Physical Memory: 7 Gb - Memory Used%: 96.74%'']']
Info:   HEAP:Health Check Result:[[status=GOOD, message='heap: init: 124 Mb, used: 203 Mb, committed: 460 Mb, max.: 910 Mbheap%: 22.0%'']']
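If you want to post-process these messages (for example to count CRITICAL results per checker), the format shown above is regular enough to pick apart with two regular expressions. The following is only a minimal sketch: the class HealthCheckLogParser and its methods are hypothetical helpers of our own and are not part of Payara Server, and the patterns are based solely on the sample lines in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Payara API): reads checker name and statuses from a log line. */
final class HealthCheckLogParser {

    // The checker name is the token directly in front of ":Health Check Result:".
    private static final Pattern CHECKER = Pattern.compile("(\\w+):Health Check Result:");
    // Every result entry carries a "status=..." field.
    private static final Pattern STATUS = Pattern.compile("status=([A-Z]+)");

    /** Returns the checker name (e.g. "CPU", "GC"), or null if the line is not a health check line. */
    static String checkerName(String line) {
        Matcher m = CHECKER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns all statuses in the line, in order; a line can hold several entries. */
    static List<String> statuses(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = STATUS.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Severe: MMEM:Health Check Result:[[status=CRITICAL, "
                + "message='Physical Memory Used: 7 Gb - Total Physical Memory: 7 Gb "
                + "- Memory Used%: 96.74%'']']";
        System.out.println(checkerName(line) + " -> " + statuses(line));
    }
}
```

Because the patterns only look for the checker prefix and the status fields, they should also work on the longer server.log form of the message, but we have only tried them against the samples shown here.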

 

 logviewer_get_started.png


 

How to configure monitoring of individual metrics

In order to configure the Health Check service to monitor a specific metric, a service for that metric must be configured first. This is done by using the asadmin healthcheck-configure-service and healthcheck-configure-service-threshold commands for each metric service:

 

asadmin> healthcheck-configure-service --serviceName=<service.name> --enabled=true --dynamic=true --time=<value> --unit=MICROSECONDS|MILLISECONDS|SECONDS|MINUTES|HOURS|DAYS
asadmin> healthcheck-configure-service-threshold --serviceName=<service.name> --dynamic=true --thresholdCritical=90 --thresholdWarning=50 --thresholdGood=0

 

 

How do we find the names of each individual metric service? Simple. Run the command asadmin healthcheck-list-services, and it will present the following results:

Available Health Check Services:

  •  healthcheck-cpool
  •  healthcheck-cpu
  •  healthcheck-gc
  •  healthcheck-heap
  •  healthcheck-threads
  •  healthcheck-machinemem

So, to run the CPU usage health check every 5 hours, we would use the following asadmin command:

asadmin> healthcheck-configure-service --serviceName=healthcheck-cpu --enabled=true --dynamic=true --time=5 --unit=HOURS

Monitoring of CPU and Memory Metrics

The health check service offers 3 specific metric services that check periodically for CPU, Physical Memory and Java Memory Heap usage (for the corresponding domain):

  • CPU Usage
    The service calculates how much the Payara server JVM process has utilized the CPU within the monitoring interval and prints the average percentage and the total amount of time in milliseconds that the CPU was used by the Payara server.
     
  • Memory Usage
    The service queries the total amount of physical memory available to the machine, the current amount of memory being used and calculates the percentage of memory use.
     
  • Java Heap Usage
    The service queries the initial amount of memory requested for the Heap space, the current amount being used, the current amount being committed (available for future use) and the maximum amount of memory available to be used for the Heap space and calculates the current percentage of memory use for the Heap.

For all three of these metrics, the service will run periodically and it will compare the calculated values (% of use) to the threshold configuration values (GOOD, WARNING and CRITICAL) and issue alerts in the standard logging mechanism configured for the domain.
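To get a feel for where such numbers come from, the heap figures can be read from the standard JMX MemoryMXBean that every JVM exposes. This is a rough sketch and not Payara's actual code: the class UsagePercentSketch and the method percentUsed are names we made up, and dividing the used amount by the maximum is our assumption about how the percentage is derived.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative sketch only; the names here are hypothetical, not Payara classes. */
final class UsagePercentSketch {

    /** Used amount as a percentage of the total, or -1 when no total is defined. */
    static double percentUsed(long used, long total) {
        if (total <= 0) {
            return -1.0; // the JVM reports -1 for an undefined maximum
        }
        return 100.0 * used / total;
    }

    public static void main(String[] args) {
        // Standard JDK API: a snapshot of the current heap figures, in bytes.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long mb = 1024L * 1024L;
        System.out.printf(
                "heap: init: %d Mb, used: %d Mb, committed: %d Mb, max: %d Mb, heap%%: %.1f%n",
                heap.getInit() / mb, heap.getUsed() / mb, heap.getCommitted() / mb,
                heap.getMax() / mb, percentUsed(heap.getUsed(), heap.getMax()));
    }
}
```

The same percentUsed calculation would apply to physical memory (used versus total), although reading those values requires a JVM-specific operating system bean that we leave out here.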

The following table summarizes the boundaries for each alert severity:

 

Lower Boundary (inclusive)  Upper Boundary (exclusive)  Result
0                           GOOD-Threshold              No event is logged
GOOD-Threshold              WARNING-Threshold           GOOD event logged
WARNING-Threshold           CRITICAL-Threshold          WARNING event logged
CRITICAL-Threshold          100                         CRITICAL event logged
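The boundaries in the table above can be expressed as a short chain of comparisons. The sketch below is our own illustration of those rules (inclusive lower bound, exclusive upper bound); ThresholdSketch and classify are hypothetical names and do not come from the Payara code base.

```java
/** Illustrative sketch of the boundary table; not Payara's implementation. */
final class ThresholdSketch {

    enum Result { NONE, GOOD, WARNING, CRITICAL }

    /**
     * Maps a measured percentage to a result. Each lower boundary is inclusive,
     * so a value equal to a threshold already falls into the higher band.
     */
    static Result classify(double value, double good, double warning, double critical) {
        if (value >= critical) {
            return Result.CRITICAL;
        }
        if (value >= warning) {
            return Result.WARNING;
        }
        if (value >= good) {
            return Result.GOOD;
        }
        return Result.NONE; // below the GOOD threshold: no event is logged
    }

    public static void main(String[] args) {
        // Thresholds GOOD=20, WARNING=45, CRITICAL=90 and a measured value of 75.6%.
        System.out.println(classify(75.6, 20, 45, 90));
    }
}
```

Note that with a GOOD threshold of 0 every measurement produces at least a GOOD event, while a higher GOOD threshold keeps quiet periods out of the log.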

 

 

For example, if the CPU metric checker calculates a value of 75.6% with these thresholds in the configuration:

  • GOOD: 20
  • WARNING: 45
  • CRITICAL: 90

then a message similar to this one would be written to the logging mechanism of the Payara Server domain:

[2016-05-24T03:52:28.690+0000] [Payara 4.1] [INFO] [] [fish.payara.nucleus.healthcheck.HealthCheckService] [tid: _ThreadID=72 _ThreadName=healthcheck-service-3] [timeMillis: 1464061948690] [levelValue: 800] [[CPUC:Health Check Result:[[status=WARNING, message='CPU%: 75.6, Time CPU used: 267 milliseconds'']']]]

 

logviewer_mem_cpu.png

 

In order to configure the threshold range for each metric, the following asadmin command must be run after enabling the metric service:

 
asadmin> healthcheck-configure-service-threshold --serviceName=<service.name> --dynamic=[true|false] --thresholdCritical=<CRITICAL> --thresholdWarning=<WARNING> --thresholdGood=<GOOD>

 

For example, for the healthcheck-cpu service with thresholds of 95, 75 and 60, the domain would respond with the following output:

 
Critical threshold for healthcheck-cpu service is set with value 95.
Warning threshold for healthcheck-cpu service is set with value 75.
Good threshold for healthcheck-cpu service is set with value 60.

 

And the domain.xml file would get modified with the following configuration (considering it was configured previously with a frequency of 1 minute):

 

 

<health-check-service-configuration enabled="true">
     <cpu-usage-checker unit="MINUTES" time="1" enabled="true">
          <property name="threshold-critical" value="95"></property>
          <property name="threshold-warning" value="75"></property>
          <property name="threshold-good" value="60"></property>
     </cpu-usage-checker>
</health-check-service-configuration>

 

 

Remember that in order to apply this configuration change immediately, the dynamic option must be set to true. Also, if the specified metric service already has a threshold configuration, the command will override the configured values. If no threshold values are passed as arguments to the command, the following default values will be used:

  • GOOD : 0
  • WARNING: 50
  • CRITICAL: 90
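If you want to double-check which thresholds actually ended up in domain.xml, the property elements can be read back with the XML parser that ships with the JDK. This is just a sketch under our own assumptions: ThresholdReaderSketch is a hypothetical helper, and it expects the element layout shown in the cpu-usage-checker fragment earlier in this section.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not a Payara API): reads threshold properties of one checker element. */
final class ThresholdReaderSketch {

    /** Returns property name to value for the first element named checkerElement. */
    static Map<String, Integer> readThresholds(String xml, String checkerElement) {
        Map<String, Integer> result = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList checkers = doc.getElementsByTagName(checkerElement);
            if (checkers.getLength() == 0) {
                return result; // checker not configured at all
            }
            NodeList props = ((Element) checkers.item(0)).getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                result.put(p.getAttribute("name"), Integer.parseInt(p.getAttribute("value")));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<health-check-service-configuration enabled=\"true\">"
                + "<cpu-usage-checker unit=\"MINUTES\" time=\"1\" enabled=\"true\">"
                + "<property name=\"threshold-critical\" value=\"95\"></property>"
                + "<property name=\"threshold-warning\" value=\"75\"></property>"
                + "<property name=\"threshold-good\" value=\"60\"></property>"
                + "</cpu-usage-checker>"
                + "</health-check-service-configuration>";
        System.out.println(readThresholds(xml, "cpu-usage-checker"));
    }
}
```

In practice, running asadmin get-healthcheck-configuration is the simpler way to see the effective values; the sketch is mainly useful for scripted checks against a configuration file.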

Monitoring Hogging Threads

The health check service periodically checks for running threads that are "hogging" the CPU in a Payara Server domain. The checker computes the percentage of CPU use for each active thread as the ratio of elapsed CPU time to the checker execution interval, and compares this percentage to a preset threshold value. If the percentage exceeds that value, a CRITICAL message similar to this one is logged for each detected thread:

 

[2016-05-24T21:11:36.579+0000] [Payara 4.1] [SEVERE] [] [fish.payara.nucleus.healthcheck.HealthCheckService] [tid: _ThreadID=71 _ThreadName=healthcheck-service-3] [timeMillis: 1464124296579] [levelValue: 1000] [[HOGT:Health Check Result:[[status=CRITICAL, message='Thread with <id-name>: 145-testing-thread-1 is a hogging thread for the last 59 seconds 999 milliseconds'']']]]

logviewer_threads.png
 

To configure how this metric is evaluated, the service offers the following 2 options:

  • threshold-percentage: Defines the minimum percentage of CPU use at which a thread counts as hogging the CPU. The percentage is calculated as the ratio of elapsed CPU time to the checker execution interval. Its default value is 95.
  • retry-count: Defines how many times a thread must be detected as hogging before a health check message is logged for it. Its default value is 3.
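The interplay of the two options can be sketched as a small counter per thread. Please treat this as our reading of the description above rather than Payara's implementation: HoggingThreadSketch is a hypothetical class, treating the threshold as inclusive is an assumption, and so is resetting the counter when a thread drops below the threshold. In a real checker the CPU time per thread could come from the JDK's ThreadMXBean.getThreadCpuTime(long), which reports nanoseconds.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and reset behaviour are assumptions, not Payara code. */
final class HoggingThreadSketch {

    private final double thresholdPercentage;
    private final int retryCount;
    // Number of checks in a row in which each thread was above the threshold.
    private final Map<Long, Integer> hits = new HashMap<>();

    HoggingThreadSketch(double thresholdPercentage, int retryCount) {
        this.thresholdPercentage = thresholdPercentage;
        this.retryCount = retryCount;
    }

    /** Ratio of CPU time consumed to the length of the check interval, as a percentage. */
    static double cpuPercent(long cpuNanos, long intervalNanos) {
        return 100.0 * cpuNanos / intervalNanos;
    }

    /**
     * Records one check for a thread and returns true once the thread has been
     * at or above the threshold for retryCount checks in a row.
     */
    boolean record(long threadId, long cpuNanos, long intervalNanos) {
        if (cpuPercent(cpuNanos, intervalNanos) < thresholdPercentage) {
            hits.remove(threadId); // assumption: a quiet interval resets the counter
            return false;
        }
        int count = hits.merge(threadId, 1, Integer::sum);
        return count >= retryCount;
    }

    public static void main(String[] args) {
        HoggingThreadSketch checker = new HoggingThreadSketch(95.0, 3);
        long interval = 60_000_000_000L; // a 1 minute check interval in nanoseconds
        for (int check = 1; check <= 3; check++) {
            boolean hogging = checker.record(145L, 59_999_000_000L, interval);
            System.out.println("check " + check + ": hogging=" + hogging);
        }
    }
}
```

With the default values (95 and 3), a thread would therefore need to be almost permanently busy over three checks before it is reported, which keeps short bursts of activity out of the log.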

For the moment, the only way to configure these 2 options is to define them as attributes of the hogging-threads-checker configuration element in the domain.xml file:

 

<health-check-service-configuration enabled="true">
      <machine-memory-usage-checker unit="MINUTES" time="1" enabled="true"></machine-memory-usage-checker>
      <hogging-threads-checker unit="MINUTES" time="1" enabled="true" threshold-percentage="65" retry-count="10"></hogging-threads-checker>
 </health-check-service-configuration>

 

 

After editing the file, restart the domain for these changes to take effect. You can then check them with the asadmin command get-healthcheck-configuration:

 

asadmin> get-healthcheck-configuration
 
Name  Enabled  Time  Unit     Threshold Percentage  Retry Count
HOGT  true     1     MINUTES  65                    10
 
Name  Enabled  Time  Unit     Critical Threshold  Warning Threshold  Good Threshold
CPUC  false    1     MINUTES  95                  80                 60

 

In future versions of Payara Server, it will be possible to configure these 2 options using the asadmin command as well.

 

Monitoring GC activity

The garbage collection Health Check Service checks periodically for any garbage collections. It calculates and prints out how many times garbage collections were executed within the time elapsed since the last check. The log messages include the following information for both young and older generation garbage collection types:

  • number of GC executions since last check
  • total time spent in GC cycles
  • name of the garbage collector algorithm
  • severity (status) of the message

The following is how the log message might look:

Info:   GC:Health Check Result:[[status=GOOD, message='4 times Young GC (PS Scavenge) after 16 milliseconds'']', [status=GOOD, message='4 times Old GC (PS MarkSweep) after 814 milliseconds'']']

 

logviewer_gc.png


 

The above informs us that since the last check there were 4 GC executions on the young generation which, together, took 16 milliseconds. There were also 4 GC executions on the old generation which, together, took 814 milliseconds.

We can also see the status of the alert, which is GOOD in both cases. The status becomes more severe as GC activity takes up a larger share of the CPU time.
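The "since the last check" part of such a message boils down to remembering the previous totals, because the JVM only reports cumulative counters. Here is a minimal sketch of that idea using the standard GarbageCollectorMXBean API; the class GcDeltaSketch and its delta method are hypothetical names of ours, and the real checker may well keep its state differently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not Payara's implementation. */
final class GcDeltaSketch {

    // Last seen {collection count, collection time in ms} per collector name.
    private final Map<String, long[]> previous = new HashMap<>();

    /** Returns {collections, milliseconds} accumulated since the previous call for this collector. */
    long[] delta(String collector, long totalCount, long totalMillis) {
        long[] last = previous.getOrDefault(collector, new long[] {0L, 0L});
        previous.put(collector, new long[] {totalCount, totalMillis});
        return new long[] {totalCount - last[0], totalMillis - last[1]};
    }

    public static void main(String[] args) {
        GcDeltaSketch sketch = new GcDeltaSketch();
        // Standard JDK API: one bean per collector, each with cumulative totals.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long[] d = sketch.delta(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            System.out.println(d[0] + " times " + gc.getName() + " after " + d[1] + " milliseconds");
        }
    }
}
```

On the first call the delta equals the totals since JVM start, so a real checker would probably take a baseline reading before it starts reporting.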

 

The GC health check service can be enabled using the asadmin healthcheck-configure-service command in the same way as the other metric checkers, with healthcheck-gc as the service name:

asadmin> healthcheck-configure-service --serviceName=healthcheck-gc --enabled=true --dynamic=true --time=5 --unit=MINUTES

The above will schedule the check to be executed every 5 minutes, and will write the following into the configuration in the domain.xml file:

 

<health-check-service-configuration>
        <garbage-collector-checker name="GC" unit="MINUTES" time="5" enabled="true"></garbage-collector-checker>
</health-check-service-configuration>

 

 

Monitoring of JDBC Connection pools

The connection pool health check service checks how many connections in a JDBC connection pool are used and how many remain free. If the ratio of used connections hits the configured thresholds, it will print an appropriate alert message to the log.

This is an example of an alert message when 87.5% of the connections in the pool SamplePool are in use:

Warning:   CONP:Health Check Result:[[status=WARNING, message='SamplePool Usage (%): 87.50'']']
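The usage figure itself is a simple ratio of used connections to all connections in the pool. As a quick sketch (PoolUsageSketch is a hypothetical name, and computing the total as used plus free is our assumption), a pool with 7 used and 1 free connection gives the 87.50 shown above:

```java
import java.util.Locale;

/** Illustrative sketch only; not Payara's implementation. */
final class PoolUsageSketch {

    /** Percentage of connections in use; an empty pool counts as 0% used. */
    static double usagePercent(int used, int free) {
        int total = used + free;
        return total == 0 ? 0.0 : 100.0 * used / total;
    }

    /** Formats the figure in the same shape as the alert message in this post. */
    static String message(String pool, int used, int free) {
        return String.format(Locale.ROOT, "%s Usage (%%): %.2f", pool, usagePercent(used, free));
    }

    public static void main(String[] args) {
        System.out.println(message("SamplePool", 7, 1));
    }
}
```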

 

If there are multiple connection pools being used, the alert will include information about all of them:

Warning:   CONP:Health Check Result:[[status=GOOD, message='DerbyPool Usage (%): 25.00'']', [status=WARNING, message='SamplePool Usage (%): 87.50'']']

logviewer_cpool_multiple.png

 

Monitoring of connection pools can be enabled using healthcheck-cpool as service name:

asadmin> healthcheck-configure-service --serviceName=healthcheck-cpool --enabled=true --dynamic=true --time=5 --unit=SECONDS

The thresholds can be configured in the same way as for the other health checks:

asadmin> healthcheck-configure-service-threshold --serviceName=healthcheck-cpool --thresholdCritical=95 --thresholdWarning=75 --thresholdGood=0

This would write the following configuration into the domain.xml file:

<health-check-service-configuration>
        <connection-pool-checker unit="SECONDS" time="5" enabled="true">
          <property name="threshold-critical" value="95"></property>
          <property name="threshold-warning" value="75"></property>
          <property name="threshold-good" value="0"></property>
        </connection-pool-checker>
</health-check-service-configuration>

 

More information

We intended to give you a thorough overview of what the Health Check Service provides and how all the monitoring metrics can be configured. For more information, you can always refer to the Health Check Service documentation page, which provides up-to-date information and will keep being updated when new metrics are added in future versions of Payara Server.

 

The Health Check Service is also available in the Payara Micro edition, and we will cover the details of how to use it there in a future post.

 

 
