How to Improve Domain Data Grid Performance

By Fabio Turizo

One of the cornerstones of any modern Payara Platform architecture is the Domain Data Grid. It allows multiple Payara Server or Payara Micro instances to join and form a robust cluster of interchangeable nodes that share data with each other and provide High Availability and Failover capabilities to any applications deployed in the cluster.

The Domain Data Grid in the Payara Platform is built on Hazelcast IMDG Community Edition, a powerful open source technology that makes it easy to create distributed applications for multiple environments.

The data grid is enabled by default in every Payara Platform 5.x product, and any administrator can verify its composition by looking for the following message in the corresponding server logs:

[2021-01-20T17:56:58.408-0500] [] [INFO] [] [fish.payara.nucleus.cluster.PayaraCluster] [tid: _ThreadID=80 _ThreadName=Executor-Service-4] [timeMillis: 1611183418408] [levelValue: 800] [[
Data Grid Status
Payara Data Grid State: DG Version: 35 DG Name: production DG Size: 2
Instances: {
DataGrid: production Group: MicroShoal Name: Instance-1 Lite: false This: true UUID: f0b6d054-f017-40a5-ad50-732fb6e200b9 Address: /192.168.1.148:6900
DataGrid: production Group: MicroShoal Name: Instance-2 Lite: false This: false UUID: f0b6d054-f017-40a5-ad50-732fb6e200b9 Address: /192.168.1.150:6900
}]]
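If you want to check the grid composition automatically, the status message can be read with a regular expression. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the summary line keeps the exact wording shown in the log excerpt above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataGridStatusCheck {
    // Assumption: the summary line has the same wording as the log excerpt above.
    private static final Pattern STATE = Pattern.compile(
            "Payara Data Grid State: DG Version: (\\d+) DG Name: (\\S+) DG Size: (\\d+)");

    /** Returns the most recently reported grid size, or -1 if no status line is found. */
    static int gridSize(String logText) {
        Matcher m = STATE.matcher(logText);
        int size = -1;
        while (m.find()) {
            // Later status messages overwrite earlier ones.
            size = Integer.parseInt(m.group(3));
        }
        return size;
    }

    public static void main(String[] args) {
        String line = "Payara Data Grid State: DG Version: 35 DG Name: production DG Size: 2";
        System.out.println("Reported grid size: " + gridSize(line));
    }
}
```

Comparing the returned size with the number of instances you started is a quick way to confirm that every instance has joined.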

You can configure the Domain Data Grid in multiple ways, but its primary behaviour is straightforward: when an existing data grid discovers a new server instance, that instance joins the grid seamlessly and without any direct user intervention. Network latency, JVM settings, and Hazelcast settings all affect the time it takes an instance to join the grid and become an official Hazelcast cluster member. Adjusting the Hazelcast settings can shorten the time an instance needs to join the cluster, which in turn reduces the overall time it takes for a freshly initialized Domain Data Grid to be ready to serve user requests.

The following system properties, which Hazelcast uses internally, control how long an instance waits before it joins the grid:

Property | Description | Default Value
hazelcast.wait.seconds.before.join | Wait time in seconds before the instance is allowed to join the grid | 5
hazelcast.max.wait.seconds.before.join | Maximum time in seconds that the instance will take to join the grid | 20

These properties can be confusing at first, so here is what you need to know:

  • Although Hazelcast's clustering strategy is complex, the main thing you need to know is that every Hazelcast cluster requires a master node (or member), which coordinates how each instance joins the cluster.
    • In Payara Server, the DAS will always function as the master node in the default Domain Data Grid configuration.
    • In Payara Micro, the first instance that starts sets itself up as the master node, and any new instance launched within the same network attempts to join the grid by coordinating with it.
  • When a new server instance starts up, Hazelcast tries to discover whether a reachable master node already exists. If one does, the instance contacts it and tries to join the cluster that this node manages.

The two properties listed above define a window of time called the pre-join phase. It starts when the master node receives the first join request from an instance and ends when one of the following happens:

  • No new instances request to join the cluster for hazelcast.wait.seconds.before.join seconds.
  • The time defined in hazelcast.max.wait.seconds.before.join elapses.

Once the pre-join phase ends, the master node moves to the official join phase, in which it joins all instances that completed their pre-join phases (or that are still trying to join after an earlier failed attempt).
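To make the two end conditions concrete, here is a minimal sketch that computes when the pre-join phase would end for a given series of join requests. It is a simplified model of the description above, not Hazelcast's actual implementation: the class and method names are hypothetical, times are whole seconds, and it assumes the maximum wait is counted from the first join request.

```java
import java.util.Arrays;

public class PreJoinWindow {
    /**
     * Returns the second at which the pre-join phase ends in this simplified model.
     *
     * @param requestTimes   seconds at which join requests reach the master (at least one)
     * @param waitSeconds    value of hazelcast.wait.seconds.before.join
     * @param maxWaitSeconds value of hazelcast.max.wait.seconds.before.join
     */
    static long preJoinEnd(long[] requestTimes, long waitSeconds, long maxWaitSeconds) {
        if (requestTimes.length == 0) {
            throw new IllegalArgumentException("at least one join request is required");
        }
        long[] sorted = requestTimes.clone();
        Arrays.sort(sorted);

        // Assumption: the hard cap is measured from the first join request.
        long hardLimit = sorted[0] + maxWaitSeconds;
        // The window closes here unless another request arrives first.
        long quietLimit = sorted[0] + waitSeconds;

        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] >= Math.min(quietLimit, hardLimit)) {
                break; // the window has already closed; this request waits for the next round
            }
            quietLimit = sorted[i] + waitSeconds; // a new request restarts the quiet period
        }
        return Math.min(quietLimit, hardLimit);
    }

    public static void main(String[] args) {
        // Defaults (5 and 20): a single instance waits out the full quiet period.
        System.out.println(preJoinEnd(new long[] {0}, 5, 20)); // 5
        // Requests arriving every 3 seconds keep extending the window up to the cap.
        System.out.println(preJoinEnd(new long[] {0, 3, 6, 9, 12, 15, 18}, 5, 20)); // 20
        // Lower values (1 and 5): the same single instance is released after 1 second.
        System.out.println(preJoinEnd(new long[] {0}, 1, 5)); // 1
    }
}
```

In this model a lone instance always waits for the full hazelcast.wait.seconds.before.join period, so that property dominates the join time when instances start one at a time.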

With all of this in mind, to minimize the time it takes a server instance to join the Data Grid, set both values as low as is practical, for example between 1 and 5 seconds. Here's an example that sets these properties when launching a new Payara Micro instance:

java -Dhazelcast.wait.seconds.before.join=1 -Dhazelcast.max.wait.seconds.before.join=5 -jar payara-micro.jar

Keep in mind that this optimization only works as long as the number of instances that try to join the grid at any given time is not very high. If, for example, 20 new instances were started at the same time, the master's pre-join phase would be too short to allow all of them to join at (around) the same time. Consider shorter times only when the number of joining instances is expected to remain consistently small over your environment's lifetime.

At the time of writing, setting the hazelcast.wait.seconds.before.join property to 0 (which effectively disables the wait time for new join requests) is not recommended, since known bugs in Hazelcast cause unwanted side effects. We expect these to be fixed in future releases of Hazelcast and then quickly integrated into the Payara Platform as well.
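If you launch Payara Micro instances from your own tooling, the guidance above can be captured in a small helper that builds the launch command and rejects a wait time of 0. This is only a sketch: the class and method names are hypothetical, the jar name is taken from the example command earlier, and the check that the maximum wait is not lower than the wait time is an assumption of this sketch rather than a documented Hazelcast rule.

```java
import java.util.Arrays;
import java.util.List;

public class JoinTimingArgs {
    /** Builds the command line for a Payara Micro instance with tuned join timings. */
    static List<String> launchCommand(int waitSeconds, int maxWaitSeconds) {
        if (waitSeconds < 1) {
            // Setting the wait time to 0 is not recommended at the time of writing.
            throw new IllegalArgumentException("waitSeconds must be at least 1");
        }
        if (maxWaitSeconds < waitSeconds) {
            // Assumption of this sketch: the cap should not be shorter than the wait time.
            throw new IllegalArgumentException("maxWaitSeconds must not be lower than waitSeconds");
        }
        return Arrays.asList(
                "java",
                "-Dhazelcast.wait.seconds.before.join=" + waitSeconds,
                "-Dhazelcast.max.wait.seconds.before.join=" + maxWaitSeconds,
                "-jar", "payara-micro.jar");
    }

    public static void main(String[] args) {
        // Prints the same command as the example above; the returned list
        // could also be passed to a ProcessBuilder to start the instance.
        System.out.println(String.join(" ", launchCommand(1, 5)));
    }
}
```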
 