HP High Availability and Disaster Recovery Software - HP-UX/SG: Network Problem Caused Both Nodes to Crash Even Though QS Log Shows it Provided Lock to One of Them
There is a 2-node cluster, where node1 is in site1 and node2 is in site2. The cluster uses a Quorum Server (QS) in site3. A network problem occurred in site1. When the cluster reformation took place, both nodes requested the cluster lock and the QS granted it to node1. However, both nodes still TOC'd.
What caused this? Can this be prevented?
The QS log shows node1 got the lock and node2 was denied:
From the OLDsyslog of node2:
From the FR log on node1:
This shows that both nodes requested the lock during the reformation, and that the QS granted it to node1 and denied it to node2. However, per node1's FR log, node1 requested the lock but never received the grant; it reported the QS as busy. This can only happen in one of two ways. Either the QS subnet between node1 and the QS was lost immediately after node1 sent the lock request (so the request went through but the reply from the QS never arrived, which is a very unlikely timing), or the network problem in site1 made the node1-QS connection unidirectional (i.e., traffic from node1 to the QS got through, but traffic from the QS back to node1 did not), which would explain why the lock request succeeded while the lock grant was never received. The message flow under that second scenario is sketched below.
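A simplified timeline of the arbitration under the unidirectional-failure scenario (the ordering and message names are illustrative, not a transcript of the actual logs):

    node1 -> QS    : lock request   (delivered; the node1-to-QS direction still works)
    QS    -> node1 : lock granted   (lost; the QS-to-node1 direction is broken)
    node2 -> QS    : lock request   (delivered)
    QS    -> node2 : lock denied    (delivered; node2 TOCs, as designed for the arbitration loser)
    node1 :          no grant arrives within the arbitration timeout, so node1 also TOCs

Node2's TOC is the expected outcome for the node that loses the arbitration; node1's TOC is the direct consequence of the lost grant.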
Checking the xportshow output, one can see the status of the QS connection:
There are 112 bytes in the send queue, which means that the QS has not acknowledged that data at the TCP level. This is a strong indication that the network connection was broken in one direction: the node could send messages to the QS but did not receive anything back. The same check can be performed on a live system, as shown below.
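On a running node, the same symptom can be observed with netstat: a Send-Q value that stays nonzero on the connection to the Quorum Server (which listens on TCP port 1238, service name hacl-qs) means the peer is not acknowledging data. The addresses in this example are illustrative only:

    # netstat -an | grep 1238
    tcp        0    112  10.1.1.10.49152    10.3.3.30.1238     ESTABLISHED

Here 10.1.1.10 stands for node1 and 10.3.3.30 for the Quorum Server; the Send-Q column (112) corresponds to the unacknowledged bytes seen in the dump.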
In fact, it turned out that the latter is what took place: the network problem in site1 left the node1-QS connection working in one direction only.
So, in summary, this is a "dual failure" situation: the first failure (the network problem in site1) broke the heartbeat communication between the nodes and triggered the reformation, and the second failure (the unidirectional node1-QS link) prevented node1 from receiving the lock it had been granted.
Serviceguard is a High Availability product designed to protect against any single point of failure. It cannot handle two or more failures that occur simultaneously (or in quick succession, such that the second failure strikes before Serviceguard has finished handling the first). The only way to prevent such situations is to prevent "dual failure" situations from happening, which can only be accomplished by designing the cluster with enough redundancy that no single network event can take out both the heartbeat and the quorum connectivity, for example by providing redundant, independently routed network paths, as illustrated below.
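To illustrate the kind of redundancy involved, the cluster configuration excerpt below defines two independent heartbeat subnets per node. The host name, interface names, addresses, and timing values are assumptions for the example, not taken from the affected cluster:

    QS_HOST                qs.site3.example.com
    QS_POLLING_INTERVAL    300000000    # microseconds
    QS_TIMEOUT_EXTENSION   2000000      # microseconds

    NODE_NAME              node1
      NETWORK_INTERFACE    lan0
        HEARTBEAT_IP       192.168.1.1  # first heartbeat subnet
      NETWORK_INTERFACE    lan1
        HEARTBEAT_IP       192.168.2.1  # second, independently cabled heartbeat subnet

    NODE_NAME              node2
      NETWORK_INTERFACE    lan0
        HEARTBEAT_IP       192.168.1.2
      NETWORK_INTERFACE    lan1
        HEARTBEAT_IP       192.168.2.2

Note that the Quorum Server is reached through a single IP address, so redundancy for the node-QS path itself has to be provided at the network layer (redundant switches and routes between each site and site3) rather than in the cluster configuration file.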