
HP High Availability and Disaster Recovery Software - HP-UX/SG: Network Problem Caused Both Nodes to Crash Even Though QS Log Shows it Provided Lock to One of Them

Issue

Environment:

HP-UX: 11.31.

Serviceguard: 11.18.

There is a 2-node cluster: node1 is in site1, node2 is in site2, and both use a Quorum Server (QS) in site3. A network problem occurred in site1. When the cluster reformation took place, both nodes requested the cluster lock and the QS granted it to node1. Nevertheless, both nodes TOC'd.
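
For context, a topology like this is normally reflected in the cluster configuration ASCII file through the quorum server parameters. The following is only an illustrative sketch (cluster name, host names and values are placeholders, not taken from this case):

# Illustrative excerpt of the cluster ASCII file (e.g. as produced by cmgetconf)
CLUSTER_NAME            cluster1
# Quorum server host in site3
QS_HOST                 qs-site3
QS_POLLING_INTERVAL     300000000

NODE_NAME               node1
NETWORK_INTERFACE       lan1
HEARTBEAT_IP            xx.xx.xx.1

NODE_NAME               node2
NETWORK_INTERFACE       lan1
HEARTBEAT_IP            xx.xx.xx.2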

What caused this? Can this be prevented?

Solution

The QS log shows node1 got the lock and node2 was denied:


Jan 13 23:08:14:0:Request for lock /sg/xxx succeeded. New lock owners: node1
Jan 13 23:08:14:0:Request for lock /sg/xxx from applicant node2 denied: lock owned by others.
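
These messages come from the quorum server's own log on the QS host. The log location depends on how the qs daemon was started from /etc/inittab; /var/adm/qs/qs.log is a commonly used path (an assumption here, adjust to your setup):

# On the QS host; adjust the path to match the redirection in /etc/inittab
grep "/sg/" /var/adm/qs/qs.log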

From the OLDsyslog of node2:


Jan 13 23:08:14 node2 cmcld[20810]: Member node1 timed out me. Removing it from membership.
Jan 13 23:08:14 node2 cmcld[20810]: Lost heartbeat to node1
Jan 13 23:08:14 node2 cmcld[20810]: Resolving quorum with members node2
Jan 13 23:08:14 node2 cmcld[20810]: Attempting to get quorum server lock /sg/xxx. Active members:node2
Jan 13 23:08:14 node2 cmcld[20810]: Membership: membership at 1 is REFORMING (coordinator 2) includes: 2 excludes: 1
Jan 13 23:08:15 node2 cmcld[20810]: Attempt to get quorum server lock /sg/gcu41714 at xx.xx.xx.xx failed. Lock denied
Jan 13 23:08:15 node2 cmcld[20810]: Deamon exiting as it lost the quorum.
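
cmcld logs these membership and quorum messages through syslog, so on HP-UX they can be pulled from the current or rotated syslog files:

# On node2; OLDsyslog.log is the previous (rotated) syslog on HP-UX
grep cmcld /var/adm/syslog/OLDsyslog.log
grep cmcld /var/adm/syslog/syslog.log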

From the FR log on node1:


Jan 13 23:08:08:0:CLM:01: Event - Member node2 seems unhealthy, not receiving heartbeats from it.
Jan 13 23:08:14:0:CLM:01: External error - Timed out unhealthy member(s). ..
Jan 13 23:08:14:3:CMD:01: Action - Preliminary server quorum unknown with 1 out of 2 votes
Jan 13 23:08:14:3:CMD:01: Action - Target quorum server resolve is GETTING
Jan 13 23:08:14:3:CMD:01: Event - Created verify info=2,1,1
Jan 13 23:08:14:3:CMD:01: Action - Quorum server resolve state is GETTING, was IDLE
Jan 13 23:08:14:0:CMD:01: Action - Attempting to get quorum server lock /sg/xxx. Active members:node1
Jan 13 23:08:14:2:CMD:01: Event - Sending message 16201(16201) to xx.xx.xx.xx
Jan 13 23:08:14:3:CMD:01: Event - Starting lock request timer qm_resolve
Jan 13 23:08:14:1:CMD:01: Action - server quorum in_progress with 1 out of 2 votes
Jan 13 23:08:14:3:CMD:01: Action - active ids: 1
Jan 13 23:08:14:2:CMD:01: Action - Quorum use is BUSY <<<<======
Jan 13 23:08:14:0:CLM:01: Action - Membership: membership at 1 is REFORMING (coordinator 1) includes: 1 excludes: 2

This shows that both nodes requested the lock during reformation, and the QS granted it to node1 and denied it to node2. However, node1's FR log shows that node1 sent the lock request but never received the grant; the last recorded state is "Quorum use is BUSY". This can only happen in one of two ways: either the QS subnet between node1 and the QS was lost right after node1 sent its lock request, so the request went through but the reply from the QS never arrived (very unlikely timing), or the network problem in site1 made the node1 - QS connection mono-directional (i.e., traffic from node1 to the QS got through, but traffic from the QS back to node1 did not). The latter would equally explain why the lock request from node1 succeeded at the QS while node1 never received the lock.
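
When a mono-directional break like this is suspected, connectivity between the node and the QS should be checked from both ends, not just from the node. A minimal sketch, assuming the QS listens on the standard hpqs port 1238/tcp (host names are placeholders):

# From node1: is the QS host reachable and is its hpqs port answering?
ping qs-site3
telnet qs-site3 1238

# From the QS host: is node1 reachable in the reverse direction?
ping node1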

Checking the xportshow output, one can see the status of the QS connection:


Active Internet connections (including servers)
Proto  Recv-Q  Send-Q  Local Address      Foreign Address    (state)
..
tcp         0     112  xx.xx.xx.xx.56570  xx.xx.xx.xx.1238   ESTABLISHED

There are 112 bytes in the send queue (Send-Q), which means that the QS has not acknowledged this data at the TCP level. This is a strong indication that the network connection was broken in one direction: the node could send messages to the QS but did not receive anything back.
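
On a running node, the same symptom can be watched with netstat: if the Send-Q of the connection to the QS port (1238/tcp) stays non-zero across several samples, the node's TCP data to the QS is not being acknowledged:

# Repeat a few times; a Send-Q that never drains indicates unacknowledged data
netstat -an | grep "\.1238"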

In fact, it turned out that the latter is what took place.

In summary, this is a "dual failure" situation:

  1. The heartbeat (HB) network between node1 and node2 failed.

  2. The network connection between node1 and the QS failed.

Serviceguard is a high availability product designed to protect against any single point of failure. It cannot handle two or more failures that occur simultaneously, or close enough together that the second failure happens before Serviceguard has finished handling the first. The only way to prevent situations like this one is to prevent "dual failure" scenarios from arising in the first place, and that can only be accomplished by designing the cluster with enough redundancy, as sketched below.
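
As an illustration of the kind of redundancy meant here (a sketch only; interface names and addresses are placeholders): each node should have at least two heartbeat LANs on physically separate paths, and the route from each node to the QS should not depend on the single network that failed in site1.

# Two heartbeat LANs per node, on physically separate switches/paths
NODE_NAME               node1
NETWORK_INTERFACE       lan1
HEARTBEAT_IP            xx.xx.1.1
NETWORK_INTERFACE       lan2
HEARTBEAT_IP            xx.xx.2.1

NODE_NAME               node2
NETWORK_INTERFACE       lan1
HEARTBEAT_IP            xx.xx.1.2
NETWORK_INTERFACE       lan2
HEARTBEAT_IP            xx.xx.2.2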
