
problems when network partitions
Posted by: Rick Porter
Date: March 08, 2006 11:39PM

We have been struggling to get recovery to work in the face of a network partition. We have a single node group with three replicas. The goal is that, if the network is partitioned so as to effectively lose one subnet, the clients (not shown) would still have access from the two reachable subnets. Our configuration is the following (a trimmed config.ini sketch is included after the host layout):

Subnet 1:
host A1: mysqld and ndbd
host B1: ndb_mgmd (node 1)

Subnet 2:
host A2: mysqld and ndbd
host B2: ndb_mgmd (node 2)

Subnet 3:
host A3: mysqld and ndbd
host B3: ndb_mgmd (node 3)

The connect string is B1,B2,B3, so the original arbitrator is on subnet 1 (node 1).
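
For reference, here is a trimmed-down sketch of what our config.ini looks like (host names stand in for real addresses, and everything other than the parameters mentioned above is omitted):

  [NDBD DEFAULT]
  # single node group, three replicas
  NoOfReplicas=3

  # management nodes, one per subnet (B1, B2, B3)
  [NDB_MGMD]
  HostName=B1
  [NDB_MGMD]
  HostName=B2
  [NDB_MGMD]
  HostName=B3

  # data nodes, one per subnet (A1, A2, A3)
  [NDBD]
  HostName=A1
  [NDBD]
  HostName=A2
  [NDBD]
  HostName=A3

  # SQL nodes, one per subnet (A1, A2, A3)
  [MYSQLD]
  HostName=A1
  [MYSQLD]
  HostName=A2
  [MYSQLD]
  HostName=A3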

We cause a failure by disconnecting subnet 1 from the other subnets (subnets 2 & 3 can still reach each other). We have tried this when the ndbd on subnet 1 is the master and also when it is not the master, with somewhat different results (shown at the end of this message).

In any event, we see something that does not agree with the documentation: According to the FAQ, when there is an odd number of ndbd nodes, as in this case, the majority should win. That is not happening. What appears to happen is that the winner is the partition with the arbitrator.

My main question:
- Can anyone explain why this is happening and/or help us define a configuration with the behavior we want? Having the winner be the partition containing the arbitrator is no good if that is the subnet which is no longer reachable; I would prefer that the surviving cluster be the partition with the majority of ndbd nodes. Would it help to change the arbitration ranks and, if so, how? (A guess at what that might look like is sketched below.)
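
For instance, is something along these lines the right direction? This is only a guess based on the ArbitrationRank description in the manual; the placement and values are illustrative, and we have not verified that it gives majority-wins behavior.

  # management node on subnet 1
  # (0 = never arbitrate, 1 = high priority, 2 = low priority)
  [NDB_MGMD]
  HostName=B1
  ArbitrationRank=1

  # could an API node be made a fallback arbitrator?
  [MYSQLD]
  HostName=A2
  ArbitrationRank=2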

Some additional questions:
- While we are doing the partitioning experiment, we continually run selects on the DB. When the partition occurs, the select blocks for about 7 seconds before continuing. Is there a way to reduce this delay? I see a number of tunable parameters in the documentation (a few of the ones we have been looking at are listed in the sketch after this list), but I'm not sure which is appropriate, and none of them has a 7-second default.
- We see the terms 'president' (in the logs) and 'master' (in the output of 'ndb_mgm -e show'). Are these the same thing?
- At one point we tried assigning node ids using 'Id=' in the config.ini. The node id assignment mechanism seemed to ignore it and just numbered the nodes according to their order in config.ini. Should use of 'Id=' work? (The sketch after this list shows what we tried.)
- Finally, when I try to search this forum, the dropdown list only offers 'Search All Forums'. Why can't I just search the cluster forum?
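
For what it's worth, these are the kinds of config.ini entries we have been looking at for the two items above. The parameter names are from the manual, but the values and the Id numbering are just our own guesses:

  [NDBD DEFAULT]
  # candidates for shrinking the ~7 second stall on the selects
  # (heartbeat intervals and arbitration wait, all in milliseconds)
  HeartbeatIntervalDbDb=500
  HeartbeatIntervalDbApi=500
  ArbitrationTimeout=1000
  # If a node is declared dead only after missing several heartbeat
  # intervals (1500 ms each by default, if we read the docs right),
  # that plus arbitration would be in the neighborhood of the 7 seconds
  # we see -- but that is only our reading of the docs.

  # explicit node id we tried to assign (this is the part that seemed ignored)
  [NDBD]
  Id=4
  HostName=A1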

Thanks for any answers you can provide. Details of the scenarios we ran are at the end of this message. They include a few error messages, including one that says 'possible bug' and one that says 'please report a bug'.

- Rick




FIRST SCENARIO (tried multiple times) - ndbd on subnet 1 is master:

When the ndbd on subnet 1 is the master (according to 'ndb_mgm -e show'), the cluster after the failure comprises just the ndbd node on A1 (those on A2 and A3 shut down).

Messages in the B1 cluster log (all coming from the ndbd on A1) include:
> Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 3 sig->failNo =
> ...
> Network partitioning - arbitration required
> ...
> President restarts arbitration thread [state=7]
> ...
> Arbitration won - positive reply from node 1
> ...
> Started arbitrator node 1 [ticket=0e7b0003dab4f903]


Messages in the B2 & B3 cluster logs include:
> Lost arbitrator node 1 - process failure [state=6] (from both ndbds)
> ...
> Forced node shutdown completed. Initiated by signal 0. Caused by error 2305


SECOND SCENARIO - master ndbd is in subnet 2 or 3:

We tried this twice with different results.

First run:
The first time, the result was the same as above -- the resultant cluster was the node on subnet 1 (the other two ndbds shut down) -- but the logs were different.

Messages in the B1 cluster log (coming from the ndbd on A1) include:
> Network partitioning - arbitration required
> ...
> President restarts arbitration thread [state=7]
> ...
> Arbitration won - positive reply from node 1
> ...
> GCP Take over started
> LCP Take over started
> ...
> (lots of other messages about LCP)
> ...
> Started arbitrator node 1 [ticket=0f9b0003dad35731]

Messages in the B2 & B3 cluster logs include:
> Lost arbitrator node 1 - process failure [state=6]
> ...
> President restarts arbitration thread [state=1]
> ...
> Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.



Second run (same scenario as the first run, except that we assigned the node ids differently - does that matter?):
The second time, the result was that the cluster ended up on subnets 2 & 3 (2 ndbds, with B3 as the arbitrator), while the ndbd on subnet 1 shut down. From the logs, it appears that the arbitrator was on B2 (node 2) at the beginning of the test -- not sure why this would be, since it is not first in the connect string. We guess it is related to the fact that we had partitioned the network and then unpartitioned it.

Messages in B1 cluster log include:
> Lost arbitrator node 2 - process failure [state=6]
> ...
> Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: ...

Messages in B2 and B3 cluster logs include:
> Network partitioning - arbitration required
> ...
> President restarts arbitration thread [state=7]
> ...
> Arbitration won - positive reply from node 2
> Prepare arbitrator node 2 [ticket=10f40002daed1e62]
> Started arbitrator node 2 [ticket=10f40002daed1e62]



