Vertica Nodes Randomly Fail
Posted: Tue Jul 16, 2013 4:16 pm
Hey guys,
I have a three-node cluster, and the nodes randomly fail.
Here is the part of the vertica.log file that starts where I think a node failed, but I can't figure out why. Can someone take a look and let me know if they've experienced any issues with Vertica nodes failing constantly? Thanks in advance. This is with Vertica 6.1.2.
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on V:verticadb
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 6144 on V:verticadb
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> NETWORK change with 2 VS sets
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> Got current member #r416-15#NXXXXXXXXX187, v_verticadb_node0004 is UP
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Checking Deps:Down bits: 001 Deps:
111 - cnt: 38
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2013-07-16 15:01:00.449481 ExpirationTimestamp: 2081-08-03 18:15:07.449481 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_verticadb_node0004 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0003 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Setting node v_verticadb_node0004 to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.45005 ExpirationTimestamp: 2081-08-03 18:15:07.45005 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 startup state to UNSAFE DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.450118 ExpirationTimestamp: 2013-07-16 15:01:00.450118 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 leaving startup state UP DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Changing node v_verticadb_node0004 startup state from UP to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2013-07-16 15:01:00.450501 ExpirationTimestamp: 2013-07-16 15:11:00.450501 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> stop: disconnecting #r416-15#NXXXXXXXXX187 from spread daemon
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> connected: false
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0004 left the cluster
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of the DB group
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r3645-15#NXXXXXXXXX181->v_verticadb_node0003 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r416-15#NXXXXXXXXX187->v_verticadb_node0004 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r5705-15#NXXXXXXXXX180->v_verticadb_node0001 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of V:All
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> spread thread exiting
2013-07-16 15:01:00.554 SafetyShutdown:0x7f7afc0071b0 [Shutdown] <INFO> Shutting down this node
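To make the sequence in the excerpt easier to follow, here is a minimal Python sketch that pulls out the order in which nodes left the cluster. It is not a Vertica tool; the regular expression and the function name `departures` are assumptions based only on the line format shown above, and the sample lines are copied from this excerpt.

```python
import re

# Assumed line format, taken from the excerpt above, for example:
# 2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster
LEFT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"nodeSetNotifier: node (?P<node>\S+) left the cluster"
)


def departures(lines):
    """Return (timestamp, node) pairs for each 'left the cluster' line, in log order."""
    found = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            found.append((match.group("ts"), match.group("node")))
    return found


# Sample lines copied from the excerpt (hypothetical helper input, not a full log).
sample = [
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster",
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...",
    "2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0003 left the cluster",
    "2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0004 left the cluster",
]

for ts, node in departures(sample):
    print(ts, node)
```

On the lines above, this lists v_verticadb_node0001 and v_verticadb_node0003 at 15:01:00.449, followed by v_verticadb_node0004 at 15:01:00.551. That matches the "Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes" message logged on node0004 just before it shut itself down.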