
Vertica Nodes Randomly Fail

Posted: Tue Jul 16, 2013 4:16 pm
by becky
Hey guys,

I have a three-node cluster and the nodes randomly fail.

Here is a part of the vertica.log file that starts where I think a node failed, but I can't figure out why. Can someone take a look and let me know if they've experienced any issues with Vertica nodes failing... constantly... Thanks in advance. This is with Vertica 6.1.2.

2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on V:verticadb
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:all
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 8192 on Vertica:join
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw transitional message; watch for lost daemons
2013-07-16 15:01:00.448 Spread Client:0x7ea89d0 [Comms] <INFO> Saw membership message 6144 on V:verticadb
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> NETWORK change with 2 VS sets
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> Got current member #r416-15#NXXXXXXXXX187, v_verticadb_node0004 is UP
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 1 members
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Checking Deps:Down bits: 001 Deps:
111 - cnt: 38
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:3 Event Id:0 Event Severity: Critical [2] PostedTimestamp: 2013-07-16 15:01:00.449481 ExpirationTimestamp: 2081-08-03 18:15:07.449481 EventCodeDescription: Current Fault Tolerance at Critical Level ProblemDescription: Loss of node v_verticadb_node0004 will cause shutdown to occur. K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0003 left the cluster
2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Cluster partitioned: 3 total nodes, 1 up nodes, 2 down nodes
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Setting node v_verticadb_node0004 to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.45005 ExpirationTimestamp: 2081-08-03 18:15:07.45005 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 startup state to UNSAFE DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2013-07-16 15:01:00.450118 ExpirationTimestamp: 2013-07-16 15:01:00.450118 EventCodeDescription: Node State Change ProblemDescription: Changing node v_verticadb_node0004 leaving startup state UP DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 [Recover] <INFO> Changing node v_verticadb_node0004 startup state from UP to UNSAFE
2013-07-16 15:01:00.450 Spread Client:0x7ea89d0 <LOG> @v_verticadb_node0004: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2013-07-16 15:01:00.450501 ExpirationTimestamp: 2013-07-16 15:11:00.450501 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: verticadb Hostname: vertica04
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> stop: disconnecting #r416-15#NXXXXXXXXX187 from spread daemon
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> connected: false
2013-07-16 15:01:00.547 Spread Client:0x7ea89d0 [Comms] <INFO> DB Group changed
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [VMPI] <INFO> DistCall: Set current group members called with 0 members
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0004 left the cluster
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Recover] <INFO> Node left cluster, reassessing k-safety...
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of the DB group
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r3645-15#NXXXXXXXXX181->v_verticadb_node0003 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r416-15#NXXXXXXXXX187->v_verticadb_node0004 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Removing #r5705-15#NXXXXXXXXX180->v_verticadb_node0001 from processToNode and other maps due to departure from Vertica:all
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> nodeToState map:
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> Lost membership of V:All
2013-07-16 15:01:00.551 Spread Client:0x7ea89d0 [Comms] <INFO> spread thread exiting
2013-07-16 15:01:00.554 SafetyShutdown:0x7f7afc0071b0 [Shutdown] <INFO> Shutting down this node

Re: Vertica Nodes Randomly Fail

Posted: Tue Jul 16, 2013 5:57 pm
by scutter
Hi Becky,

What do you see in the log files for the nodes that this node sees as going down? Do they actually go down? Is there anything in those logs other than just "nodennnn left the cluster"? If there's nothing else in there, then it's a network issue.
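To compare the logs across nodes, it can help to pull out just the departure events with their timestamps. This is a minimal sketch, assuming the log lines look like the "nodeSetNotifier: node ... left the cluster" lines posted above; the function name `find_departures` is made up for illustration.

```python
import re
import sys

# Matches lines such as:
# 2013-07-16 15:01:00.449 Spread Client:0x7ea89d0 [Comms] <INFO> nodeSetNotifier: node v_verticadb_node0001 left the cluster
LEFT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"nodeSetNotifier: node (?P<node>\S+) left the cluster"
)


def find_departures(lines):
    """Return (timestamp, node_name) pairs for every 'left the cluster' line."""
    events = []
    for line in lines:
        match = LEFT_RE.match(line)
        if match:
            events.append((match.group("ts"), match.group("node")))
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path to vertica.log is passed on the command line.
    with open(sys.argv[1], errors="replace") as log_file:
        for ts, node in find_departures(log_file):
            print(ts, node)
```

Running it against vertica.log on each node and lining up the timestamps should show whether every node saw the others leave at the same moment, which would point at the network rather than at a crash on one host.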

- Are these nodes in a hosted environment?
- Are the nodes using a private network for vertica's data and the spread traffic?
- Check dmesg and/or /var/log/messages to see if the network interfaces are going down
- Does the spread process remain up on all nodes?
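For the dmesg and /var/log/messages check, a small filter can narrow things down. This is only a sketch: the phrases it searches for are assumptions, since the exact wording of link-state messages depends on the NIC driver and kernel, and `link_events` is a hypothetical helper name.

```python
import re
import sys

# Assumed phrases only; adjust them to whatever your driver actually logs.
LINK_PATTERNS = re.compile(
    r"link is down|link down|carrier lost|link is not ready",
    re.IGNORECASE,
)


def link_events(lines):
    """Return the lines that look like network link-state problems."""
    return [line.rstrip("\n") for line in lines if LINK_PATTERNS.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass a saved copy of the dmesg output or /var/log/messages as the argument.
    with open(sys.argv[1], errors="replace") as log_file:
        for event in link_events(log_file):
            print(event)
```

If any hits fall close to the 15:01:00 timestamps in vertica.log, that would support the network theory.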

On a separate topic, the log fragment that you posted:

Deps:
111 - cnt: 38

This tells me that you have 38 projections that have segments on all nodes (unsegmented all nodes). Are you intentionally defining them that way, versus segmenting them across all nodes?

--Sharon

Re: Vertica Nodes Randomly Fail

Posted: Tue Jul 16, 2013 7:47 pm
by becky
Hi Sharon,

Thanks for getting back to me on my issue! Yes, the servers are in a hosted environment. They are VMs. I agree that it's a network issue. When a node goes down, the spread process continues to run. Also, the vertica.pid file doesn't get deleted. To restart Vertica on the failed host, I have to delete the pid file manually.

There is nothing else in the Vertica log files to say why it failed. Is there a spread log file that I can look at? The /var/log/spreadd.log isn't very helpful.

For the segmented nodes issue, yes, I created them manually to test something else. I was going to drop those tables...

Thanks!

Re: Vertica Nodes Randomly Fail

Posted: Tue Jul 16, 2013 8:11 pm
by scutter
When you installed Vertica, did you use the default -U for the spread communications? If yes, then rerun install_vertica using -T -S default, which is recommended both for hosted environments and for VMs.

--Sharon

Re: Vertica Nodes Randomly Fail

Posted: Wed Jul 17, 2013 12:22 am
by becky
Hi Scutter,

Thanks! I re-ran the install like this:

/opt/vertica/sbin/install_vertica -T -s v01,v02,v03 -r vertica-6.1.2-0.x86_64.RHEL5.rpm

It ran okay, and I restarted the DB. I'll let you know how it goes!!!

Re: Vertica Nodes Randomly Fail

Posted: Wed Jul 17, 2013 12:32 am
by becky
Oh, I just noticed one weird message at the end of running the install script (See below):
    ...
    Updating spread configuration...
    Verifying spread configuration on whole cluster.
    Error Monitor 0 errors 4 warnings
    Installation completed with warnings.
    Exception vertica.utils.pexpect.ExceptionPexpect: ExceptionPexpect() in <bound method spawn.__del__ of <vertica.utils.pexpect.spawn object at 0x23141d0>> ignored
    Installation complete.

    To create a database:
    1. Logout and login as dbadmin.**
    2. Run /opt/vertica/bin/adminTools as dbadmin
    3. Select Create Database from the Configuration Menu

    ** The installation modified the group privileges for dbadmin.
    If you used sudo to install vertica as dbadmin, you will
    need to logout and login again before the privileges are applied.

Do you think that exception is an issue?

Re: Vertica Nodes Randomly Fail

Posted: Wed Jul 17, 2013 2:03 am
by scutter
If the database is running again and the spread reconfig is correct, then it's probably not an issue. Verify that /opt/vertica/config/vspread.conf has multiple Spread_Segments in it.
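A quick way to do that check is to count the Spread_Segment entries in the file. This is a rough sketch, assuming each segment starts on its own line with the keyword Spread_Segment and that comment lines start with "#"; `count_spread_segments` is a hypothetical helper, not part of any Vertica tooling.

```python
import sys


def count_spread_segments(conf_text):
    """Count non-comment lines that begin a Spread_Segment block."""
    count = 0
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        if stripped.startswith("Spread_Segment"):
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example argument: /opt/vertica/config/vspread.conf
    with open(sys.argv[1], errors="replace") as conf_file:
        print(count_spread_segments(conf_file.read()))
```

On a three-node cluster, a result of 1 would suggest the reconfiguration did not take effect, while more than one segment is what you are hoping to see.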

But it's probably worth checking the installation logs to see if you can track down what the script was doing when that error/warning occurred.

--Sharon