Vertica Nodes Randomly Fail

Moderator: NorbertKrupa

vcarusi
Beginner
Posts: 29
Joined: Mon Apr 20, 2015 11:03 am

Re: Vertica Nodes Randomly Fail

Post by vcarusi » Tue May 17, 2016 1:19 pm

Solved.

The Vertica server is installed on Amazon EC2. The problem was with the instance itself: AWS reported "The instance is running on degraded hardware".
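
For anyone wanting to check for the same condition, here is a minimal sketch, assuming the degraded-hardware notice shows up as a scheduled event on the instance and that boto3 is installed with working credentials. The helper names `scheduled_events` and `fetch_status` are made up for this example; `describe_instance_status` is the boto3 EC2 client call, and the response keys used below (`InstanceStatuses`, `Events`, `Code`, `Description`) follow its documented response shape.

```python
def scheduled_events(status_response):
    """Collect (instance_id, event_code, description) for every scheduled
    event found in a describe_instance_status response dict."""
    findings = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            findings.append(
                (status["InstanceId"], event.get("Code"), event.get("Description"))
            )
    return findings


def fetch_status(region_name):
    """Hypothetical helper: fetch one page of instance status from EC2.

    boto3 is imported here so scheduled_events() can be used on a saved
    response without AWS access. Large fleets need pagination, which this
    sketch does not handle.
    """
    import boto3

    client = boto3.client("ec2", region_name=region_name)
    return client.describe_instance_status(IncludeAllInstances=True)
```

Usage would be along the lines of `scheduled_events(fetch_status("us-east-1"))`, where the region is a placeholder. An empty list only means no scheduled event was reported at that moment.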

JimKnicely
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am

Re: Vertica Nodes Randomly Fail

Post by JimKnicely » Wed May 18, 2016 12:42 pm

Thanks for following up on this topic! I've had both good and bad experiences myself with AWS instances :roll:
Jim Knicely


Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

reddyhappydays
Newbie
Posts: 1
Joined: Mon Nov 05, 2018 6:07 am

Re: Vertica Nodes Randomly Fail

Post by reddyhappydays » Fri Dec 07, 2018 2:03 am

Hello Team,

We are facing a similar issue on our EC2 instances.

AWS instance type: i3.metal (72 cores and 512 GB RAM)
o Nodes crash intermittently. We applied the suggestion Vertica gave us in the support ticket, but it did not help.
o At least 2-3 nodes crash per day.
o In the past we used i3.8xlarge, which was stable. To get more resources we are planning to change the instance type, and the 2 different instance types we have tried so far both keep throwing the "NETWORK change" exception.
o We had a similar issue on the i3.16xlarge instance type, so we changed the instance type, which works for us in terms of performance and query elapsed time.

Any help will be appreciated.

Exception:

2018-12-05 20:52:01.822 DistCall Dispatch:7f0184ea8700-a00000000e62e1 [Txn] <INFO> Rollback Txn: a00000000e62e1 'Select 1;'
2018-12-05 20:52:01.824 DistCall Dispatch:7efefd2bb700-a00000000e62e2 [Txn] <INFO> Rollback Txn: a00000000e62e2 'Select 1;'
2018-12-05 20:52:01.856 DistCall Dispatch:7f0184ea8700-a00000000e62e6 [Txn] <INFO> Rollback Txn: a00000000e62e6 'select
'POOLNAME='|| pool_name,
' PRIORITY='|| priority,
' MAXPOOLWAITSECONDS='|| (max(clock_timestamp() - queue_entry_timestamp) :: interval second)::int
from resource_queues
group by pool_name, priority
order by priority desc;'
2018-12-05 20:52:01.888 DistCall Dispatch:7f0184ea8700-a00000000e62e9 [Txn] <INFO> Rollback Txn: a00000000e62e9 'select ' NODENAME='||node_names,
' OBJECTNAME='||object_name,
' TRANSACTIONID='||transaction_id,
' LOCKMODE='||lock_mode,
' LOCKSCOPE='||lock_scope,
' REQUESTTIMESTAMP='||to_char(request_timestamp,'YYYY-MM-DD/HH24:MI:SS'),
' GRANTTIMESTAMP='||to_char(grant_timestamp,'YYYY-MM-DD/HH24:MI:SS'),
' ELAPSEDTIMEMIN='|| minute(current_timestamp - grant_timestamp),
' TRANSACTIONDESCRIPTION='||TRANSACTION_DESCRIPTION
from locks;'
2018-12-05 20:52:01.900 DistCall Dispatch:7efefd2bb700-a00000000e62e5 [Txn] <INFO> Rollback Txn: a00000000e62e5 'select to_char(current_timestamp,'YYYY-MM-DD HH24:MI:SS')
|| ' NODENAME=' || a.node_name
|| ' SESSIONCOUNT=' || a.sesscount
from (select node_name,to_char(count(*)) as sesscount
from sessions
group by 1 order by 1) a'
2018-12-05 20:52:02.001 DiskSpaceRefresher:7efefd2bb700 [Util] <INFO> Task 'DiskSpaceRefresher' enabled
2018-12-05 20:52:02.114 Init Session:7f0183ea6700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=37749 (connCnt 2)
2018-12-05 20:52:02.316 Init Session:7f0183ea6700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=41646 (connCnt 2)
2018-12-05 20:52:02.608 Init Session:7f0183ea6700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.188.239 port=7572 (connCnt 2)
2018-12-05 20:52:02.618 Init Session:7f0183ea6700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=48341 (connCnt 2)
2018-12-05 20:52:02.751 DistCall Dispatch:7eee05fa8700-a00000000e030e [Catalog] <INFO> Loading storage_containers table by scanning the catalog.
2018-12-05 20:52:03.155 Init Session:7f0183ea6700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.189.134 port=24844 (connCnt 2)
2018-12-05 20:52:03.224 Init Session:7f0184ea8700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=23765 (connCnt 3)
2018-12-05 20:52:03.310 Init Session:7efefd2bb700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=40039 (connCnt 4)
2018-12-05 20:52:03.321 Init Session:7f01846a7700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=16004 (connCnt 5)
2018-12-05 20:52:03.742 Init Session:7f01dcf55700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.188.239 port=52625 (connCnt 6)
2018-12-05 20:52:03.802 Init Session:7f01b3827700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=64130 (connCnt 7)
2018-12-05 20:52:04.210 Init Session:7f0144ff9700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=3174 (connCnt 8)
2018-12-05 20:52:04.682 Init Session:7f00e0efb700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=40819 (connCnt 9)
2018-12-05 20:52:05.930 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Found 0 missing DFS files
2018-12-05 20:52:06.012 DistCall Dispatch:7f0144ff9700 <LOG> @v_DB_node0010: 00000/6947: Set addresses for 9 nodes
2018-12-05 20:52:06.431 Init Session:7f0144ff9700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=24816 (connCnt 2)
2018-12-05 20:52:06.469 Init Session:7f0144ff9700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.105 port=49068 (connCnt 2)
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Checking for missed alter partition events
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Found no missed alter partition events
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Checking for missed restore table events
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Found no missed restore table events
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Checking for missed replace node events
2018-12-05 20:52:06.564 DistCall Dispatch:7efeb13fb700-a00000000e62eb [Catalog] <INFO> Found no missed replace node events
2018-12-05 20:52:07.402 DistCall Dispatch:7efeb13fb700-a00000000e62eb [VMPI] <INFO> GetClusterLGE: My local node LGE = 0xbe98ec
2018-12-05 20:52:07.972 Init Session:7f0144ff9700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.188.239 port=19378 (connCnt 2)
2018-12-05 20:52:08.931 Init Session:7f01b3827700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.189.134 port=18220 (connCnt 2)
2018-12-05 20:52:09.338 Init Session:7f00e16fc700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.189.134 port=33650 (connCnt 3)
2018-12-05 20:52:09.420 Init Session:7f01846a7700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=64976 (connCnt 4)
2018-12-05 20:52:10.024 Init Session:7f0184ea8700 <LOG> @v_DB_node0010: 00000/2705: Connection received: host=10.81.187.69 port=6931 (connCnt 5)
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Saw membership message 6144 (0x1800) on V:DB
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> NETWORK change with 2 VS sets
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #0 (not mine) has 1 members (offset=36)
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #0, member 0: #node_c#N010081187052
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1 (mine) has 9 members (offset=72)
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 0: #node_11#N010081187063
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 1: #node_f#N010081187084
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 2: #node_13#N010081187120
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 3: #node_e#N010081187135
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 4: #node_a#N010081187155
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 5: #node_10#N010081187178
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 6: #node_b#N010081187205
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 7: #node_12#N010081187207
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> VS set #1, member 8: #node_d#N010081187245
2018-12-05 20:52:10.257 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> DB Group changed
2018-12-05 20:52:10.491 Spread Service InOrder Queue:7f00b17fa700 [VMPI] <INFO> DistCall: Set current group members called with 9 members
2018-12-05 20:52:10.491 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Saw membership message 6144 (0x1800) on Vertica:all
2018-12-05 20:52:10.491 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> My global sequence value is 2867467155
2018-12-05 20:52:10.492 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Saw membership message 6144 (0x1800) on Vertica:join
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> nodeToState map:
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0001 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0002 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0003 : UNSAFE
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0004 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0005 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0006 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0007 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0008 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0009 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> v_DB_node0010 : UP
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Recover] <INFO> State change for node v_DB_node0003: UNSAFE; catalog 33504542
2018-12-05 20:52:10.872 Spread Service InOrder Queue:7f00b17fa700 [Recover] <INFO> Started recovery assessment task: request ID = 59284
2018-12-05 20:52:10.874 Timer Service:7f0083fff700-1300000000661fc [Txn] <INFO> Begin Txn: 1300000000661fc 'DFSUtil::getLocalNodeLGE'
2018-12-05 20:52:10.875 Timer Service:7f0083fff700-1300000000661fc [Catalog] <INFO> Found 0 missing DFS files
2018-12-05 20:52:10.875 Timer Service:7f0083fff700-1300000000661fc [Txn] <INFO> Rollback Txn: 1300000000661fc 'DFSUtil::getLocalNodeLGE'
2018-12-05 20:52:10.877 Timer Service:7f0083fff700-1300000000661fd [Txn] <INFO> Begin Txn: 1300000000661fd 'ProjUtil::getLocalNodeLGE'
2018-12-05 20:52:10.882 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Saw membership message 5120 (0x1400) on V:DB
2018-12-05 20:52:10.882 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> DB Group changed
2018-12-05 20:52:10.908 Spread Service InOrder Queue:7f00b17fa700 [VMPI] <INFO> DistCall: Set current group members called with 9 members
2018-12-05 20:52:10.908 Spread Service InOrder Queue:7f00b17fa700 [VMPI] <INFO> Ending session v_DB_node0003-337349:0xb72c due to loss of 45035996273720626
2018-12-05 20:52:10.908 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Saw membership message 5120 (0x1400) on Vertica:all
2018-12-05 20:52:10.908 Spread Service InOrder Queue:7f00b17fa700 [Comms] <INFO> Removing #node_c#N010081187052->v_DB_node0003 from processToNode and other maps due to departure from Vertica:all

Thanks
Dilip
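
When the log is long, the lines that matter in an excerpt like the one above are the "NETWORK change" message and any node whose state is not UP. Below is a minimal sketch for pulling those out. It assumes the line layout seen in this post (timestamp, thread name, `[Comms] <INFO>`, message); the regular expressions are fitted to these sample lines and are not a documented log format, and `scan_log` is a name made up for this example.

```python
import re

# Timestamp as it appears at the start of each log line in the excerpt.
_TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"

# e.g. "... [Comms] <INFO> NETWORK change with 2 VS sets"
NETWORK_RE = re.compile(
    r"^" + _TS + r" .*\[Comms\] <INFO> NETWORK change with (?P<sets>\d+) VS sets"
)

# e.g. "... [Comms] <INFO> v_DB_node0003 : UNSAFE"
STATE_RE = re.compile(
    r"^" + _TS + r" .*\[Comms\] <INFO> (?P<node>v_\w+) : (?P<state>[A-Z]+)\s*$"
)


def scan_log(lines):
    """Return a list of (timestamp, subject, detail) tuples for NETWORK
    change messages and for nodes reported in any state other than UP."""
    events = []
    for line in lines:
        match = NETWORK_RE.match(line)
        if match:
            events.append((match["ts"], "NETWORK change", int(match["sets"])))
            continue
        match = STATE_RE.match(line)
        if match and match["state"] != "UP":
            events.append((match["ts"], match["node"], match["state"]))
    return events
```

Run against the excerpt in this post, it would report the NETWORK change at 20:52:10.257 and v_DB_node0003 going UNSAFE at 20:52:10.872. It only narrows down where to look; it does not explain why the node left the group.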
