Dear all, I tried to get an answer in the official topics on the HP site, but got no reply. Maybe you can help or advise.
I am trying to test Vertica. For the tests I have one table with many IDs, a few sums, and dates. The table contains ~100 million rows, similar to a real fact table.
So I have something like
Id_unique,
country_id (cardinality 60),
city_id (cardinality 150)
source_id ( cardinality 2)
sex_id
personal_id (cardinality over 2 million)
family_id (cardinality over 1 million)
document_id (cardinality over 20 million)
<…>
Pay_Date (integer, spanning ~10 years)
Amount (numeric)
Many of my reports need to answer queries like:
select city_id, sum(amount) as suma, count(distinct personal_id) AS receivers, count(distinct family_id) AS families, count(distinct document_id) AS times
from big_table group by city_id order by city_id where pay_date <=YYY and pay_date >=XXX
(Business meaning: how much money was paid per city, how many different persons received money, how many different families received money, how many payments were made.)
I am getting answers to my queries in ~30 seconds. I would like to achieve 5-7 seconds.
For the test I used a single node. I tried 2-core/2 GB and 4-core/4 GB virtual machines.
I created the table, loaded the data, and ran Database Designer (Comprehensive, using several test SQLs).
Is there anything I can do to speed up my queries?
Will it be faster if I add one more node?
Is my amount of data (about 30 GB of raw data) even enough to justify adding more nodes?
I tried partitioning by country_id (and running the select with group by country_id instead of group by city_id), but the queries became even slower than without partitioning.
Thank you for any advice on what I can try in order to get better speed!
How to optimize
Moderator: NorbertKrupa
-
- Intermediate
- Posts: 149
- Joined: Mon Apr 30, 2012 10:04 pm
- Location: New York
- Contact:
Re: How to optimize
Have you run DBD on your table/queries? Are you seeing any spilling of the GROUP BY to disk? You are also below the minimum memory requirements for Vertica.
Re: How to optimize
Hi ,
** Your query isn't valid (the WHERE clause needs to come before the GROUP BY).
First, try partitioning on the pay_date column (or a function of it) so that it creates daily or monthly partitions; as a result, Vertica will use partition pruning and won't scan the entire table.
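A minimal sketch of monthly partitioning, assuming pay_date is stored as an integer in YYYYMMDD form (the expression must be adapted to how the column is actually encoded):

```sql
-- Hypothetical: integer-divide a YYYYMMDD value by 100 to get YYYYMM,
-- producing one partition per month so date-range predicates can prune.
ALTER TABLE big_table PARTITION BY (pay_date // 100);
```

Repartitioning an existing 100M-row table will move data, so expect the statement (and the background reorganization) to take a while.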
Second, create a projection sorted by pay_date, city_id, document_id, family_id, personal_id, amount.
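A sketch of such a projection (the projection name is illustrative; the column list must match the real table, and the new projection needs a refresh before it is used):

```sql
-- Sort order puts the range-filtered column first, then the GROUP BY key,
-- then the COUNT DISTINCT columns, so scans and aggregation stay ordered.
CREATE PROJECTION big_table_dates_p AS
SELECT pay_date, city_id, document_id, family_id, personal_id, amount
FROM big_table
ORDER BY pay_date, city_id, document_id, family_id, personal_id;

SELECT START_REFRESH();
```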
Third, try splitting the query into several distinct and sum queries, and use more projections to support those queries.
For example: select city_id, sum(amount) as suma from big_table where pay_date <= YYY and pay_date >= XXX group by city_id order by city_id
A projection ordered by pay_date, city_id, amount will give you a good response time.
Let us know if it helped.
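A sketch of that narrower projection (again, the name is illustrative):

```sql
-- Only the three columns the sum-query touches; sorting by pay_date first
-- lets the range predicate prune, and city_id second keeps the GROUP BY
-- key ordered so it can aggregate without a large hash table.
CREATE PROJECTION big_table_sum_p AS
SELECT pay_date, city_id, amount
FROM big_table
ORDER BY pay_date, city_id;
```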
P.S. And yes, yes, yes, use DBD as well.
Zvika
Re: How to optimize
Zvika, jpcavanaugh
Thank you a lot for your replies. I will try your suggestions and report the results. I am traveling, so there may be some pauses in my answers, but I will definitely reply.
A few comments:
Zvika, yes, my SQL had a mistake. I did not copy it but wrote it from memory; in the real SQL the WHERE is in the right place.
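For the record, the intended query with the WHERE clause in the right place (XXX and YYY are the date-range placeholders from the original post):

```sql
SELECT city_id,
       SUM(amount)                 AS suma,
       COUNT(DISTINCT personal_id) AS receivers,
       COUNT(DISTINCT family_id)   AS families,
       COUNT(DISTINCT document_id) AS times
FROM big_table
WHERE pay_date >= XXX AND pay_date <= YYY
GROUP BY city_id
ORDER BY city_id;
```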
DBD, you mean Database Designer? I tried to run it, but noticed that for some reason the test query was not accepted. I will check what the reason was; I think it may be a "distinct" issue. Database Designer wrote something like "SQL cannot be used for optimization due to its nature". But I will check.
I feel there is no reason to partition on pay_date, isn't that so? I say that because I use a lot of COUNT DISTINCT on personal_id, family_id, or document_id, so I need COUNT DISTINCT over different periods, and I feel that partitioning by date will not help me. I am trying to partition by country or by city instead. My first try created ~60 partitions, and I think this was too many (queries became slower). Am I wrong?
Regarding memory and disks: I will try to increase the amount of memory, but I have one issue. I am testing Vertica in parallel with two other column-oriented DBMSes, and I would like to get the same performance from Vertica (or better) as the competing engines show, preferably on the same hardware. I can discuss this privately, but I do not want to put all the details about the other engines into this forum.
Re: How to optimize
jpcavanaugh, one more question. I was interested in your comment, but cannot find the source: where are the minimum memory requirements described? What is the right document? Thank you for your advice.
Re: How to optimize
For minimums, Vertica has a 2 GB per core value, but I really recommend 4-8 GB per core. Can you check dc_execution_engine_events (order by time desc) to see if your GROUP BY is spilling to disk? Also, please let me know if you have any encoding on the projections or if you are using default projections.
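A sketch of that check against the Data Collector table (column names may vary slightly between Vertica versions; adjust to what your build exposes):

```sql
-- GROUP_BY_SPILLED events indicate the hash aggregate did not fit in
-- memory and fell back to disk, which would explain the ~30 s runtimes.
SELECT time, node_name, event_type
FROM dc_execution_engine_events
WHERE event_type = 'GROUP_BY_SPILLED'
ORDER BY time DESC
LIMIT 20;
```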
Re: How to optimize
I have noticed that Database Designer is not working
(I am getting the message:
Query 1 optimization ratio or status is CANNOT BE OPTIMIZED DUE TO QUERY TYPE)
when I have two distincts in my SQL (count(distinct XXX), count(distinct YYY) ... group by ZZZ). Can this be true?
I think I am using the default projections. In the deployment file I see something like:
CREATE PROJECTION public._TaBLE_DBD_1_rep_ba_ba
(
p__id ENCODING COMMONDELTA_COMP,
p_wws_id ENCODING RLE,
p_wss_id ENCODING RLE,
ttt_id ENCODING COMMONDELTA_COMP...
etc.