Dear all, I tried to get an answer in the official topics on the HP site, but got no reply. Maybe you can help or advise.
I am trying to test Vertica. For the tests I have one table with many IDs, a few sums, and dates. The table contains ~100 million rows, similar to a real fact table.
So I have something like
Id_unique,
country_id (cardinality 60),
city_id (cardinality 150)
source_id ( cardinality 2)
sex_id
personal_id (cardinality over 2 million)
family_id (cardinality over 1 million)
document_id (cardinality over 20 million)
<…>
Pay_Date (integer, spanning ~10 years)
Amount (numeric)
Many of my reports need to answer queries like:
select city_id, sum(amount) as suma, count(distinct personal_id) AS receivers, count(distinct family_id) AS families, count(distinct document_id) AS times
from big_table group by city_id order by city_id where pay_date <=YYY and pay_date >=XXX
(Business meaning: how much money was paid per city, how many different persons received money, how many different families received money, how many payments were made.)
I am getting answers to my queries in ~30 seconds. I would like to achieve 5-7 seconds.
For the test I used a single node. I tried 2-core/2 GB and 4-core/4 GB virtual machines.
I created the table, loaded the data, and ran Database Designer (Comprehensive, using several test SQLs).
Is there anything I can do to speed up my queries?
Will it be faster if I add one more node?
Is my amount of data (about 30 GB of raw data) even enough to justify adding more nodes?
I tried partitioning by country_id (and running the select with group by country_id instead of group by city_id), but the queries became even slower than without partitioning.
Thank you for any advice on what I can try in order to get better speed!
How to optimize
Moderator: NorbertKrupa
-
- Intermediate
- Posts: 149
- Joined: Mon Apr 30, 2012 10:04 pm
- Location: New York
- Contact:
Re: How to optimize
Have you run DBD on your table/queries? Are you seeing any spilling of the GROUP BY to disk? You are also below the minimum memory requirements for Vertica.
Re: How to optimize
Hi ,
** Your query isn't valid (the WHERE clause needs to come before the GROUP BY).
First, try partitioning on the pay_date column (or a function of it) so that it creates daily or monthly partitions; as a result, Vertica will use partition pruning and won't scan the entire table.
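A minimal sketch of monthly partitioning, assuming pay_date is stored as an integer in YYYYMMDD form (the expression must be adapted to how the column is actually encoded):

```sql
-- Hypothetical: integer-divide a YYYYMMDD value by 100 to get YYYYMM,
-- producing one partition per month so date-range predicates can prune.
ALTER TABLE big_table PARTITION BY (pay_date // 100);
```

Repartitioning an existing 100M-row table will move data, so expect the statement (and the background reorganization) to take a while.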
Second, create a projection sorted by pay_date, city_id, document_id, family_id, personal_id, amount.
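A sketch of such a projection (the projection name is illustrative; the column list must match the real table, and the new projection needs a refresh before it is used):

```sql
-- Sort order puts the range-filtered column first, then the GROUP BY key,
-- then the COUNT DISTINCT columns, so scans and aggregation stay ordered.
CREATE PROJECTION big_table_dates_p AS
SELECT pay_date, city_id, document_id, family_id, personal_id, amount
FROM big_table
ORDER BY pay_date, city_id, document_id, family_id, personal_id;

SELECT START_REFRESH();
```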
Third, try splitting the query into several distinct and sum queries, and use more projections to support those queries.
For example: select city_id, sum(amount) as suma from big_table where pay_date <= YYY and pay_date >= XXX group by city_id order by city_id
A projection ordered by pay_date, city_id, amount will give you a good response time.
Let us know if it helped.
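A sketch of that narrower projection (again, the name is illustrative):

```sql
-- Only the three columns the sum-query touches; sorting by pay_date first
-- lets the range predicate prune, and city_id second keeps the GROUP BY
-- key ordered so it can aggregate without a large hash table.
CREATE PROJECTION big_table_sum_p AS
SELECT pay_date, city_id, amount
FROM big_table
ORDER BY pay_date, city_id;
```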
P.S. And yes, yes, yes, use DBD as well.
Zvika
Re: How to optimize
Zvika, jpcavanaugh
Thank you a lot for your replies. I will try your suggestions and report the results. I am traveling, so there may be some pauses in my answers, but I will definitely reply.
A few comments:
Zvika, yes, my SQL had a mistake. I did not copy it but wrote it from memory; in the real SQL the WHERE is in the right place.
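For the record, the intended query with the WHERE clause in the right place (XXX and YYY are the date-range placeholders from the original post):

```sql
SELECT city_id,
       SUM(amount)                 AS suma,
       COUNT(DISTINCT personal_id) AS receivers,
       COUNT(DISTINCT family_id)   AS families,
       COUNT(DISTINCT document_id) AS times
FROM big_table
WHERE pay_date >= XXX AND pay_date <= YYY
GROUP BY city_id
ORDER BY city_id;
```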
DBD, you mean Database Designer? I tried to run it, but noticed that for some reason the test query was not accepted. I will check what the reason was; I think it may be a "distinct" issue. Database Designer wrote something like "SQL cannot be used for optimization due to its nature". But I will check.
I feel there is no reason to partition on pay_date, isn't that so? I say that because I use a lot of COUNT DISTINCT on personal_id, family_id, or document_id, so I need COUNT DISTINCT over different periods, and I feel that partitioning by date will not help me. I am trying to partition by country or by city instead. My first try created ~60 partitions, and I think this was too many (queries became slower). Am I wrong?
Regarding memory and disks: I will try to increase the amount of memory, but I have one issue. I am testing Vertica in parallel with two other column-oriented DBMSes, and I would like to get the same performance from Vertica (or better) as the competing engines show, preferably on the same hardware. I can discuss this privately, but I do not want to put all the details about the other engines into this forum.
Re: How to optimize
jpcavanaugh, one more question. I was interested in your comment, but cannot find the source: where are the minimum memory requirements described? What is the right document? Thank you for your advice.
Re: How to optimize
For minimums, Vertica has a 2 GB per core value, but I really recommend 4-8 GB per core. Can you check dc_execution_engine_events (order by time desc) to see if your GROUP BY is spilling to disk? Also, please let me know if you have any encoding on the projections or if you are using default projections.
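A sketch of that check against the Data Collector table (column names may vary slightly between Vertica versions; adjust to what your build exposes):

```sql
-- GROUP_BY_SPILLED events indicate the hash aggregate did not fit in
-- memory and fell back to disk, which would explain the ~30 s runtimes.
SELECT time, node_name, event_type
FROM dc_execution_engine_events
WHERE event_type = 'GROUP_BY_SPILLED'
ORDER BY time DESC
LIMIT 20;
```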
Re: How to optimize
I have noticed that Database Designer is not working
(I am getting the message:
Query 1 optimization ratio or status is CANNOT BE OPTIMIZED DUE TO QUERY TYPE)
when I have two distincts in my SQL (count(distinct XXX), count(distinct YYY) ... group by ZZZ). Can this be true?
I think I am using the default projections. In the deployment file I see something like:
CREATE PROJECTION public._TaBLE_DBD_1_rep_ba_ba
(
p__id ENCODING COMMONDELTA_COMP,
p_wws_id ENCODING RLE,
p_wss_id ENCODING RLE,
ttt_id ENCODING COMMONDELTA_COMP...
etc.