@scutter
That was a good one!
But the epoch is set once per commit. So if you are loading a table using an ETL tool with a commit interval of 10k, all the duplicates in that 10k batch will get the same epoch, so the epoch won't be of much help in such a scenario.
Now consider a scenario where 10k records are loaded in one go, with around 5k exact duplicates (matching in all fields).
The solution I could think of is:
1) Select all the records that have duplicates, along with a ROW_NUMBER() analytic function, and insert them into a temporary table with the same structure.
2) Delete all the duplicate records from the main table.
3) Insert into the main table all the records from the temporary table having row_number = 1.
4) Drop the temporary table.
This might improve performance, since deleting via the 'IN' operator could take more time than simply deleting all the duplicates and re-inserting the unique records from the temporary table. I haven't tested this for performance yet. Any thoughts on this?
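The four steps above could be sketched roughly like this. This is only a sketch, not tested for performance either; the table main_table(c1, c2, c3) and the temp table name are hypothetical, and a "duplicate" is assumed to mean all columns match:

```sql
-- 1) Copy the duplicated rows into a temp table, ranked per group.
CREATE LOCAL TEMPORARY TABLE dedup_tmp ON COMMIT PRESERVE ROWS AS
SELECT c1, c2, c3,
       ROW_NUMBER() OVER (PARTITION BY c1, c2, c3 ORDER BY c1) AS rn
FROM main_table
WHERE (c1, c2, c3) IN (
    SELECT c1, c2, c3
    FROM main_table
    GROUP BY c1, c2, c3
    HAVING COUNT(*) > 1
);

-- 2) Delete every duplicated row from the main table.
DELETE FROM main_table
WHERE (c1, c2, c3) IN (SELECT c1, c2, c3 FROM dedup_tmp);

-- 3) Put back exactly one copy of each.
INSERT INTO main_table (c1, c2, c3)
SELECT c1, c2, c3 FROM dedup_tmp WHERE rn = 1;

-- 4) Clean up.
DROP TABLE dedup_tmp;
COMMIT;
```

The ORDER BY inside the OVER clause is arbitrary here, since the rows in each group are exact duplicates and it doesn't matter which copy survives.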
Abeesh
De-Duplication in Vertica
Re: De-Duplication in Vertica
Ideally you would de-dup during the load process on the way into the database, so that those 10k batches of records don't have any duplicates.
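One way to sketch de-duplicating during the load, assuming hypothetical tables staging(c1, c2, c3) and main_table(c1, c2, c3), would be to have the ETL tool point each batch at a staging table instead of the main table:

```sql
-- Load the batch into staging first, then move only distinct rows
-- into the main table. SELECT DISTINCT removes duplicates within
-- the batch; checking against rows already in main_table would need
-- an additional NOT EXISTS / anti-join condition.
INSERT INTO main_table (c1, c2, c3)
SELECT DISTINCT c1, c2, c3
FROM staging;

-- Clear the staging table for the next batch.
TRUNCATE TABLE staging;
COMMIT;
```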
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC
Re: De-Duplication in Vertica
scutter wrote: "you can use the undocumented 'epoch' column."
You are just a treasure trove of undocumented goodies.
Check out vertica.tips for more Vertica resources.