@scutter
That was a good one!
But the epoch is set once per commit. So if you are loading a table using an ETL tool with a commit interval of 10k, all the duplicates in that 10k batch will get the same epoch, so the epoch won't be of much help in such a scenario.
Now consider a scenario where 10k records are loaded in one go, with around 5k exact duplicates (matching in all fields).
The solution I could think of is:
1) Select all the records that have duplicates, along with a ROW_NUMBER() analytic function, and insert them into a temporary table with the same structure.
2) Delete all the duplicate records from the main table.
3) Insert into the main table all the records from the temporary table having row_number = 1.
4) Drop the temporary table.
This might improve performance, since deleting via the 'IN' operator could take more time than simply deleting all the duplicates and re-inserting the unique records from the temporary table. I haven't tested this for performance yet. Any thoughts on this?
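The four steps above could be sketched roughly like this. This is only a sketch, not tested for performance either; the table main_table(c1, c2, c3) and the temp table name are hypothetical, and a "duplicate" is assumed to mean all columns match:

```sql
-- 1) Copy the duplicated rows into a temp table, ranked per group.
CREATE LOCAL TEMPORARY TABLE dedup_tmp ON COMMIT PRESERVE ROWS AS
SELECT c1, c2, c3,
       ROW_NUMBER() OVER (PARTITION BY c1, c2, c3 ORDER BY c1) AS rn
FROM main_table
WHERE (c1, c2, c3) IN (
    SELECT c1, c2, c3
    FROM main_table
    GROUP BY c1, c2, c3
    HAVING COUNT(*) > 1
);

-- 2) Delete every duplicated row from the main table.
DELETE FROM main_table
WHERE (c1, c2, c3) IN (SELECT c1, c2, c3 FROM dedup_tmp);

-- 3) Put back exactly one copy of each.
INSERT INTO main_table (c1, c2, c3)
SELECT c1, c2, c3 FROM dedup_tmp WHERE rn = 1;

-- 4) Clean up.
DROP TABLE dedup_tmp;
COMMIT;
```

The ORDER BY inside the OVER clause is arbitrary here, since the rows in each group are exact duplicates and it doesn't matter which copy survives.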
Abeesh
De-Duplication in Vertica
Re: De-Duplication in Vertica
Ideally you would de-dup during the load process on the way into the database, so that those 10k batches of records don't have any duplicates.
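One way to sketch de-duplicating during the load, assuming hypothetical tables staging(c1, c2, c3) and main_table(c1, c2, c3), would be to have the ETL tool point each batch at a staging table instead of the main table:

```sql
-- Load the batch into staging first, then move only distinct rows
-- into the main table. SELECT DISTINCT removes duplicates within
-- the batch; checking against rows already in main_table would need
-- an additional NOT EXISTS / anti-join condition.
INSERT INTO main_table (c1, c2, c3)
SELECT DISTINCT c1, c2, c3
FROM staging;

-- Clear the staging table for the next batch.
TRUNCATE TABLE staging;
COMMIT;
```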
Sharon Cutter
Vertica Consultant, Zazz Technologies LLC
Re: De-Duplication in Vertica
scutter wrote: "you can use the undocumented 'epoch' column."
You are just a treasure trove of undocumented goodies.
Check out vertica.tips for more Vertica resources.