Hi all,
I'm trying to import data into Vertica in a format that isn't ideally suited to the task, and I wanted to get opinions from experts about how I could do it better. Please let me know if this isn't the right forum to post to, or if my question is too vague.
Simplifying a bit for the purposes of this question, let's say the data comes from a log file and each line looks something like this:
timestamp,user_id,event_name,unstructured_data
where unstructured_data is key-value pairs, separated by pipes (|).
For example, a row may be:
1416445888,12345,someEvent,key1=val1|key2=val2|key3=val3
Let's also say that after the data is imported, I'll want to run queries that involve some of the keys from the unstructured data.
My first attempt was to import the data as-is, with a column for each comma-separated field and all the unstructured data in a single column. Then I wrote a UDF to retrieve a value from that column given a key. This works - as in, it produces correct results - but can be really slow if I'm parsing a lot of large unstructured data columns. I found that if I were to extract those key-value pairs into their own columns my queries complete orders of magnitude faster.
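To give a concrete idea of what the UDF is doing, its per-key lookup is essentially a linear scan over the string, roughly equivalent to this Python sketch (illustrative only; names and language are not my actual UDF):

```python
def get_value(unstructured, key):
    """Scan a pipe-delimited key=value string for a single key.

    Each lookup is O(length of the string), which is why calling it
    for several keys over hundreds of millions of rows adds up.
    """
    for pair in unstructured.split("|"):
        k, sep, v = pair.partition("=")
        if sep and k == key:
            return v
    return None
```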
Now my questions are:
1) Is my assumption correct that parsing key-value pairs out of a large string in a query that touches a lot of rows (say a few hundred million) will always be slow, or should I be profiling my UDF some more?
2) Is there a recommended way of loading data in a format like this? I'm thinking of a scheme where I'd have my main table with a lot of extra columns (say C0 through CN) and a schema table that maps those extra columns to keys in the unstructured data. It seems like I may have to pre-process the data into a format that's easier to import. Are there any facilities in Vertica that may help with this? Flex tables looked relevant, but it may be better if I do the parsing myself to keep the schemas in sync, since the keys for a given event could change over time.
3) Is there anything I'm missing? I'm brand new to Vertica, but I'm extremely impressed by what I've seen so far.
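To make the scheme in question 2 concrete, here's the sort of column-mapping logic I have in mind, as a Python sketch (all names are hypothetical, just to illustrate the idea):

```python
# The "schema table" records which generic column (C0, C1, ...) holds
# which unstructured-data key for each event type.
schema = {}  # (event_name, key) -> generic column name


def assign_column(event_name, key):
    """Assign the next free generic column for a key, or return the
    existing assignment so the mapping stays stable over time."""
    if (event_name, key) not in schema:
        used = sum(1 for (e, _k) in schema if e == event_name)
        schema[(event_name, key)] = "C%d" % used
    return schema[(event_name, key)]
```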
Thanks in advance for the help / advice!
Alex
Migrating semi-structured data
- JimKnicely
- Site Admin
- Posts: 1825
- Joined: Sat Jan 21, 2012 4:58 am
Re: Migrating semi-structured data
Hi!
I wonder if you'd be better off converting all of the key-value pairs to rows as opposed to using columns?
Example:
First, you could have a staging table like this:
Code: Select all
dbadmin=> select * from test_staging;
   ts_id    | user_id |  event_name   |                             key_val
------------+---------+---------------+------------------------------------------------------------------
 1416445888 |   12345 | someEvent     | key1=val1|key2=val2|key3=val3
 1416445888 |   12345 | System reboot | key1=1|key2=3|key3=3|key4=4
 1416445889 |   12346 | Windows crash | key1=Again|key2=Worst Time|key3=In a presentation|key4=Got fired
(3 rows)
Then transpose the data in the key_val column into rows doing something like this (here, columns is just a helper table with at least as many rows as the maximum number of key-value pairs; the cross join plus row_number() generates the pair index rn):
Code: Select all
dbadmin=> insert into test
          select ts_id, user_id, event_name, rn,
                 split_part(split_part(key_val, '|', rn), '=', 1)::varchar key_col,
                 split_part(split_part(key_val, '|', rn), '=', 2)::varchar val_col
            from (select t.*,
                         row_number() over (partition by t.ts_id, t.user_id, t.event_name) rn
                    from test_staging t
                   cross join columns) foo
           where split_part(key_val, '|', rn) <> '';
 OUTPUT
--------
     11
(1 row)

dbadmin=> select * from test order by ts_id, user_id, event_name, key_seq;
   ts_id    | user_id |  event_name   | key_seq | key_col |      val_col
------------+---------+---------------+---------+---------+-------------------
 1416445888 |   12345 | System reboot |       1 | key1    | 1
 1416445888 |   12345 | System reboot |       2 | key2    | 3
 1416445888 |   12345 | System reboot |       3 | key3    | 3
 1416445888 |   12345 | System reboot |       4 | key4    | 4
 1416445888 |   12345 | someEvent     |       1 | key1    | val1
 1416445888 |   12345 | someEvent     |       2 | key2    | val2
 1416445888 |   12345 | someEvent     |       3 | key3    | val3
 1416445889 |   12346 | Windows crash |       1 | key1    | Again
 1416445889 |   12346 | Windows crash |       2 | key2    | Worst Time
 1416445889 |   12346 | Windows crash |       3 | key3    | In a presentation
 1416445889 |   12346 | Windows crash |       4 | key4    | Got fired
(11 rows)
Now your queries should be super fast:
Code: Select all
dbadmin=> select * from test where user_id = 12345 and event_name = 'System reboot' and key_seq = 3;
   ts_id    | user_id |  event_name   | key_seq | key_col | val_col
------------+---------+---------------+---------+---------+---------
 1416445888 |   12345 | System reboot |       3 | key3    | 3
(1 row)
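Outside the database, the same transpose is just a matter of exploding each row's key_val string into one output row per pair. A Python sketch of the equivalent logic, in case you want to pre-process before loading (assumes the same four input columns as the staging table above):

```python
def transpose(rows):
    """Explode pipe-delimited key=value strings into one row per pair.

    Mirrors the split_part()/row_number() query: each output row is
    (ts_id, user_id, event_name, key_seq, key_col, val_col).
    """
    out = []
    for ts_id, user_id, event_name, key_val in rows:
        for seq, pair in enumerate(key_val.split("|"), start=1):
            key, _, val = pair.partition("=")
            out.append((ts_id, user_id, event_name, seq, key, val))
    return out
```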
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Migrating semi-structured data
Hi Jim (and the person who replied through messages, but I don't have permissions to reply yet!),
Thanks for all the advice and the quick reply. It sounds like this should work quite well for our use case!
Re: Migrating semi-structured data
@abk: "...but I don't have permissions to reply yet..."
You should be able to PM now. Please let me know if you have any issues.
Thanks,
Jim
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Migrating semi-structured data
Hmm, I logged out and back in to my account, but I still can't PM; it says I may need to participate more before I have permissions.
@skwa:
The goal for this is to replace a long set of Hive workflows. Right now, it's mostly exploratory work to see how fast we can go from raw logs in the format described in my OP to useful, processed data. The current HiveQL solution takes 5+ hours. I would like to be able to port those HiveQL queries to Vertica SQL, but I'm open to a large amount of rewriting in the interest of performance.
There are relatively few limitations:
- I have OS access and can compile. I was running Vertica 7.0.x, but I upgraded to 7.1 yesterday to try the Vertica JSON functions. I can also install any external tools as needed.
- I have access to up to 20 nodes.
- In terms of the data itself, the unstructured table will have 1-2B records per day. The current workflow involves a lot of joins between that table and much smaller tables (typically hundreds to thousands of rows, some up to the low millions). Most of it will deal with the current/previous day of data, but there will be some historical queries (the current Hive table is partitioned by date). I will be importing the data into Vertica daily, but once everything is running I will be looking to import it throughout the day. I can pre-process the data before/after loading it into Vertica as needed.
You rated your solutions as:
1) Vertica SHELL-PACKAGE
2) Transform the data to JSON first and use a hybrid table
3) Use the Vertica JSON functions
I'm assuming 2 is preferable to 3 because it doesn't have to reparse the JSON data every time. What are the advantages of 1?
Thanks again,
Alex