Hi all,
I'm trying to import data into Vertica in a format that isn't ideally suited to the task, and I wanted to get opinions from experts about how I could do it better. Please let me know if this isn't the right forum to post to, or if my question is too vague.
Simplifying a bit for the purposes of this question, let's say the data comes from a log file and each line looks something like this:
timestamp,user_id,event_name,unstructured_data
where unstructured_data is key-value pairs, separated by pipes (|).
For example, a row may be:
1416445888,12345,someEvent,key1=val1|key2=val2|key3=val3
Let's also say that after the data is imported, I'll want to run queries that involve some of the keys from the unstructured data.
My first attempt was to import the data as-is, with a column for each comma-separated field and all the unstructured data in a single column. Then I wrote a UDF to retrieve a value from that column given a key. This works - as in, it produces correct results - but can be really slow if I'm parsing a lot of large unstructured data columns. I found that if I were to extract those key-value pairs into their own columns my queries complete orders of magnitude faster.
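To give a concrete idea of what the UDF is doing, its per-key lookup is essentially a linear scan over the string, roughly equivalent to this Python sketch (illustrative only; names and language are not my actual UDF):

```python
def get_value(unstructured, key):
    """Scan a pipe-delimited key=value string for a single key.

    Each lookup is O(length of the string), which is why calling it
    for several keys over hundreds of millions of rows adds up.
    """
    for pair in unstructured.split("|"):
        k, sep, v = pair.partition("=")
        if sep and k == key:
            return v
    return None
```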
Now my questions are:
1) Is my assumption correct that parsing key-value pairs out of a large string in a query that touches a lot of rows (say a few hundred million) will always be slow, or should I be profiling my UDF some more?
2) Is there a recommended way of loading data in a format like this? I'm thinking of a scheme where I'd have my main table with a lot of extra columns (say C0 through CN) and a schema table that maps those extra columns to keys in the unstructured data. It seems like I may have to pre-process the data into a format that's easier to import. Are there any facilities in Vertica that may help with this? Flex tables looked relevant, but it may be better if I do the parsing myself to keep the schemas in sync, since the keys for a given event could change over time.
3) Is there anything I'm missing? I'm brand new to Vertica, but I'm extremely impressed by what I've seen so far.
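To make the scheme in question 2 concrete, here's the sort of column-mapping logic I have in mind, as a Python sketch (all names are hypothetical, just to illustrate the idea):

```python
# The "schema table" records which generic column (C0, C1, ...) holds
# which unstructured-data key for each event type.
schema = {}  # (event_name, key) -> generic column name


def assign_column(event_name, key):
    """Assign the next free generic column for a key, or return the
    existing assignment so the mapping stays stable over time."""
    if (event_name, key) not in schema:
        used = sum(1 for (e, _k) in schema if e == event_name)
        schema[(event_name, key)] = "C%d" % used
    return schema[(event_name, key)]
```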
Thanks in advance for the help / advice!
Alex
Migrating semi-structured data
- JimKnicely
- Site Admin
- Posts: 1825
- Joined: Sat Jan 21, 2012 4:58 am
Re: Migrating semi-structured data
Hi!
I wonder if you'd be better off converting all of the key-value pairs to rows as opposed to using columns?
Example:
First, you could have a staging table like this:
Code: Select all
dbadmin=> select * from test_staging;
   ts_id    | user_id |  event_name   |                             key_val
------------+---------+---------------+------------------------------------------------------------------
 1416445888 |   12345 | someEvent     | key1=val1|key2=val2|key3=val3
 1416445888 |   12345 | System reboot | key1=1|key2=3|key3=3|key4=4
 1416445889 |   12346 | Windows crash | key1=Again|key2=Worst Time|key3=In a presentation|key4=Got fired
(3 rows)
Then transpose the data in the key_val column into rows doing something like this (here, columns is just a helper table with at least as many rows as the maximum number of key-value pairs; the cross join plus row_number() generates the pair index rn):
Code: Select all
dbadmin=> insert into test
          select ts_id, user_id, event_name, rn,
                 split_part(split_part(key_val, '|', rn), '=', 1)::varchar key_col,
                 split_part(split_part(key_val, '|', rn), '=', 2)::varchar val_col
            from (select t.*,
                         row_number() over (partition by t.ts_id, t.user_id, t.event_name) rn
                    from test_staging t
                   cross join columns) foo
           where split_part(key_val, '|', rn) <> '';
 OUTPUT
--------
     11
(1 row)

dbadmin=> select * from test order by ts_id, user_id, event_name, key_seq;
   ts_id    | user_id |  event_name   | key_seq | key_col |      val_col
------------+---------+---------------+---------+---------+-------------------
 1416445888 |   12345 | System reboot |       1 | key1    | 1
 1416445888 |   12345 | System reboot |       2 | key2    | 3
 1416445888 |   12345 | System reboot |       3 | key3    | 3
 1416445888 |   12345 | System reboot |       4 | key4    | 4
 1416445888 |   12345 | someEvent     |       1 | key1    | val1
 1416445888 |   12345 | someEvent     |       2 | key2    | val2
 1416445888 |   12345 | someEvent     |       3 | key3    | val3
 1416445889 |   12346 | Windows crash |       1 | key1    | Again
 1416445889 |   12346 | Windows crash |       2 | key2    | Worst Time
 1416445889 |   12346 | Windows crash |       3 | key3    | In a presentation
 1416445889 |   12346 | Windows crash |       4 | key4    | Got fired
(11 rows)
Now your queries should be super fast:
Code: Select all
dbadmin=> select * from test where user_id = 12345 and event_name = 'System reboot' and key_seq = 3;
   ts_id    | user_id |  event_name   | key_seq | key_col | val_col
------------+---------+---------------+---------+---------+---------
 1416445888 |   12345 | System reboot |       3 | key3    | 3
(1 row)
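Outside the database, the same transpose is just a matter of exploding each row's key_val string into one output row per pair. A Python sketch of the equivalent logic, in case you want to pre-process before loading (assumes the same four input columns as the staging table above):

```python
def transpose(rows):
    """Explode pipe-delimited key=value strings into one row per pair.

    Mirrors the split_part()/row_number() query: each output row is
    (ts_id, user_id, event_name, key_seq, key_col, val_col).
    """
    out = []
    for ts_id, user_id, event_name, key_val in rows:
        for seq, pair in enumerate(key_val.split("|"), start=1):
            key, _, val = pair.partition("=")
            out.append((ts_id, user_id, event_name, seq, key, val))
    return out
```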
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Migrating semi-structured data
Hi Jim (and the person who replied through messages, but I don't have permissions to reply yet!),
Thanks for all the advice and the quick reply. It sounds like this should work quite well for our use case!
Re: Migrating semi-structured data
@abk: "...but I don't have permissions to reply yet..."
You should be able to PM now. Please let me know if you have any issues.
Thanks,
Jim
Jim Knicely
Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.
Re: Migrating semi-structured data
Hmm, I logged out and back in to my account, but I still can't PM; it says I may need to participate more before I have permissions.
@skwa:
The goal for this is to replace a long set of Hive workflows. Right now, it's mostly exploratory work to see how fast we can go from raw logs in the format described in my OP to useful, processed data. The current HiveQL solution takes 5+ hours. I would like to be able to port those HiveQL queries to Vertica SQL, but I'm open to a large amount of rewriting in the interest of performance.
There are relatively few limitations:
- I have OS access and can compile. I was running Vertica 7.0.x, but I upgraded to 7.1 yesterday to try the Vertica JSON functions. I can also install any external tools as needed.
- I have access to up to 20 nodes.
- In terms of the data itself, the unstructured table will have 1-2B records per day. The current workflow involves a lot of joins between that table and much smaller tables (typically hundreds to thousands of rows, some up to the low millions). Most of it will deal with the current/previous day of data, but there will be some historical queries (the current Hive table is partitioned by date). I will be importing the data into Vertica daily, but once everything is running I will be looking to import it throughout the day. I can pre-process the data before/after loading it into Vertica as needed.
You rated your solutions as:
1) Vertica SHELL-PACKAGE
2) Transform the data to JSON first and use a hybrid table
3) Use the Vertica JSON functions
I'm assuming 2 is preferable to 3 because it doesn't have to reparse the JSON data every time. What are the advantages of 1?
Thanks again,
Alex