Failed to load CSV file which as line break in quoted column

Moderator: NorbertKrupa

Post Reply
njvijay
Newbie
Newbie
Posts: 2
Joined: Sat Aug 17, 2013 1:19 am

Failed to load CSV file which as line break in quoted column

Post by njvijay » Thu Aug 22, 2013 4:20 am

I am trying to load a CSV file which has a string column in quoted string. Unfortunately I have \n (new line) in the quoted string. Here is the example

Cust_id, cust_address, city,zip
1, "1289 cobb parkway
Bufford", "ATLANTA",34343
2, "1234 IVY lane
Decatur", "ATLANTA",23435

I am trying to use following copy command

COPY tempdb.test_cust
FROM LOCAL 'test.dat'
DELIMITER ','
ENCLOSED BY '"'
NO ESCAPE
SKIP 1
DIRECT;

Above copy script fails with error message

vsql:test.vsql:24: ERROR 2035: COPY(tempdb.test_cust): Input record 1 has been rejected (Too few columns found)

It appears to me line break in between quoted string is considered as record delimiter. Do you have any solution to load this type of delimited file using COPY?

User avatar
nnani
Master
Master
Posts: 302
Joined: Fri Apr 13, 2012 6:28 am
Contact:

Re: Failed to load CSV file which as line break in quoted co

Post by nnani » Thu Aug 22, 2013 8:39 am

Hello Vijay,

Welcome to Vertica forums.

The solution to your problem can be in two steps.

Step 1
If you file is as mentioned pattern, then you need a record terminator on every alternate line
Either you can do this in Linux using a simple awk command

Code: Select all

1, "1289 cobb parkway
 Bufford", "ATLANTA",34343
2, "1234 IVY lane
 Decatur", "ATLANTA",23435

awk '!(NR%2){$(NF+1)=";"}1' filename

1, "1289 cobb parkway
 Bufford", "ATLANTA",34343;
2, "1234 IVY lane
 Decatur", "ATLANTA",23435;
I have placed a record terminator ';' on every alternate line.
Step2
Now coming to Vertica, as you have record terminator defined in the csv file now, you can use the record terminator function with the COPY command this way.

Code: Select all

COPY tempdb.test_cust
FROM LOCAL 'test.dat'
DELIMITER ','
RECORD TERMINATOR ';'
ENCLOSED BY '"'
NO ESCAPE
SKIP 1
DIRECT;


This should do the trick for you.
Note - If doing this in two step might not be a good idea, then there is a package called "External Filter" on Github which we can use to do this in a single COPY command. With external filter you have full access to command line tools while using the COPY command.

Hope this helps.. :)
nnani........
Long way to go

You can check out my blogs at vertica-howto

njvijay
Newbie
Newbie
Posts: 2
Joined: Sat Aug 17, 2013 1:19 am

Re: Failed to load CSV file which as line break in quoted co

Post by njvijay » Thu Aug 22, 2013 3:09 pm

Thanks Navin. I looked at record set. Not all quoted strings have line break. Some records have multiple line break in quoted string. I thought "enclosed by" phrase would not consider \n as record delimiter. Is there any other alternative?

id10t
GURU
GURU
Posts: 732
Joined: Mon Apr 16, 2012 2:44 pm

Re: Failed to load CSV file which as line break in quoted co

Post by id10t » Tue Sep 03, 2013 2:48 am

Hi!

May be too late, but my version (did just for fun):
  • HELP for Python script (let script name 'remove_newline.py'):
    You can change any csv option, for example delimiter, line terminator(in case of windows - '\r\n') and so on. See help.

    Code: Select all

    daniel@synapse:/tmp$ ./replace_newline.py -h
    Usage: replace_newline.py [csv options] -f FILE [-a ACTION]
    
    Options:
      -h, --help            show this help message and exit
      -f FILE, --file=FILE  csv FILE to parse.
      -a ACTION, --action=ACTION
                            0 - escape "\n" to "\\n"; 1 - replace with space; 2 -
                            remove [default: 1]
      -d DELIM, --delimiter=DELIM
                            specifies a one-character string as field separator.
                            [default: ,]
      -q QUOTE, --quote-char=QUOTE
                            specifies a one-character as quoting character.
                            [default: "]
      -e ESC, --escape-char=ESC
                            specifies a one-character as escape charchter.
                            [default: none]
      -D, --double-quote    two consecutive quotes are interpreted as one.
                            [default: False]
      -t, --trim            removes initial whitespaces. [default: True]
      -l NEW_LINE, --line-terminator=NEW_LINE
                            specifies the character sequence which rows
                            terminates. [default: \n]
      -Q QUOTE_TYPE, --quoting=QUOTE_TYPE
                            0- only when required; 1- all; 2- quote non-numeric;
                            3- none [default: 0]
    
  • Python script (let script name 'remove_newline.py'):

    Code: Select all

    #!/usr/bin/env python
    
    import csv
    from optparse import OptionParser
    
    # init command line arguments parser
    usage = "USAGE: %prog [csv options] -f FILE [-a ACTION]"
    parser = OptionParser(usage=usage)
    _ = parser.add_option
    _('-f', '--file', dest='file', metavar='FILE', help='csv FILE to parse.')
    _('-a', '--action', dest='action', type='int', metavar='ACTION', default=1,help='0 - escape "\\n" to "\\\\n"; 1 - replace with space; 2 - remove [default: %default]')
    _('-d', '--delimiter', dest='delimiter', default=',', metavar='DELIM',help='specifies a one-character string as field separator. [default: %default]')
    _('-q', '--quote-char', dest='quotechar', default='"', metavar='QUOTE',help='specifies a one-character as quoting character. [default: %default]')
    _('-e', '--escape-char', dest='escapechar', metavar='ESC', default=None,help='specifies a one-character as escape charchter. [default: %default]')
    _('-D', '--double-quote', dest='doublequote', action="store_true", default=False,help='two consecutive quotes are interpreted as one. [default: %default]')
    _('-t', '--trim', dest='skipinitialspace', action="store_true", default=True,help='removes initial whitespaces. [default: %default]')
    _('-l', '--line-terminator', dest='lineterminator', default='\n', metavar='NEW_LINE',help='specifies the character sequence which rows terminates. [default: \\n]')
    _('-Q', '--quoting', dest='quoting', type='int', default=0, metavar='QUOTE_TYPE',help='0- only when required; 1- all; 2- quote non-numeric; 3- none [default: %default]')
    
    # parse command line arguments
    params = parser.parse_args()[0]
    
    
    # validate command line arguments
    if not params.file:
        parser.error('option -f require')
    
    # init action: apply escape/replace with single space/remove on each new line char
    replace_char = ['\\\\n', ' ', ''][params.__dict__.pop('action')]
    action = lambda field: field.replace('\n', replace_char)
    
    # start
    with open(params.file, 'r') as csv_file:
        del params.file
        params.strict = True                             # errors will remain to Vertica
        csv_reader = csv.reader(csv_file, **params.__dict__)
        for row in csv_reader:
            print params.delimiter.join(map(action, row))
    
    # end
    
  • Csv file example:

    Code: Select all

    daniel@synapse:/tmp$ cat test.csv 
    Cust_id, cust_address, city,zip
    1, "1289 cobb parkway
    Bufford", "ATLANTA",34343
    2, "1234 IVY lane
    Decatur", "ATLANTA",23435
    3, "11223 XXX yyy
    Foo
    
    Bar","ATLANTA",555666
    
  • Script in action:

    Code: Select all

    # set script executable
    daniel@synapse:/tmp$ chmod +x /tmp/replace_newline.py

    Code: Select all

    daniel@synapse:/tmp$ ./replace_newline.py -f /tmp/test.csv
    Cust_id,cust_address,city,zip
    1,1289 cobb parkway Bufford,ATLANTA,34343
    2,1234 IVY lane Decatur,ATLANTA,23435
    3,11223 XXX yyy Foo  Bar,ATLANTA,555666
    

    Code: Select all

    daniel@synapse:/tmp$ ./replace_newline.py -f /tmp/test.csv --action=0
    Cust_id,cust_address,city,zip
    1,1289 cobb parkway\\nBufford,ATLANTA,34343
    2,1234 IVY lane\\nDecatur,ATLANTA,23435
    3,11223 XXX yyy\\nFoo\\n\\nBar,ATLANTA,555666
    daniel@synapse:/tmp$
    
  • How to use with COPY:

    Code: Select all

    daniel@synapse:/tmp$ vsql -c "select * from customer"
     cust_id | cust_address | city | zip 
    ---------+--------------+------+-----
    (0 rows)
    daniel@synapse:/tmp$ ./replace_newline.py -f /tmp/test.csv | vsql -c "copy customer from stdin direct delimiter ','"
    daniel@synapse:/tmp$ vsql -c "select * from customer"
     cust_id |       cust_address        |  city   |  zip   
    ---------+---------------------------+---------+--------
           1 | 1289 cobb parkway Bufford | ATLANTA |  34343
           2 | 1234 IVY lane Decatur     | ATLANTA |  23435
           3 | 11223 XXX yyy Foo  Bar    | ATLANTA | 555666
    (3 rows)
    
    [code]
    daniel@synapse:/tmp$ ./replace_newline.py -f /tmp/test.csv --action=0 | vsql -c "copy customer from stdin direct delimiter ','"
    daniel@synapse:/tmp$ vsql -c "select * from customer"
     cust_id |        cust_address        |  city   |  zip   
    ---------+----------------------------+---------+--------
           1 | 1289 cobb parkway\nBufford | ATLANTA |  34343
           2 | 1234 IVY lane\nDecatur     | ATLANTA |  23435
           3 | 11223 XXX yyy\nFoo\n\nBar  | ATLANTA | 555666
    (3 rows)
    

User avatar
JimKnicely
Site Admin
Site Admin
Posts: 1825
Joined: Sat Jan 21, 2012 4:58 am
Contact:

Re: Failed to load CSV file which as line break in quoted co

Post by JimKnicely » Tue Sep 03, 2013 1:01 pm

sKwa,

That's a pretty cool solution!
Jim Knicely

Image

Note: I work for Vertica. My views, opinions, and thoughts expressed here do not represent those of my employer.

Post Reply

Return to “Vertica Data Load”