Trying to fix Invalid CSV File

John Machin sjmachin at lexicon.net
Tue Aug 5 19:30:04 EDT 2008


On Aug 5, 6:56 am, Larry Bates <larry.ba... at websafe.com`> wrote:
> Ryan Rosario wrote:
> > On Aug 4, 8:30 am, Emile van Sebille <em... at fenx.com> wrote:
> >> John Machin wrote:
> >>> On Aug 4, 6:15 pm, Ryan Rosario <uclamath... at gmail.com> wrote:
> >>>> On Aug 4, 1:01 am, John Machin <sjmac... at lexicon.net> wrote:
> >>>>> On Aug 4, 5:49 pm, Ryan Rosario <uclamath... at gmail.com> wrote:
> >>>>>> Thanks Emile! Works almost perfectly, but is there some way I can
> >>>>>> adapt this to quote fields that contain a comma in them?
> >> <snip>
>
> >>> Emile's snippet is pushing it through the csv reading process, to
> >>> demonstrate that his series of replaces works (on your *sole* example,
> >>> at least).
> >> Exactly -- just print out the results of the passed argument:
>
> >> rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
>
> >> '123,"Here is some, text ""and some quoted text"" where the quotes
> >> should have been doubled",321'
>
> >> Where it won't work is if any of the field embedded quotes are next to
> >> commas.
>
> >> I'd run it against the file.  Presumably, you've got a consistent field
> >> count expectation per record.  Any resulting record not matching is
> >> suspect and will identify records this approach won't address.
>
> >> There's probably better ways, but sometimes it's fun to create
> >> executable line noise.  :)
>
> >> Emile
>
> > Thanks for your responses. I think John may be right that I am reading
> > it a second time. I will take a look at the CSV reader documentation
> > and see if that helps. Then once I run it I can see if I need to worry
> > about the comma-next-to-quote issue.
>
> This is a perfect demonstration of why tab delimited files are so much better
> than comma and quote delimited.

No, it's a perfect demonstration of what happens when a protocol is
not followed.

>  Virtually all software can handle table
> delimited as well as comma and quote delimited, but you would have none of these
> problems if you had used tab delimited.  The chances of tabs being embedded in
> most data is virtually nil.
>

There may be no tabs in *your* data. There is no guarantee that there
are no tabs in a VARCHAR(n) column in somebody else's database. I've
seen all of \x00, \t, \n, \r and \x1a (Ctrl-Z (EOF in CP/M, *DOS and
Windows text files)).

The possibilities include (1) Don't check (2) check if '\t' in field
and raise an exception (3) silently remove tabs; what do you
recommend?



More information about the Python-list mailing list