Trying to fix Invalid CSV File

Ryan Rosario uclamathguy at gmail.com
Wed Aug 6 01:20:06 EDT 2008


On Aug 4, 1:56 pm, Larry Bates <larry.ba... at websafe.com> wrote:
> Ryan Rosario wrote:
> > On Aug 4, 8:30 am, Emile van Sebille <em... at fenx.com> wrote:
> >> John Machin wrote:
> >>> On Aug 4, 6:15 pm, Ryan Rosario <uclamath... at gmail.com> wrote:
> >>>> On Aug 4, 1:01 am, John Machin <sjmac... at lexicon.net> wrote:
> >>>>> On Aug 4, 5:49 pm, Ryan Rosario <uclamath... at gmail.com> wrote:
> >>>>>> Thanks Emile! Works almost perfectly, but is there some way I can
> >>>>>> adapt this to quote fields that contain a comma in them?
> >> <snip>
>
> >>> Emile's snippet is pushing it through the csv reading process, to
> >>> demonstrate that his series of replaces works (on your *sole* example,
> >>> at least).
> >> Exactly -- just print out the results of the passed argument:
>
> >> rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')
>
> >> '123,"Here is some, text ""and some quoted text"" where the quotes
> >> should have been doubled",321'
>
> >> Where it won't work is if any of the field embedded quotes are next to
> >> commas.
>
> >> I'd run it against the file.  Presumably, you've got a consistent field
> >> count expectation per record.  Any resulting record not matching is
> >> suspect and will identify records this approach won't address.
>
> >> There's probably better ways, but sometimes it's fun to create
> >> executable line noise.  :)
>
> >> Emile
>
> > Thanks for your responses. I think John may be right that I am reading
> > it a second time. I will take a look at the CSV reader documentation
> > and see if that helps. Then once I run it I can see if I need to worry
> > about the comma-next-to-quote issue.
>
> This is a perfect demonstration of why tab delimited files are so much better
> than comma and quote delimited.  Virtually all software can handle tab
> delimited as well as comma and quote delimited, but you would have none of these
> problems if you had used tab delimited.  The chances of tabs being embedded in
> most data is virtually nil.
>
> -Larry

Thank you for all the help. I wasn't using Emile's code correctly. It
fixed 99% of the problem, reducing 30,000 bad lines to about 300. The
remaining cases were too difficult to pin a pattern on, so I just
spent an hour fixing those lines. It was typically just adding one
more " to one that was already there.
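
For anyone repeating this, here is a minimal sketch of running Emile's
replace chain over each line and then using the field-count check he
suggested to flag the lines that still need fixing by hand. The names
fix_quotes and needs_manual_fix are hypothetical, and the sample line
is only a reconstruction of the input from the output quoted above;
the expected field count of 3 is likewise an assumption that fits
that one example.

```python
import csv


def fix_quotes(rec):
    # Emile's chain of replaces, as quoted earlier in the thread:
    # protect the field-delimiting quotes, double the rest, then restore.
    return (rec.replace(',"', ",'''")
               .replace('",', "''',")
               .replace('"', '""')
               .replace("'''", '"'))


def needs_manual_fix(rec, expected_fields):
    # Parse the repaired line with the csv module and compare the number
    # of fields with what every record is supposed to have.
    fields = next(csv.reader([fix_quotes(rec)]))
    return len(fields) != expected_fields


# Hypothetical sample line, reconstructed from the output shown above.
raw = ('123,"Here is some, text "and some quoted text" where the quotes '
       'should have been doubled",321')

print(fix_quotes(raw))
print(needs_manual_fix(raw, expected_fields=3))
```

As Emile noted, this cannot repair a field whose embedded quote sits
right next to a comma, which is why a mismatched field count is used
here to pick out the suspect lines instead of trusting the result.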

Next time I am going to be much more careful. Tab delimited is
probably better for my purpose, but I can definitely see there being
issues with invisible tab characters and other weirdness.
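
A small sketch of what tab-delimited output could look like with the
csv module, assuming the data is already in memory as lists of
strings. The rows are made up for illustration. The writer still
quotes any field that happens to contain a tab or a quote character,
so a stray embedded tab should survive a round trip.

```python
import csv
import io

# Made-up rows: one with a comma and quotes, one with an embedded tab.
rows = [
    ["123", 'Here is some, text "and some quoted text"', "321"],
    ["124", "a field with an embedded\ttab", "322"],
]

# Write tab-delimited text to an in-memory buffer instead of a file.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

# Read it back with the same delimiter and compare with the input.
buf.seek(0)
round_tripped = list(csv.reader(buf, delimiter="\t"))
print(round_tripped == rows)
```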

Ryan


