Using the CSV module

John Machin sjmachin at lexicon.net
Wed May 9 08:24:34 EDT 2007


On May 9, 6:40 pm, "Nathan Harmston" <ratchetg... at googlemail.com>
wrote:
> Hi,
>
> I've been playing with the CSV module for parsing a few files. A row
> in a file looks like this:
>
> some_id\t|\tsome_data\t|\tsome_more_data\t|\tlast_data\t\n
>
> so the lineterminator is \t\n and the delimiter is \t|\t. However, when
> I subclass Dialect and try to set delimiter to "\t|\t", it says the
> delimiter can only be a single character.
>
> I know it's an easy fix to just do .strip("\t") on the output I get,
> but I was wondering
> a) if there's a better way of doing this when the file is actually
> being parsed by the csv module

No; usually one would want at least to do .strip() on each field
anyway to remove *all* leading and trailing whitespace. Replacing
multiple whitespace characters with one space is often a good idea.
One may want to get fancier and ensure that NO-BREAK SPACE aka &nbsp;
(\xA0 in many encodings) is treated as whitespace.

So your gloriously redundant tabs vanish, for free.
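
A minimal sketch of that approach, assuming Python 3 and a sample row in
the format from the question (with consistent \t|\t separators). The
helper names clean() and read_rows() are made up for illustration. In
Python 3, str.split() with no arguments already treats NO-BREAK SPACE
as whitespace; with byte strings that would need separate handling.

```python
import csv
import io


def clean(field):
    # split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace; joining with one space collapses runs.
    return " ".join(field.split())


def read_rows(stream):
    # Use the single character "|" as the delimiter and let the
    # per-field cleanup remove the surrounding tabs.
    for row in csv.reader(stream, delimiter="|"):
        yield [clean(field) for field in row]


sample = "some_id\t|\tsome_data\t|\tsome_more_data\t|\tlast_data\t\n"
rows = list(read_rows(io.StringIO(sample, newline="")))
```

Here rows holds one list of four fields with the tabs stripped.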

> b) Why are delimiters only allowed to be one character in length?

Speed. The reader is a hand-crafted finite-state machine designed to
operate on a byte at a time. Allowing for variable-length delimiters
would increase the complexity and lower the speed -- for what gain?
How often does one see 2-byte or 3-byte delimiters?
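
If a multi-character delimiter really is unavoidable, and the data has
no quoted fields or embedded delimiters, plain string splitting may be
enough without the csv module at all. A rough sketch under those
assumptions (split_row() is a hypothetical helper, not part of csv):

```python
def split_row(line, delimiter="\t|\t", terminator="\t\n"):
    # Drop the line terminator if present, then split on the whole
    # multi-character delimiter. No quoting or escaping is handled.
    if line.endswith(terminator):
        line = line[:-len(terminator)]
    return line.split(delimiter)


sample = "some_id\t|\tsome_data\t|\tsome_more_data\t|\tlast_data\t\n"
fields = split_row(sample)
```

This gives up everything the csv reader does for quoting, so it only
suits files known to be this simple.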



