[Csv] Re: PEP 305 - Comments (really long post)

Thu Feb 6 14:36:21 CET 2003

Skip,

Well, nobody can say that I didn't try :-) I am almost giving up on my
crusade to convince you that numbers should be converted by the csv library.
It seems that we started from different assumptions, but now I think I've
understood what your objectives are.

I still have a few points to make, though:

1) There is one reason left to convert numbers before returning them, and this 
has a lot to do with information that is discarded in the process. Let us 
follow this example:

"row 1";10   -->  ("row 1", "10")

The second item of the returned tuple is a string, as you stated in your 
answer. The problem is that my application has no way to know if the value 
was originally written in the csv file with or without quotes; this 
information is lost because all values are 'normalized' by the csv library.
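
The loss of information can be seen in a minimal sketch. It assumes a reader
with the interface of the csv module in the current Python standard library
(csv.reader with a delimiter argument), which returns lists rather than the
tuples shown above. The two inputs differ only in the quoting of the second
field, yet the rows come back identical:

```python
import csv
import io

# Two inputs that differ only in whether the second field is quoted.
unquoted = io.StringIO('"row 1";10\n')
quoted = io.StringIO('"row 1";"10"\n')

row_a = next(csv.reader(unquoted, delimiter=';'))
row_b = next(csv.reader(quoted, delimiter=';'))

print(row_a)  # ['row 1', '10']
print(row_b)  # ['row 1', '10']
```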

If I know the structure of the csv file, then it's fine, but it's not so nice 
when you're trying to detect the structure of an arbitrary csv file. Take a 
look at another example, where the first column is called 'code', the second 
column is 'description', and the third one is 'cost'. Note that this example 
is similar to the structure used for files exported from project management 
software:

"1", "Project phase", 2000
"1.1", "Requirement analysis", 1000
"1.1", "Architectural design", 1000

In this case, MS Excel will detect the first column as a string, but will
convert the values in the third one to numeric format. It can do that because
it knows that the first column's values were quoted and the third column's
weren't. Now, when you return a tuple of strings, the user has no way to know
whether the quotes were present in the original file.

There are a few solutions to this problem, none of them fully satisfactory:

a) return the strings as proposed by you, which leaves the library unusable 
for situations as described above;

b) return strings in such a way that the original quotes are preserved. Then 
it will be up to the user to remove the extra quotes from the "real" strings;

c) convert unquoted numeric values to native numbers (ints or floats) when 
returning the row (as proposed by myself in my previous messages);

d) provide an alternative method to retrieve more information - for example, a
second tuple with a more detailed description of how the line was analysed.
While more complex, this approach has some advantages: (1) it does not make
the usual code any more complex, and (2) the extra information will help to
implement 'smarter' csvreaders.

Other alternatives may exist, but I think that the list above sums up very 
well the practical options.
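
Option (d) could look roughly like the sketch below. It is only an
illustration: split_with_quote_info is a hypothetical helper, not part of the
PEP or of any library, and it handles a single line with no embedded newlines.
It returns the values together with a second tuple that records which fields
were quoted:

```python
def split_with_quote_info(line, delimiter=',', quotechar='"'):
    """Split one CSV line; return (values, quoted_flags).

    Hypothetical helper for illustration only. Handles doubled quotes
    inside quoted fields and skips spaces before a field.
    """
    values, flags = [], []
    i, n = 0, len(line)
    while True:
        # Skip spaces before the field starts.
        while i < n and line[i] == ' ':
            i += 1
        if i < n and line[i] == quotechar:
            # Quoted field: read up to the closing quote.
            i += 1
            chars = []
            while i < n:
                if line[i] == quotechar:
                    if i + 1 < n and line[i + 1] == quotechar:
                        chars.append(quotechar)  # doubled quote, keep one
                        i += 2
                        continue
                    i += 1  # step over the closing quote
                    break
                chars.append(line[i])
                i += 1
            values.append(''.join(chars))
            flags.append(True)
            # Ignore anything between the closing quote and the delimiter.
            while i < n and line[i] != delimiter:
                i += 1
        else:
            # Unquoted field: read up to the next delimiter.
            start = i
            while i < n and line[i] != delimiter:
                i += 1
            values.append(line[start:i].rstrip())
            flags.append(False)
        if i >= n:
            break
        i += 1  # step over the delimiter
    return tuple(values), tuple(flags)


values, quoted = split_with_quote_info('"1", "Project phase", 2000')
print(values)  # ('1', 'Project phase', '2000')
print(quoted)  # (True, True, False)
```

Code that does not care about quoting can simply ignore the second tuple.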

2) In your answer, you cite the case where some numeric values can be hex, or
some other base. Well, I don't agree with your argument. One of Python's
mottos is "to make simple things simple". The simplest case is base 10
integers; if the library can deal with them in a sane way, you're solving
the problems of the vast majority of users. Special cases are just that,
special, and will be treated in a special fashion anyway.
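
As a minimal sketch of the simple case, assuming the reader can tell whether a
field was quoted: convert_field below is a hypothetical helper, not a proposed
API. It converts unquoted base 10 numbers and leaves everything else,
including hex, as a string for the application to handle:

```python
def convert_field(text, was_quoted):
    """Return an int or float for unquoted base 10 numbers, else the string.

    Hypothetical helper: was_quoted has to come from a reader that
    preserves quoting information.
    """
    if was_quoted:
        return text
    for numeric_type in (int, float):
        try:
            return numeric_type(text)
        except ValueError:
            pass
    # Hex, dates, codes and so on are left to the application.
    return text


print(convert_field('2000', False))   # 2000
print(convert_field('1.1', True))     # 1.1 (still a string)
print(convert_field('0x1F', False))   # 0x1F (still a string)
```

Note that int() and float() accept a little more than plain digits (for
example surrounding whitespace, or 'inf' and 'nan' for floats), so a stricter
check may be wanted in practice.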

3) I'm not sure if str() is localized for floats. Using the standard
installation of PythonWin with a fully localized copy of Windows, it still
uses a period as the decimal point, not a comma. I didn't try to change the
locale manually (I have never done that in Python); I'll try and tell you
what happens.

BTW, I'm sure that repr() isn't localized, because the syntax for floats is 
not locale-dependent, but you are probably aware of this fact. But I'm
afraid that str() and repr() calls may end up calling the same function in
the case of floats.
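
A quick way to check this, sketched for current Python 3 (where, as far as I
know, str() and repr() give the same text for a float). The locale-aware
formatting lives in the locale module instead; locale.str() follows the
LC_NUMERIC setting, which stays at "C" unless the program calls
locale.setlocale():

```python
import locale

value = 1234.5

# str() and repr() follow Python's own syntax for floats.
print(str(value))   # 1234.5
print(repr(value))  # 1234.5

# locale.str() consults the current LC_NUMERIC setting, so its output
# depends on whether (and how) locale.setlocale() was called earlier.
print(locale.str(value))
```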

4) I'm not convinced that passing a binary file is a good idea. Reading the
PEP, I assumed that the csvreader constructor just takes any object that can
return lines. Well, binary file objects do not meet this definition. Requiring
them would make the system much less flexible, and would make it more
difficult to pass arbitrary iterables to the csv library.

For the sake of simplicity and clarity, why not leave the line termination
option out of the csv library, in such a way that it can be implemented in
the file object passed to the reader? The csv library would then be less
dependent on implementation details of the file, and could focus more on how
to interpret the content of the lines.
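
To illustrate what "any object that can return lines" buys, here is a small
sketch. It assumes the behaviour of csv.reader in the current Python standard
library, which accepts any iterable of strings; strip_comments is a
hypothetical filter written only for this example:

```python
import csv

# Any iterable of text lines will do: a list, a generator, a text file.
lines = ['code,description,cost', '"1","Project phase",2000']


def strip_comments(source):
    """Hypothetical filter: drop lines starting with '#'."""
    for line in source:
        if not line.startswith('#'):
            yield line


rows = list(csv.reader(strip_comments(['# exported file'] + lines)))
print(rows)
# [['code', 'description', 'cost'], ['1', 'Project phase', '2000']]
```

Line termination is handled by whatever produces the lines, so the reader
only has to interpret their content.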

5) I agree that fixed width text files are different beasts. Anyway, it should
be possible to handle them using the same interface (or API, whatever you
like to call it). Things like that make the learning curve smoother. But we
can leave this discussion for a later time.

Thanks for your comments, and please forgive my insistence :-)

Carlos Ribeiro
cribeiro at mail.inet.com.br