[Csv] Re: PEP 305 - Comments (really long post)
Carlos Ribeiro
cribeiro at mail.inet.com.br
Thu Feb 6 14:36:21 CET 2003
Skip,
Well, nobody can say that I didn't try :-) I'm almost giving up on my crusade to
convince you that numbers should be converted by the csv library. It seems
that we started from different assumptions, but now I think I understand
what your objectives are.
I still have a few points to make, though:
1) There is one reason left to convert numbers before returning them, and this
has a lot to do with information that is discarded in the process. Let us
follow this example:
"row 1";10 --> ("row 1", "10")
The second item of the returned tuple is a string, as you stated in your
answer. The problem is that my application has no way to know if the value
was originally written in the csv file with or without quotes; this
information is lost because all values are 'normalized' by the csv library.
If I know the structure of the csv file, then it's fine, but it's not so nice
when you're trying to detect the structure of an arbitrary csv file. Take a
look at another example, where the first column is called 'code', the second
column is 'description', and the third one is 'cost'. Note that this example
is similar to the structure used for files exported from project management
software:
"1", "Project phase", 2000
"1.1", "Requirement analysis", 1000
"1.1", "Architectural design", 1000
In this case, MS Excel will detect the first column as a string, but will
convert the values in the third one to numeric format. It can do that because it
knows that the first column's values were quoted and the third column's were
not. Now, when you return a tuple of strings, the user has no way to know
whether the quotes were present in the original file.
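
The loss of quoting information can be reproduced with a short sketch. It assumes the csv module as it later shipped in the Python standard library, which returns each row as a list of strings rather than the tuple shown above; the `parse` helper is only an illustrative name.

```python
import csv
import io


def parse(text):
    """Parse semicolon-delimited text and return all rows as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=";"))


# Same data, once with the number unquoted and once with it quoted.
unquoted = parse('"row 1";10\n')
quoted = parse('"row 1";"10"\n')

print(unquoted)            # [['row 1', '10']]
print(quoted)              # [['row 1', '10']]
print(unquoted == quoted)  # True: the caller cannot tell the two inputs apart
```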
There are a few solutions for this problem, none of them fully satisfactory:
a) return the strings as proposed by you, which leaves the library unusable
for situations like the one described above;
b) return strings in such a way that the original quotes are preserved. Then
it will be up to the user to remove the extra quotes from the "real" strings;
c) convert unquoted numeric values to native numbers (ints or floats) when
returning the row (as proposed by myself in my previous messages);
d) provide an alternative method to retrieve more information - for example, a
second tuple with a more detailed description of how the line was analysed.
While more complex, this approach has some advantages: (1) it does not make
the usual code any more complex, and (2) the extra information will help to
implement 'smarter' csvreaders.
Other alternatives may exist, but I think that the list above sums up very
well the practical options.
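
Options (c) and (d) can be sketched as follows. For (c), the sketch relies on `csv.QUOTE_NONNUMERIC` from the csv module as it later shipped, which makes the reader convert every unquoted field to a float. For (d), `split_with_quote_flags` is a hypothetical helper written only for illustration, not part of any library; it is deliberately simplified and does not handle embedded newlines or escape characters.

```python
import csv
import io

SAMPLE = (
    '"1", "Project phase", 2000\n'
    '"1.1", "Requirement analysis", 1000\n'
    '"1.1", "Architectural design", 1000\n'
)

# Option (c): unquoted fields come back as floats, quoted ones as strings.
converted_rows = list(
    csv.reader(io.StringIO(SAMPLE), quoting=csv.QUOTE_NONNUMERIC,
               skipinitialspace=True)
)


def split_with_quote_flags(line, delimiter=",", quotechar='"'):
    """Option (d), hypothetical: return (values, was_quoted) for one line."""
    values, flags = [], []
    chars, quoted, in_quotes = [], False, False
    line = line.rstrip("\r\n")
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quotechar and line[i + 1:i + 2] == quotechar:
                chars.append(quotechar)  # doubled quote is a literal quote
                i += 1
            elif ch == quotechar:
                in_quotes = False        # closing quote
            else:
                chars.append(ch)
        elif ch == quotechar and not "".join(chars).strip():
            # Opening quote at the start of a field; drop leading blanks.
            chars, quoted, in_quotes = [], True, True
        elif ch == delimiter:
            text = "".join(chars)
            values.append(text if quoted else text.strip())
            flags.append(quoted)
            chars, quoted = [], False
        else:
            chars.append(ch)
        i += 1
    text = "".join(chars)
    values.append(text if quoted else text.strip())
    flags.append(quoted)
    return values, flags


values, was_quoted = split_with_quote_flags('"1", "Project phase", 2000')
print(converted_rows[0])  # ['1', 'Project phase', 2000.0]
print(values)             # ['1', 'Project phase', '2000']
print(was_quoted)         # [True, True, False]
```

With the second tuple available, ordinary code can keep using the plain values, while a structure-detecting reader can consult the flags to decide which columns to treat as numbers.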
2) In your answer, you cite the case where some numeric values can be hex, or
some other base. Well, I don't agree with your argument. One of
Python's mottos is "to make simple things simple". The simplest case is base
10 integers; if the library can deal with them in a sane way, you're solving
the problems of the vast majority of the users. Special cases are just that,
special, and will be treated in a special fashion anyway.
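
A minimal sketch of that "simple case first" rule: only plain base 10 integers are converted, and everything else, hex included, is left as a string for the application to handle. The helper name `convert_decimal_int` is hypothetical.

```python
import re

# An optional sign followed by ASCII decimal digits only.
_DECIMAL_INT = re.compile(r"[+-]?[0-9]+")


def convert_decimal_int(text):
    """Return int(text) for plain base 10 integers, else text unchanged."""
    if _DECIMAL_INT.fullmatch(text):
        return int(text)
    return text


row = ["10", "-3", "0x1F", "1.5", "row 1"]
converted = [convert_decimal_int(value) for value in row]
print(converted)  # [10, -3, '0x1F', '1.5', 'row 1']
```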
3) I'm not sure if str() is localized for floats. Using the standard
installation of PythonWin with a fully localized copy of Windows, it still
uses a period as the decimal point - not a comma. I didn't try to change the
locale manually (I have never done that in Python); I'll try and tell you what
happens.
BTW, I'm sure that repr() isn't localized, because the syntax for floats is
not locale-dependent, but you are probably aware of this fact. But I'm
afraid that str() and repr() calls may end up calling the same function in
the case of floats.
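
The question can be checked with a short experiment. This sketch reflects current Python 3, where str() and repr() of a float produce the same text and neither consults the locale; locale-aware output has to be requested explicitly through the `locale` module. The locale name `pt_BR.UTF-8` is an assumption and may not be installed on a given system, in which case the helper returns None.

```python
import locale

value = 1234.5
plain_str = str(value)    # '1234.5'
plain_repr = repr(value)  # '1234.5'


def localized(number, locale_name):
    """Return (locale.str(number), str(number)) under locale_name, or None."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.str(number), str(number)
    except locale.Error:
        return None  # the requested locale is not available here
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


print(plain_str, plain_repr)
print(localized(value, "pt_BR.UTF-8"))
```

Where the locale is available, only the first element of the returned pair uses a comma as the decimal separator; the second, produced by str(), keeps the period.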
4) I'm not convinced that passing a binary file is a good idea. Reading the
PEP I assumed that the csvreader constructor just takes any object that can
return lines. Well, binary file objects do not meet this definition. It would
make the system much less flexible, making it more difficult to pass
arbitrary iterables to the csv library.
For the sake of simplicity and clarity, why not leave the line termination
option out of the csv library, in such a way that it can be implemented in
the file object passed to the reader? The csv library would be less dependent
on implementation details of the file, focusing more on how to interpret the
content of the lines.
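
As a sketch of this separation, the csv module as it later shipped accepts any iterable that yields strings, so line-ending handling can live in a wrapper outside the reader. The generator `strip_line_endings` is a hypothetical helper; note that stripping endings this way would not suit quoted fields that contain embedded newlines.

```python
import csv


def strip_line_endings(raw_lines):
    """Yield each line without its trailing CR/LF, whatever the convention."""
    for line in raw_lines:
        yield line.rstrip("\r\n")


# Lines with mixed terminators, as they might arrive from different sources.
raw = [
    "code,description,cost\r\n",
    "1,Project phase,2000\r",
    "1.1,Requirement analysis,1000\n",
]

rows = list(csv.reader(strip_line_endings(raw)))
print(rows)
# [['code', 'description', 'cost'], ['1', 'Project phase', '2000'],
#  ['1.1', 'Requirement analysis', '1000']]
```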
5) I agree that fixed width text files are different beasts. Anyway, it should
be possible to handle them using the same interface (or API, whatever you
like to call it). Things like that make the learning curve smoother. But we
can leave this discussion for a later time.
Thanks for your comments, and please forgive my insistence :-)
Carlos Ribeiro
cribeiro at mail.inet.com.br