Unicode and Zipfile problems

Gerson Kurz gerson.kurz at t-online.de
Fri Nov 7 14:54:01 EST 2003


martin at v.loewis.de wrote:

>I'd be curious what precisely it is that has two modes.

The Python interpreter. 

>> "dontcare" can be implemented easily enough by changing site.py (I
>> only have to find out how to remove that s***** DeprecationWarning
>> introduced in 2.3 for source with german comments, for christ's sake!
>> comments!). 
>
>You can add a warning filter to site.py. 

Thanks, but could you please elaborate a little on that? Are you
suggesting I write my own import hook to filter the warning there? Is
there any other mode? It seems the warning is generated inside the
python core (tokenizer.c, line 478) and not in the lib python files. 
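For reference, a filter of the kind suggested might look roughly like the
sketch below. It assumes the warning is a DeprecationWarning whose text
begins with "Non-ASCII character" (the exact wording is an assumption),
and the helper names install_nonascii_filter and count_shown are made up
for illustration. Whether a filter installed from site.py takes effect
before the tokenizer warns about the main script is not something this
sketch settles.

```python
import warnings


def install_nonascii_filter():
    """Ignore DeprecationWarnings that start with 'Non-ASCII character'.

    Assumption: the message prefix matches what the interpreter emits.
    The pattern is matched against the start of the warning text.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Non-ASCII character",
        category=DeprecationWarning,
    )


def count_shown(messages):
    """Emit each message as a DeprecationWarning; count those not filtered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")   # show everything by default
        install_nonascii_filter()         # inserted in front, so it wins
        for text in messages:
            warnings.warn(text, DeprecationWarning)
    return len(caught)
```

Putting a call to install_nonascii_filter() into site.py (or
sitecustomize.py) would be one place to try it; other DeprecationWarnings
still get through.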

>else "dontcare" would do, specifically.

It would be like a Python build without Unicode support. A cursory
glance at the Python source reveals that probably

#ifdef Py_WIN_WIDE_FILENAMES

is used to encapsulate the W-API of windows; but it seems that unicode
support is rather deeply entrenched in the source (at least there are
no obvious #ifdefs in unicodeobject.c / unicodectype.c) - damnit I was
hoping this to go fast ;) 

Basically, what is annoying about the way python handles unicode is
this:

a) you get warnings when you do stuff you've been doing for years
without ever getting any warning. 

b) it forces you to be correct - even when you don't care. 

Now of course there is nothing wrong in being correct: it is only that
sometimes it is not worth the effort and you don't care and you are
STILL forced to care about it. 

Like, you want to write a small script that dumps some registry keys
to stdout. Bang, you get an exception because there is a German umlaut
in one of these. Now, ANSI C code that dumps the registry will
perhaps not display the right character, screw the console CP, but at
least you can read the thing, because you're a human and not a stupid
computer and that is all that matters. And normally you don't even
know WHAT encoding to use. 

Maybe it's time for a "UNICODE for dummies" section in the python
manual. But maybe it's also time for a more relaxed way of handling all
that?

So, back to the two ways in which the Python unicode handling is
annoying - it would be fine if you could easily change "strict"
encoding to "relaxed" (I'm not sure about that, but toying with my
dontcare.py I see that there is a parameter to the en/coding
functions, so maybe one could set default encoding = OS locale
encoding (see below) and disable exceptions when something goes wrong).

That way 

a) you DON'T get warnings when you do stuff you've been doing for years
without ever getting any warning. 

b) it DOES NOT force you to be correct - even when you don't care. 

so I at least would be happy with that. 
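To make the "relaxed" idea concrete, here is a minimal sketch of what that
parameter allows. It assumes the parameter in question is the error
handler argument of encode() ("strict" by default, "replace" to substitute
a "?"), and it takes the OS locale encoding from
locale.getpreferredencoding(). The function name relaxed_encode is
hypothetical, not something from dontcare.py.

```python
import locale


def relaxed_encode(text, encoding=None):
    """Encode text without raising; unencodable characters become '?'.

    Assumption: when no encoding is given, the locale's preferred
    encoding is a reasonable stand-in for "OS locale encoding".
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # "replace" swaps the default "strict" handler for a lossy one.
    return text.encode(encoding, "replace")


if __name__ == "__main__":
    # A German word with an umlaut and a sharp s, forced through ASCII.
    print(relaxed_encode(u"Gr\u00fc\u00dfe", "ascii"))
```

The result is lossy, but a human can still read it, which is the whole
point of not caring.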

>Yes, please do. What is the difference between the C implementation
>and the OS implementation?

a) Last time I checked, strftime gives you a date and time
representation for the current locale. As in: one date and time
representation ("%x %X"). However, you have like long and short dates.
Ask the simple question: do you put the time before the date or after?
You have times that require millisecond precision (e.g. timestamps in
a tracefile) and you have times that are just hours. If you want to
include milliseconds, you will have to resort to a manual guess as to
what the time format is, and so on. Not including the fact that, at
least on windows, there are at least three different time "classes"
used in python, from the time module, the datetime module, and
win32api.GetLocalTime() / win32api.GetSystemTime().

b) The C implementation is part of the C runtime library (as the name
would indicate), and you can read it, for example, in the CRT sources; if
you install DevStudio6, you will find it e.g. here

C:\Program Files\Microsoft Visual Studio\VC98\CRT\SRC\STRFTIME.C

(The default install doesn't copy these files, you have to set
"advanced" options during setup IIRC). The OS version is a set of API
functions called the "National Language Support Functions", which
contains the functions GetDateFormat and GetTimeFormat which have a
completely different syntax and are used by other applications (such
as, yuk, VB). If you look at the API documentation, you'll notice that
the two versions have different options.

c) I run an English version of Windows 2000, but I have German locale
settings. Windows distinguishes between "system locale" and "user
locale". Many applications, virtually all of them, use the user locale
settings (that is, German). Python uses the C default which is - well
I'm not really sure whether or not it's English, but it certainly isn't
German by (OS) default.

d) The documentation for the locale format says you should set "de" or
"de_DE", but "GERMAN" is the actual locale for "german". But how do
you know? And how do you add functionality to your application to
always use the user's locale (i.e. German on my English system - as any
other app including stupid MFC apps can do)? 
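As a rough illustration of points a), c) and d): passing an empty string
to locale.setlocale() asks for the user's default settings instead of a
guessed name like "de_DE" or "GERMAN", and milliseconds have to be bolted
onto the "%x %X" output by hand. This is only a sketch; the names
use_user_locale and stamp_with_millis are made up, the fallback to the "C"
locale is an assumption about what one would want, and the locale name
returned differs between platforms.

```python
import locale
from datetime import datetime


def use_user_locale():
    """Switch from the default "C" locale to the user's settings.

    Returns the locale name the C library reports.
    """
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Assumption: fall back to "C" if the environment is misconfigured.
        return locale.setlocale(locale.LC_ALL, "C")


def stamp_with_millis(moment):
    """Format date and time for the current locale, plus milliseconds.

    strftime has no millisecond directive, so the value is appended
    by hand; where it ends up relative to the time is exactly the
    "manual guess" problem (a locale with an AM/PM suffix looks odd).
    """
    millis = moment.microsecond // 1000
    return "%s.%03d" % (moment.strftime("%x %X"), millis)


if __name__ == "__main__":
    print(use_user_locale())
    print(stamp_with_millis(datetime.now()))
```

This still goes through the C library's strftime; it does not call the
GetDateFormat / GetTimeFormat API, so long versus short date formats
remain out of reach this way.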



