Trouble saving unicode text to file

John Machin sjmachin at lexicon.net
Sun May 8 18:39:40 EDT 2005


On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
<martin at v.loewis.de> wrote:

>John Machin wrote:
>> Martin, I can't guess the reason for this last suggestion; why should
>> a Windows system use iso-8859-1 instead of cp1252?
>
>Windows users often think that windows-1252 is the same thing as
>iso-8859-1, and then exchange data in windows-1252, but declare them
>as iso-8859-1 (in particular, this is common for HTML files).
>iso-8859-1 is more portable than windows-1252, so it should be
>preferred when the data need to be exchanged across systems.

Martin, it seems I'm still a long way short of enlightenment; please
bear with me:

Terminology disambiguation: what I call "users" wouldn't know what
'cp1252' and 'iso-8859-1' were. They're not expected to know. They
just type in whatever characters they can see on their keyboard or
find in the charmap utility. It's what I'd call 'admins' and
'developers' who should know better, but often don't.

1. When exchanging data across systems, shouldn't utf-8 be preferred
over either of them?

2. If the Windows *users* have been using characters that are in
cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
will raise a UnicodeEncodeError, as with the euro sign here:

>>> import unicodedata
>>> euro_win = chr(128)
>>> euro_uc = euro_win.decode('cp1252')
>>> euro_uc
u'\u20ac'
>>> unicodedata.name(euro_uc)
'EURO SIGN'
>>> euro_iso = euro_uc.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
>>>

I find it a bit hard to imagine that the euro sign wouldn't get a fair
bit of usage in Swedish data processing even if it's not their own
currency.

3. How portable is a character set that doesn't include the euro sign?
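
To put points 1 to 3 side by side, here is a rough sketch that lists
every character cp1252 can represent but iso-8859-1 cannot, and then
shows the euro sign surviving a utf-8 round trip. It is written in
Python 3 syntax, which is an assumption on my part (the session above
is Python 2), and the helper name cp1252_only_chars is made up for
illustration. It relies on the cp1252 codec rejecting the few bytes
that code page leaves unassigned.

```python
import unicodedata


def cp1252_only_chars():
    """Return the characters that cp1252 encodes but iso-8859-1 does not.

    Hypothetical helper for illustration. Only bytes 0x80-0x9F are
    checked, because that is the range where the two encodings differ.
    """
    found = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # this byte is unassigned in cp1252
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            found.append(ch)  # representable in cp1252 only
    return found


if __name__ == "__main__":
    for ch in cp1252_only_chars():
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

    # utf-8 can carry every one of these, including the euro sign.
    euro = "\u20ac"
    encoded = euro.encode("utf-8")
    print(repr(encoded), encoded.decode("utf-8") == euro)
```

Any file containing one of the characters it prints cannot be
converted to iso-8859-1 without an error or a lossy substitution,
whereas the utf-8 round trip at the end is lossless.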

Regards,
John


