Trouble saving unicode text to file

"Martin v. Löwis" martin at v.loewis.de
Mon May 9 17:44:35 EDT 2005


John Machin wrote:
> Terminology disambiguation: what I call "users" wouldn't know what
> 'cp1252' and 'iso-8859-1' were. They're not expected to know. They
> just type in whatever characters they can see on their keyboard or
> find in the charmap utility. It's what I'd call 'admins' and
> 'developers' who should know better, but often don't.

I was talking about 'users' of Python, so they are 'developers'.
They often don't know what cp1252 is.

> 1. When exchanging data across systems, should not utf-8 be
> preferred???

It depends on the data, of course. People writing UTF-8 into
text files often find that their editors don't display them
correctly, in which case UTF-8 might not be the best choice.
For example, the Python source code in CVS is required to be
iso-8859-1, primarily because this is what interoperates best
across all development platforms.

For data in XHTML, the answer would be different: every XML
processor is supposed to support UTF-8.

> 2. If the Windows *users* have been using characters that are in
> cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
> will cause an exception. 

Correct.
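
A minimal sketch of that failure, assuming Python 3 string semantics
(where str is always Unicode). The helper name fits() is hypothetical
and exists only for this illustration:

```python
def fits(text, encoding):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # At least one character has no representation in this charset.
        return False
    return True

price = "100 \u20ac"  # U+20AC EURO SIGN: present in cp1252, absent from iso-8859-1

print(fits(price, "cp1252"))            # True
print(fits(price, "iso-8859-1"))        # False
print(fits("L\u00f6wis", "iso-8859-1")) # True: o-umlaut is in both charsets
```

Passing errors="replace" to encode() would avoid the exception, but
only by silently substituting "?" for the euro sign.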

> I find it a bit hard to imagine that the euro sign wouldn't get a fair
> bit of usage in Swedish data processing even if it's not their own
> currency.

Yes, so the question is how to represent it. It all depends on the
application, but for the moment it is safer to assume only iso-8859-1,
unless it is guaranteed that all code that reads the file really
knows what cp1252 is, and what \x80 means in that charset.
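
To make the \x80 point concrete, here is a small sketch (again
assuming Python 3) that decodes the same byte under both charsets.
The variable names are only illustrative:

```python
import unicodedata

raw = b"\x80"

as_cp1252 = raw.decode("cp1252")      # cp1252 assigns 0x80 to the euro sign
as_latin1 = raw.decode("iso-8859-1")  # iso-8859-1 maps 0x80 to code point U+0080

print(unicodedata.name(as_cp1252))      # EURO SIGN
print(unicodedata.category(as_latin1))  # Cc (a control character, not printable)
```

A reader that assumes iso-8859-1 therefore does not fail on the byte;
it quietly produces a control character where the euro sign was meant.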

> 3. How portable is a character set that doesn't include the euro sign?

Well, how portable is ASCII? It doesn't support certain characters,
sure. If you don't need these characters, this is not a problem. If
you do need the extra characters, you need to think carefully about
which encoding meets your needs best. I was merely suggesting that
cp1252 is often used without that thought, causing mojibake later.
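
One common way this garbling arises can be sketched as follows,
assuming Python 3 and assuming the writer used UTF-8 while the reader
guessed cp1252 (the scenario is an example, not taken from the thread):

```python
original = "L\u00f6wis"            # "Löwis"

data = original.encode("utf-8")    # the o-umlaut becomes two bytes: C3 B6
garbled = data.decode("cp1252")    # each byte is read as a separate character

print(len(original), len(garbled)) # 5 6
print(garbled == "L\u00c3\u00b6wis")  # True: A-tilde and pilcrow replace the o-umlaut

# As long as nothing else has touched the text, the mistake is reversible.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)        # True
```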

If representation of the euro sign is an issue, the choices are
iso-8859-15, cp1252, and UTF-8. Of those three, I would pick
cp1252 last if at all possible, because it is specific to a
vendor (i.e. non-standard).
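
For comparison, a short sketch (assuming Python 3 and its standard
codec names) of how each of the three candidates stores the euro sign:

```python
euro = "\u20ac"  # U+20AC EURO SIGN

# Encode the same character under each candidate encoding.
encoded = {name: euro.encode(name) for name in ("iso-8859-15", "cp1252", "utf-8")}

for name, data in encoded.items():
    print(name, data)
# iso-8859-15 b'\xa4'
# cp1252 b'\x80'
# utf-8 b'\xe2\x82\xac'
```

The three byte sequences have nothing in common, so a file containing
the euro sign can only be read correctly if its encoding is known.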

Regards,
Martin


