how to write a unicode string to a file ?

Stephen Hansen apt.shansen at gmail.com
Fri Oct 16 20:49:41 EDT 2009


On Fri, Oct 16, 2009 at 5:07 PM, Stef Mientki <stef.mientki at gmail.com> wrote:

Unfortunately, there is no simple answer to these questions.


> Thanks guys,
> I didn't know the codecs module,
> and the codecs seems to be a good solution,
> at least it can safely write a file.
> But now I have to open that file in Excel 2000 ... 2007,
> and I get something completely wrong.
> After changing codecs to latin-1 or windows-1252,
> everything works fine.
>
> Which of the 2 should I use latin-1 or windows-1252 ?
>
>
You should use the encoding that the file is expected to be in; it was
saved in a certain, explicit encoding. You may be able to find what it is on
the web, but you'll have to find it and use that. Every file may be
different.

There's no universal right answer, and there's not really any way to tell
what the answer SHOULD be, short of trying various encodings until one
works. Doing research to find out what this other program saves or expects
to open is all you can do. It wouldn't surprise me if Excel used cp1252 by
default; that's vaguely sorta like ISO-8859-1 (also known as latin1), except
for the bytes 0x80-0x9F, where cp1252 puts printable characters such as the
euro sign and curly quotes. The two are similar enough that they are
confused in a lot of software, with odd results.
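
To make the cp1252/latin1 difference concrete, here is a small sketch
(Python 3 syntax, which is my assumption; the helper name differing_bytes is
made up for illustration). It decodes the same bytes both ways and lists the
byte values where Python's two codec tables disagree:

```python
# Decode the same bytes as cp1252 and as latin-1 and compare the results.
raw = bytes([0x41, 0xE9, 0x80, 0x93])  # 'A', e-acute, then two bytes in 0x80-0x9F

as_cp1252 = raw.decode("cp1252")   # 0x80 -> euro sign, 0x93 -> left curly quote
as_latin1 = raw.decode("latin-1")  # 0x80 and 0x93 -> invisible C1 control chars


def differing_bytes():
    """Return the byte values that cp1252 and latin-1 decode differently.

    A byte that Python's cp1252 table leaves undefined also counts as
    different, because latin-1 defines all 256 byte values.
    """
    diffs = []
    for value in range(256):
        single = bytes([value])
        latin1_char = single.decode("latin-1")
        try:
            cp1252_char = single.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # undefined in Python's cp1252 table
        if cp1252_char != latin1_char:
            diffs.append(value)
    return diffs


print(repr(as_cp1252))
print(repr(as_latin1))
print([hex(v) for v in differing_bytes()])
```

Everything from 0xA0 upward (accented letters and so on) comes out the same
either way, which is why the mix-up often goes unnoticed until a euro sign
or a curly quote shows up.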

The thing is, I'd be VERY surprised (nay, shocked!) if Excel can't open a
file that is in UTF8-- it just might need to be TOLD that it's utf8 when you
go and open the file, as UTF8 looks just like ASCII -- until it contains
characters that can't be expressed in ASCII. But I don't know what type of
file it is you're saving.
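
A quick sketch of that "looks just like ASCII" point, in Python 3 syntax (my
assumption). It also shows the "utf-8-sig" codec, which prepends a byte
order mark that some programs use as a hint that a file is UTF8. Whether
your particular Excel version honours that mark is something you'd have to
test; I'm not claiming it does:

```python
import codecs

ascii_only = "plain text"
with_accent = "na\u00efve"  # contains i-diaeresis, which ASCII cannot express

# For pure ASCII text the UTF-8 bytes and the ASCII bytes are identical.
same_bytes = ascii_only.encode("utf-8") == ascii_only.encode("ascii")

# A non-ASCII character becomes a multi-byte sequence in UTF-8.
utf8_bytes = with_accent.encode("utf-8")

# "utf-8-sig" writes the same bytes, preceded by the UTF-8 byte order mark.
marked_bytes = with_accent.encode("utf-8-sig")
has_bom = marked_bytes.startswith(codecs.BOM_UTF8)

print(same_bytes, utf8_bytes, marked_bytes, has_bom)
```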



> And a more general question, how should I organize my Python programs ?
> In general I've data coming from Excel, Delphi, SQLite.
> In Python I always use wxPython, so I'm forced to use unicode.
> My output often needs to be exported to Excel, SPSS, SQLite.
> So would this be a good design ?
>

I have no idea what SPSS is, but in general the way I handle these issues
is by following these rules:

-  Convert to Unicode from the earliest possible point; the moment data gets
into my code, I convert it to unicode.
  - There have to be some heuristics / intelligent tests to determine HOW you
convert it to unicode: you really have to /know/ beforehand what the data
was encoded in, in order to do so. You can often assume it's ASCII, but
unfortunately, that only works until the moment when it's not. And it
eventually will not be, guaranteed; you will have to base this decision on
the type of source you're getting the data from. Is it from a file? If so,
what kind of file? Does the program which produced it always write out a
certain encoding? Or is it variable? Is it something user-specified (e.g.,
in an environment variable or preference), etc.? Regardless, the moment you
get data-- convert it into unicode, with unicode(data,
"<original-encoding>") ... the original-encoding is whatever you determine
the encoding of the data was before you got it.

- All private storage should be stored as unicode, encoded to UTF8. Private
meaning, other programs you don't control don't mess with it. This should
include data files AND databases-- you should be storing unicode as UTF8 in
SQLite. See the 'pragma encoding' instruction at
http://www.sqlite.org/pragma.html

- Do all processing in your program as unicode.

- Encode the data at the last possible moment during the output process,
according to whatever it needs to be; if at all possible encode at output as
UTF8 if other programs can handle it, as life will one day be better when
all programs can be on the same page here. But that's not always possible:
when not, be sure the decision of what encoding to use when writing the data
out is something your program remembers or can determine at a later point--
so that when/if you need to read it in, you know what encoding it was
written out to.

- Be prepared when writing data out to experience an error if the internal
unicode data contains a character which can't be expressed in the limited
output encoding, if you're forced to use something non-UTF8 like latin1 or
cp1252. If you're constrained to having to support these limited character
sets, then you're going to have to make sure you handle that situation
gracefully-- either by using an error handler when you encode it (e.g.,
line.encode("latin1", "ignore"), which will exclude any characters that
can't be handled in latin1; I usually prefer line.encode("latin1",
"xmlcharrefreplace"), which will replace the non-working characters with
&#1234; type notation) or by including validators in your UI which reject
characters that can't fit into a desired encoding. Even with that validator,
use unicode-strings internally.

- Mourn for the good ol' days when you could actually believe in the
pleasant fiction that something such as 'plain text' existed-- it never
really has. :)
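
The rules above can be sketched roughly as follows. This is written for
Python 3 (my assumption), where str is the unicode type and
raw_bytes.decode(encoding) plays the role of unicode(data, encoding). The
helper names read_text and write_text are made up for illustration, and the
input is assumed to be known latin1:

```python
def read_text(raw_bytes, source_encoding):
    """Decode incoming bytes to unicode as early as possible."""
    return raw_bytes.decode(source_encoding)


def write_text(text, target_encoding, errors="strict"):
    """Encode unicode to bytes at the last possible moment."""
    return text.encode(target_encoding, errors)


incoming = b"caf\xe9"                  # bytes we know to be latin1
text = read_text(incoming, "latin-1")  # unicode from here on
text = text + " \u20ac5"               # processing may add a euro sign

# UTF8 can express everything, so this never fails on valid unicode.
utf8_out = write_text(text, "utf-8")

# latin1 has no euro sign, so a strict encode raises an error.
try:
    write_text(text, "latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The two error handlers mentioned above.
dropped = write_text(text, "latin-1", "ignore")
escaped = write_text(text, "latin-1", "xmlcharrefreplace")

print(utf8_out, strict_failed, dropped, escaped)
```

Note that "ignore" silently loses data, while "xmlcharrefreplace" keeps the
information but only makes sense if whatever reads the file understands
that notation.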

So: I'd use unicode entirely -- in wxPython, stored in your database, and
you should be able to use it in Delphi too, I believe? It's been years since
I used Delphi, though. At the point where there's a barrier between 'your'
stuff and 'other stuff', you convert from unicode into an encoding-- if you
must.
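
For the database part, a minimal sketch using the standard sqlite3 module
(Python 3 syntax; the table and column names are invented for the example).
The pragma has to be issued before the database has any content, which is
why it comes first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Only takes effect on a database that has not been created yet.
conn.execute('PRAGMA encoding = "UTF-8"')
conn.execute("CREATE TABLE notes (body TEXT)")  # hypothetical table

original = "na\u00efve caf\u00e9 \u20ac"
# Pass unicode strings as parameters; the module handles the encoding.
conn.execute("INSERT INTO notes (body) VALUES (?)", (original,))

db_encoding = conn.execute("PRAGMA encoding").fetchone()[0]
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
conn.close()

print(db_encoding, stored == original)
```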

Anyways. That's just how I handle it, and I rarely run into problems-- only
really when dealing with some new file format from strange old systems, when
it's not obvious what encoding it's in due to poor documentation and naïve
implementations.
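
For those awkward cases, "trying various encodings until one works" might
look something like this sketch (Python 3 syntax; decode_with_candidates and
the candidate list are just illustrations, not a recommendation of any
particular order for your data):

```python
def decode_with_candidates(raw_bytes, candidates):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    A clean decode does not prove the guess was right; it only means the
    bytes were valid in that encoding.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# latin-1 accepts every byte sequence, so it only makes sense as a last resort.
candidates = ["utf-8", "cp1252", "latin-1"]

print(decode_with_candidates(b"caf\xc3\xa9", candidates))
print(decode_with_candidates(b"caf\xe9", candidates))
```

Strict encodings like UTF8 go first because they fail loudly on data that
isn't theirs; the permissive ones would happily "succeed" and hand you
garbage.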

-- 
Stephen Hansen
Development
Advanced Prepress Technology

shansen at advpubtech.com
(818) 748-9282