Help needed: Unicode and file format problem

Pekka Niiranen pekka.niiranen at wlanmail.com
Tue Sep 21 14:06:27 EDT 2004


Hi gurus,

I have stored an Excel 97 table in an ASCII CSV file.
The table contains pairs of values: a string and its replacement.
I parse the CSV file and run "search and replace" on other
files with sed and Perl scripts in a Cygwin environment.
Sometimes the replacements are non-ASCII strings
(Chinese characters, for example). This has not mattered so far
because I treat all files as lists of bytes, some of
which are replaced.

Now, however, the file on which I run search and replace has to
be in UTF-8 format. How can I run Unicode regular expressions in Python
when the CSV file contains ASCII mixed with some non-ASCII characters?
How can I work out the corresponding Unicode characters from bare bytes?
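
To make the last question concrete: as far as I understand, bare bytes can only be turned into Unicode characters if the encoding they were written in is known. A minimal sketch, assuming purely for illustration that the non-ASCII bytes are GB2312 (the real code page of my CSV file may well differ):

```python
# Assumption: the CSV bytes are GB2312-encoded; substitute the real encoding.
raw = b"name,\xd6\xd0\xce\xc4"       # ASCII mixed with non-ASCII bytes

text = raw.decode("gb2312")           # bytes -> Unicode string
utf8_bytes = text.encode("utf-8")     # Unicode string -> UTF-8 bytes

print(repr(text))
print(utf8_bytes)
```

If the wrong encoding is named, decode() either raises UnicodeDecodeError or silently yields the wrong characters, so the encoding has to come from knowledge of how the file was saved.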

In other words, I have to open the target file like this:
fileObj = codecs.open( "File_to_be_modified", "w", "utf-8" )
and then run Unicode regular expressions on its contents, where the
replacements read from the CSV file are bytes that must be written out
as UTF-8 strings.
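
Roughly what I have in mind is sketched below. It assumes the CSV bytes can be decoded with a known encoding (the "cp1252" default is only a guess), and that the search strings are literal text rather than patterns. The helper names load_pairs and replace_all are made up for illustration:

```python
import csv
import io
import re

def load_pairs(csv_bytes, csv_encoding="cp1252"):
    """Decode raw CSV bytes and return (search, replacement) pairs."""
    text = csv_bytes.decode(csv_encoding)
    reader = csv.reader(io.StringIO(text))
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

def replace_all(text, pairs):
    """Apply each search/replace pair to a Unicode string."""
    for search, replacement in pairs:
        # re.escape makes the search string literal; the lambda keeps any
        # backslashes in the replacement from being read as group references.
        text = re.sub(re.escape(search), lambda m, r=replacement: r, text)
    return text

# Sample CSV content with one non-ASCII replacement byte (0xE9 in cp1252).
pairs = load_pairs(b"colour,color\ncafe,caf\xe9\n")
result = replace_all("colour of the cafe", pairs)

# Encoding the result gives the bytes that belong in the UTF-8 target file.
utf8_output = result.encode("utf-8")
print(utf8_output)
```

Writing result through a file object opened with the "utf-8" codec would perform the same encoding step on the way out.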

I could try to read directly from Excel into Python through the COM interface,
or try to create a Python COM service that is called from Excel, but
I would hate to do that. I could also try switching to Excel 2000, which
supports UTF-8 as a save format, but there are other
issues (VBA code) involved.

-pekka-



