Unicode problem.... as always

Harvey Thomas hst at empolis.co.uk
Tue Jul 1 12:37:26 EDT 2003


Todd Jenista wrote:
> Sent: 01 July 2003 13:20
> To: python-list at python.org
> Subject: Unicode problem.... as always
> 
> 
> I have a parser I am building with python and, unfortunately, people
> have decided to put unicode characters in the files I am parsing.
> The parser seems to have a fit when I search for one \uXXXX symbol,
> and there is another unicode symbol in the file. In this case, a
> search and replace for © with a µ in the file causes the infamous
> ordinal error.
> My quick-fix, because they have good context, is to change them both
> to "UTF8", and then attempt to replace the UTF8 at the end with the
> original µ. The problem is that I am getting a µ when I try to
> re-insert using \u00b5 which is the UTF8 code.
> Words of wisdom would be greatly appreciated.
> -- 

I think the root of your problem lies in your remark that "people
have decided to put unicode characters in the files I am parsing".

There is no such thing as a file of Unicode characters. A file holds bytes.
There are, however, files in Unicode encodings (such as UTF-8) which, when
read appropriately via codecs.open, yield Unicode strings. Similarly, you
can write Unicode strings out to files in an appropriate encoding.
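
To illustrate the distinction, here is a small sketch. \u00b5 names a
Unicode code point (MICRO SIGN), not a UTF-8 code; its UTF-8 encoding is the
two bytes C2 B5. I am only guessing at the cause of your symptom, but if
those two bytes are later interpreted as Latin-1, you see two characters
where you expected one. The variable names are mine, for illustration only.

```python
# One Unicode character: U+00B5 MICRO SIGN.
micro = u"\u00b5"

# Encoding turns the character into bytes; in UTF-8 it takes two bytes.
utf8_bytes = micro.encode("utf-8")

# Decoding those bytes with the wrong codec (Latin-1 here, as an assumed
# example) yields two characters: U+00C2 followed by U+00B5.
misread = utf8_bytes.decode("latin-1")

# Decoding with the codec that was used for encoding restores the original.
roundtrip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes), len(misread), roundtrip == micro)
```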

You don't tell us much about what you want to do, but why don't you just 
read and write files in UTF-8 via the codecs module and do your
manipulations on Unicode strings?
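
Something along these lines might do, as a minimal sketch. I am assuming
your input really is UTF-8, and the function name, file names and sample
text are made up for the example. The file is decoded on the way in, the
replacement is done on Unicode strings, and the result is encoded on the
way out.

```python
import codecs
import os
import tempfile


def replace_in_utf8_file(src_path, dst_path, old, new):
    """Read src_path as UTF-8, replace old with new, write dst_path as UTF-8.

    Hypothetical helper for illustration; old and new are Unicode strings.
    """
    # codecs.open decodes the bytes in the file into a Unicode string.
    with codecs.open(src_path, "r", encoding="utf-8") as src:
        text = src.read()

    # The search and replace works on characters, not on encoded bytes.
    text = text.replace(old, new)

    # Writing through codecs.open encodes the string back to UTF-8 bytes.
    with codecs.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


# Demonstration with a throwaway file containing both symbols.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")

with codecs.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"5 \u00a9m and 7 \u00b5m\n")

# Replace the copyright sign (U+00A9) with the micro sign (U+00B5).
result = replace_in_utf8_file(src_path, dst_path, u"\u00a9", u"\u00b5")
```

If the files turn out to be in some other encoding, only the encoding
argument needs to change; the string manipulation stays the same.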
