8-bit cleanliness

Thomas Wouters thomas at xs4all.net
Sun Jun 11 12:54:53 EDT 2000


On Sun, Jun 11, 2000 at 05:55:36PM +0200, Rafael Cordones Marcos wrote:

> I just started to use Python a few days ago because I got fed up with so much convoluted Perl code. ;)
> Anyway, I have to read some text files and process the words appearing in them. I have discovered, to
> my surprise, that accents like (á, à, ...) get replaced by a 4-character code. Is there any class/module/option
> available to read *text* files with non-English characters in them?

Those non-ASCII characters do not get replaced by that 4-character code;
those non-ASCII characters *are* that 4-character code ;-) If you use repr()
on a string, each non-printable character is written as a backslash followed
by its octal value, so that the string can be reliably reproduced:

>>> s = "áááárgh"
>>> s			# which is the same as repr(s), in interactive mode
'\341\341\341\341rgh'

>>> print s
áááárgh

\341 is the 'accurate' representation of the 'á' character: it will always be
converted back to the same actual character, regardless of your font settings.
How that character is displayed depends on your font or your locale settings,
depending on what you use to view it ;-)
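
To make the "one character, four-character code" point concrete, here is a
minimal sketch. It assumes a Python 3 interpreter, which is newer than the
session above and keeps byte strings (bytes) separate from text (str), and it
assumes the data is encoded as Latin-1 (ISO-8859-1), where byte value 225
(octal 341) is 'á'. Note that Python 3 writes the escape in hex rather than
octal; it is still the same single byte.

```python
# Assumption: Python 3, with the data encoded as Latin-1 (ISO-8859-1).
raw = b"\341rgh"               # \341 is an octal escape for ONE byte

print(len(raw))                # 4 bytes: one for the accent, three for "rgh"
print(raw[0], oct(raw[0]))     # the same byte value, in decimal and in octal
print(repr(raw))               # unambiguous form; Python 3 uses a hex escape

# Decoding turns the bytes into text; the escape was only ever a notation.
text = raw.decode("latin-1")
print(len(text))               # still 4 characters
print(text == "\u00e1rgh")     # U+00E1 is LATIN SMALL LETTER A WITH ACUTE
```

Printing text itself shows the accented word, provided the terminal and
locale can display it.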

Just treat your strings as data, as you should anyway, and all will end up
fine. Just be sure not to use 'repr()' (or ``) when you really mean 'print'
or 'str()'.
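
For the original question (reading a text file and processing its words), the
following is a rough sketch of "treat your strings as data", again assuming
Python 3. The file name words.txt, its contents, and the Latin-1 encoding are
all made up for illustration; substitute whatever your files really use.

```python
import os
import tempfile

# Hypothetical sample file, written as Latin-1 bytes (an assumed encoding).
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "wb") as f:
    f.write(b"caf\351 \341rbol caf\351\n")

# Read the file as raw bytes and split on whitespace; no byte is altered.
with open(path, "rb") as f:
    words = f.read().split()

# Count the words while they are still plain data.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# repr() is for debugging: it shows every non-ASCII byte as an escape.
for word, n in sorted(counts.items()):
    print(repr(word), n)

# Decode only when readable text is wanted, e.g. for printing to a user.
readable = {word.decode("latin-1"): n for word, n in counts.items()}
```

The escaped form from repr() and the decoded form in readable describe the
same words; only the presentation differs.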

-- 
Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!



