[Tutor] Problems processing accented characters in ISO-8859-1 encoded texts

Josep M. Fontana josep.m.fontana at gmail.com
Thu Dec 23 10:25:01 CET 2010


I am working with texts that are encoded as ISO-8859-1. I have
included the following two lines at the beginning of my python script:

#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-

If I'm not mistaken, this should tell Python that accented characters
such as 'á', 'Á', 'ö' or 'è' should be treated as alphabetic
characters and therefore matched by a regular expression of the form
[a-zA-Z]. However, when I process my texts, all of the accented
characters are treated as non-alphabetic symbols. What am I doing
wrong?

I'm not including the whole script because I think the rest of the
code is irrelevant. All that's relevant (I think) is that I'm using
the regular expression '[^a-zA-Z\t\n\r\f\v]+' to match any run of
non-alphabetic characters, and that it matches 'á', 'Á', 'ö' and 'è'
as well as characters that really are non-alphabetic.
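
A minimal sketch that reproduces the behaviour described above. It is
written for Python 3, and the sample string is made up for illustration
rather than taken from the actual texts. The accented letters are
written as escapes so the example does not depend on the encoding of
the script file itself.

```python
import re

# The pattern from the script: one or more characters that are neither
# ASCII letters nor one of the listed whitespace characters.
NON_ALPHA = re.compile(r'[^a-zA-Z\t\n\r\f\v]+')

# Hypothetical sample: "está<TAB>Ávila<NEWLINE>schön"
sample = 'est\u00e1\t\u00c1vila\nsch\u00f6n'

# The range a-zA-Z only covers the 52 unaccented ASCII letters,
# so every accented letter ends up in the list of matches.
matches = NON_ALPHA.findall(sample)
print(matches)
```

Here the three accented letters are the only matches, since everything
else in the sample is either an ASCII letter or excluded whitespace.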

Has anybody else experienced this problem when working with texts
encoded as ISO-8859-1 or UTF-8? Is there any additional flag or
parameter that I should add to make the processing of these characters
as regular word characters possible?
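
For concreteness, here is a sketch of the kind of flag in question,
offered as one possible direction rather than a confirmed fix. It
assumes Python 3, assumes the text is first decoded from ISO-8859-1
bytes into a string, and uses the standard re.UNICODE flag (already the
default for string patterns in Python 3, so it is spelled out only for
clarity). The sample bytes are hypothetical.

```python
import re

# Bytes as they might be read from an ISO-8859-1 file (hypothetical sample).
raw = b'est\xe1 \xc1vila, sch\xf6n!'

# Decode to a Unicode string before doing any matching.
text = raw.decode('iso-8859-1')

# With Unicode-aware matching, \w covers accented letters too.
# Caveat: \w also accepts digits and the underscore, so this is
# "not a word character and not whitespace", not strictly "not a letter".
NON_WORD = re.compile(r'[^\w\s]+', re.UNICODE)

found = NON_WORD.findall(text)
print(found)
```

With this pattern only the comma and the exclamation mark are reported,
and the accented letters are left alone.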

Thanks in advance for your help.

Josep M.

