[Tutor] more encoding confusion

Jon Crump jjcrump at myuw.net
Sun Aug 5 20:28:55 CEST 2007


Kent, many thanks again, and thanks too to Paul at
http://tinyurl.com/yrl8cy.

That's very effective, thanks very much for the detailed explanation;
however, I'm a little surprised that it's necessary. I would have thought
that there would be some standard module that included a unicode equivalent
of the builtin method isupper().
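
For what it's worth, the string method isupper() is itself Unicode-aware
when called on a Unicode string; what the re module lacks is a character
class for "any uppercase letter". A regex-free check is therefore possible.
The following is only a minimal sketch, written in Python 3 syntax (an
assumption; the code in this thread is Python 2), and the function names
are made up for illustration:

```python
def leading_upper_count(line):
    """Count how many uppercase characters the line starts with."""
    count = 0
    for ch in line:
        # str.isupper() consults the Unicode database, so accented
        # capitals such as 'É' count as uppercase.
        if not ch.isupper():
            break
        count += 1
    return count


def begins_with_capitals(line, minimum=2):
    """True if the line opens with at least `minimum` uppercase characters."""
    return leading_upper_count(line) >= minimum
```

Note that an apostrophe is not uppercase, so a line such as L'ISLE stops
counting after the first letter; that case needs separate handling.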

On Fri, 3 Aug 2007, Kent Johnson wrote:

> > What sort of re test can I do to catch lines whose defining
> > characteristic is that they begin with two or more adjacent utf-8
> > encoded capital letters?
>
> First you have to decode the file to a Unicode string.
> Then build the set of matching characters and build a regex. For example,
> something like this:
>
> data = open('data.txt').read().decode('utf-8').splitlines()
>
> uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode)
>                    if unichr(i).isupper())

I modified uppers to include only the latin characters, and added the
apostrophe to catch placenames like L'ISLE.
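
One way such a modification might look is sketched here. This is not
necessarily what I ran: it assumes Python 3 (chr and range in place of
unichr and xrange), it assumes that "latin" can be approximated by
Unicode character names beginning with LATIN CAPITAL LETTER, and the
names latin_uppers, UPPER_RE and starts_with_capitals are hypothetical.

```python
import re
import sys
import unicodedata


def latin_uppers():
    """Collect every uppercase character whose Unicode name marks it as Latin."""
    chars = []
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        # unicodedata.name() returns the default '' for unnamed code points.
        if c.isupper() and unicodedata.name(c, '').startswith('LATIN CAPITAL LETTER'):
            chars.append(c)
    return ''.join(chars)


# The apostrophe is added to the character class so that L'ISLE matches.
UPPER_RE = re.compile("^[%s']{2,}" % latin_uppers())


def starts_with_capitals(line):
    """True if the line begins with two or more Latin capitals or apostrophes."""
    return UPPER_RE.match(line) is not None
```

One caveat with this sketch: because the apostrophe sits in the same
class as the letters, a line beginning with two apostrophes would also
match.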

> upperRe = u'^[%s]{2,}' % uppers
>
> for line in data:
>  if re.match(upperRe, line):
>
>
> With a tip of the hat to
> http://tinyurl.com/yrl8cy
>
> Kent


