[Tutor] UTF-8 title() string method

Kent Johnson kent37 at tds.net
Thu Jul 5 19:52:54 CEST 2007


Jon Crump wrote:
> On Wed, 4 Jul 2007, Kent Johnson wrote:

>> First, don't confuse unicode and utf-8.
> 
> Too late ;-) already pitifully confused.

This is a good place to start correcting that:
http://www.joelonsoftware.com/articles/Unicode.html

>> Second, convert the string to unicode and then title-case it, then 
>> convert back to utf-8 if you need to:
> 
> I'm having trouble figuring out where, in the context of my code, to 
> effect these translations.

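If s is your utf-8 string, then instead of s.title() use
s.decode('utf-8').title().encode('utf-8')
That way title() works on whole characters rather than on the individual
bytes of the utf-8 encoding.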
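
To make that concrete, here is a minimal sketch of the round trip. The helper
name title_utf8 and the sample place name are my own, made up for
illustration; they are not from your code. It assumes the input really is a
utf-8 encoded byte string.

```python
def title_utf8(s):
    """Title-case a utf-8 encoded byte string (hypothetical helper)."""
    # decode bytes -> unicode, title-case the characters, encode back to bytes
    return s.decode('utf-8').title().encode('utf-8')

# Sample data (an assumption): the utf-8 bytes for u'\xe9vreux'
raw = u'\xe9vreux'.encode('utf-8')

# Title-casing the bytes directly treats the two bytes of the accented
# letter as non-letters, so the 'v' after them gets capitalised instead.
print(repr(raw.title()))

# Going through unicode capitalises the accented first letter.
print(repr(title_utf8(raw)))
```

If most of your processing is text handling, it may be simpler to decode once
when you read each line and encode once when you write the result, rather than
converting around every call.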

> In parsing the text file, I depend on 
> matching a re:
> 
> if re.match(r'[A-Z]{2,}', line)
> 
> to identify and process the place name data. If I translate the line to 
> unicode, the re fails.

I don't know why that is; re works with unicode strings just as it does
with byte strings:
In [1]: import re
In [2]: re.match(r'[A-Z]{2,}', 'ABC')
Out[2]: <_sre.SRE_Match object at 0x12078e0>
In [3]: re.match(r'[A-Z]{2,}', u'ABC')
Out[3]: <_sre.SRE_Match object at 0x11c1f00>
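
In the context of a line-by-line parser, that could look roughly like the
sketch below. The function name is_place_name_line and the sample lines are
hypothetical; only the pattern comes from your message. It assumes each line
arrives as a utf-8 encoded byte string.

```python
import re

# Same pattern as in the quoted code: two or more ASCII capitals at the start
PLACE_NAME = re.compile(r'[A-Z]{2,}')

def is_place_name_line(raw_line):
    """Return True if the decoded line starts with two or more capitals.

    Hypothetical helper; raw_line is assumed to be utf-8 encoded bytes.
    """
    line = raw_line.decode('utf-8')  # bytes -> unicode before matching
    return PLACE_NAME.match(line) is not None

# Made-up sample lines, for illustration only
for raw in (b'ROUEN 12 May', b'at Rouen'):
    print(is_place_name_line(raw))
```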

Kent


