[Tutor] UTF-8 title() string method

Jon Crump jjcrump at myuw.net
Thu Jul 5 20:26:31 CEST 2007


On Thu, 5 Jul 2007, Kent Johnson wrote:

>> First, don't confuse unicode and utf-8.
>
> Too late ;-) already pitifully confused.

> This is a good place to start correcting that:
> http://www.joelonsoftware.com/articles/Unicode.html

Thanks for this, it's just what I needed!

> If s is your UTF-8 string, instead of s.title(), use
> s.decode('utf-8').title().encode('utf-8')

Eureka! I was trying to make the round-trip _way_ too complicated.
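
For anyone following along, here is a minimal sketch of the round trip Kent
describes. It is written in Python 3 syntax, where the UTF-8 data is a bytes
object, while the thread itself used Python 2 byte strings. The helper name
title_utf8 and the sample place name are made up for illustration and do not
come from the original data.

```python
def title_utf8(raw: bytes) -> bytes:
    """Title-case UTF-8 encoded bytes by round-tripping through Unicode."""
    # Decode first so that title() sees characters, not individual bytes.
    text = raw.decode('utf-8')
    # Re-encode so the caller gets UTF-8 bytes back.
    return text.title().encode('utf-8')


# Hypothetical sample input with accented letters.
place = 'école de saint-étienne'.encode('utf-8')
result = title_utf8(place)
print(result.decode('utf-8'))  # prints: École De Saint-Étienne
```

The point of decoding first is that title() then treats accented characters
as letters, so they are capitalized like any other.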

>> to identify and process the place name data. If I translate the line to 
>> unicode, the re fails.
>
> I don't know why that is, re works with unicode strings:
> In [1]: import re
> In [2]: re.match(r'[A-Z]{2,}', 'ABC')
> Out[2]: <_sre.SRE_Match object at 0x12078e0>
> In [3]: re.match(r'[A-Z]{2,}', u'ABC')
> Out[3]: <_sre.SRE_Match object at 0x11c1f00>

Of course. I was misinterpreting why things were failing. It wasn't the 
regex; it was the decode()/encode() round-trip (a powerful argument for 
getting familiar with try/except error handling!).
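
As a rough illustration of that last point, wrapping the decode step in
try/except shows which step fails instead of leaving it to guesswork. This is
only a sketch in Python 3 syntax: the thread does not say which call actually
failed, and the name safe_title and the fall-back of returning the input
unchanged are assumptions made for the example.

```python
def safe_title(raw: bytes, encoding: str = 'utf-8') -> bytes:
    """Title-case encoded bytes, reporting decode failures instead of crashing."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Report the failing step; returning the input untouched is an
        # arbitrary choice made for this sketch.
        print('could not decode %r: %s' % (raw, err))
        return raw
    return text.title().encode(encoding)


good = safe_title('café royal'.encode('utf-8'))
# b'caf\xe9' is Latin-1, not valid UTF-8, so the decode step raises.
bad = safe_title(b'caf\xe9')
```

With the error caught and printed at the decode step, it is clear that the
regex is not the culprit.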

Again, many thanks for the education!

Jon

