[Tutor] UTF-8 title() string method

Thu Jul 5 19:29:54 CEST 2007

On Wed, 4 Jul 2007, Kent Johnson wrote:

> First, don't confuse unicode and utf-8.

Too late ;-) already pitifully confused.

> Second, convert the string to unicode and then title-case it, then convert 
> back to utf-8 if you need to:
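
A minimal sketch of the round trip described above, assuming the text is UTF-8 encoded. It is written for Python 3, where byte strings and text strings are separate types, and the helper name title_utf8 is made up for illustration:

```python
def title_utf8(raw):
    """Title-case a UTF-8 encoded byte string and return UTF-8 bytes."""
    text = raw.decode('utf-8')      # bytes -> unicode text
    titled = text.title()           # case mapping works on characters, not bytes
    return titled.encode('utf-8')   # unicode text -> bytes


name = 'ALENÇON'.encode('utf-8')
print(title_utf8(name).decode('utf-8'))
```

The point of the sketch is only the order of operations: decode, change case, encode.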

I'm having trouble figuring out where, in the context of my code, to 
effect these translations. In parsing the text file, I depend on matching 
a re:

if re.match(r'[A-Z]{2,}', line):

to identify and process the place name data. If I translate the line to 
unicode, the re fails.
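
One place the conversion could go, sketched under assumptions rather than taken from the program below: decode each line as it is read, run the same pattern against the decoded text, and title-case only the lines that match. The sketch is Python 3, the sample lines are copied from the input at the bottom, and heading_re and titled_headings are illustrative names:

```python
import re

# Same pattern as in the program: at least two capitals at the start of a line.
heading_re = re.compile(r'[A-Z]{2,}')


def titled_headings(raw_lines, encoding='utf-8'):
    """Yield title-cased place-name headings from encoded input lines."""
    for raw in raw_lines:
        line = raw.decode(encoding)       # decode first, then match on text
        if heading_re.match(line):
            yield line.strip().title()


sample = [
    'ALENÇON, Normandie.\n'.encode('utf-8'),
    '1199.      November 3.\n'.encode('utf-8'),
    'ANGOULÊME, Angoumois.\n'.encode('utf-8'),
]
for heading in titled_headings(sample):
    print(heading)
```

One caveat with this approach: title() capitalizes the first letter after any non-letter, so 'ANDELY (le Petit)' would come out as 'Andely (Le Petit)' and 'ANVERS-LE-HOMONT' as 'Anvers-Le-Homont'.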

The whole program isn't very long, so at the risk of embarrassing myself, 
I'm including the whole ugly, kludgy thing below. I hope I'm not hereby 
violating any conventions of the list. Kent will recognize the ranges() 
function (which works a treat, by the way, and was very instructive, 
thanks).

In addition to the title case problem, if anyone has pointers on how to 
make this look a little less like a Frankenstein's monster (all improbably 
bolted together), such tutelage would be gratefully received.

The end product is intended to be a simple XML file and, apart from the 
title case problem, it works well enough. A sample of the text file input 
is included at the bottom.

#!/usr/bin/python

import re

input = open('sample.txt', 'r')
text = input.readlines()

months = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 
'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 
'November':11, 'December':12}

def ranges(data):
     i = iter(data)
     first = last = i.next()
     try:
         while True:
             next = i.next()
             if next > last+1:
                 yield (first, last)
                 first = last = next
             else:
                 last = next
     except StopIteration:
         yield (first, last)

def parse_month_string(monthstring, year, title):
     res=[]
     monthstring_regex = re.compile('^(\w+)\s+(\d.*)\.$')
     monthstring_elements = monthstring_regex.match(monthstring)
     month = monthstring_elements.group(1)
     days = ranges([int(x)
                    for x in re.split(',', monthstring_elements.group(2))])
     for start, end in days:
         if start == end:
             res.append('<event start="%s-%02d-%02d" title="%s" />' %
                        (year, months[month], start, title.strip()))
         else:
             res.append('<event start="%s-%02d-%02d" end="%s-%02d-%02d" '
                        'isDuration="true" title="%s" />' %
                        (year, months[month], start, year,
                         months[month], end, title.strip()))
     return res

def parse_year_string(yearstring, title):
     res=[]
     yearstring_regex = re.compile('(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$')
     yearstring_elements = yearstring_regex.match(yearstring)
     year = yearstring_elements.group(1)
     month = yearstring_elements.group(2)
     days = ranges([int(x)
                    for x in re.split(',', yearstring_elements.group(3))])

     for start, end in days:
         if start == end:
             res.append('<event start="%s-%02d-%02d" title="%s" />' %
                        (year, months[month], start, title.strip()))
         else:
             res.append('<event start="%s-%02d-%02d" end="%s-%02d-%02d" '
                        'isDuration="true" title="%s" />' %
                        (year, months[month], start, year,
                         months[month], end, title.strip()))
     return res

def places(data):
     place=[data[0]]
     for line in data:
         if re.match(r'[A-Z]{2,}', line):
             if place:
                 yield place
                 place = []
             place.append(line.strip())
         elif re.match(r'(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$', line):
             yearstring_regex = re.compile('(\d\d\d\d)\.\s+(\w+)\s+(\d.*)\.$')
             yearstring_elements = yearstring_regex.match(line)
             year = yearstring_elements.group(1)
             title = place[0]
             place.append(parse_year_string(line, title))
         elif re.match(r'^(\w+)\s+(\d.*)\.$', line):
             place.append(parse_month_string(line, year, title))
     yield place

for x in places(text):
     for y in x[1:]:
         for z in y:
             print z

#############
here begins sample of the text file input:

ABERGAVENNY, Monmouthshire.
1211.    March 12.
ALENÇON, Normandie.
1199.      November 3.
1200.      September 6, 7.
1201.      July 18.
1202.      February 20, 21.
August 8, 9, 10, 12.
September 29.
1202.     October 3, 29.
December 7.
1203.      January 15, 16, 17, 18, 19, 25.
August 11, 12, 13, 14, 15.
ALLERTON, Yorkshire.
1201.     February 28.
1212.   June 29.
September 1, 2, 6.
1213.   February 6, 7.
September 16.
1216.  January 6.
ANDELY (le Petit), Normandie.
1199.       August 18, 19, 28, 29, 30, 31.
September 1.
October 21, 26, 27.
1200.      January 11.
May 11, 12, 17.
May 18, 19, 20, 21, 22,  23, 24, 25, 26.
1201.      June 9, 10, 11, 25, 26, 27.
October 23, 24, 25, 26, 28.
December 15.
1202.      March 27, 28, 29.
April 4, 22, 23, 24, 25, 26, 28.
ANGERS, Anjou.
1200.     June 18, 19, 20, 21.
1202.     September 4, 15.
1206.      September, 8,  9,  10, 11, 12, 13, 20, 21.
1214.  June 17, 18.
ANGOULÊME, Angoumois.
1200.     August 26.
1202.      February 4, 5.
1214.  March 13, 14, 15.
April 5, 6.
July 28, 29, 30.
August 17, 18.
ANVERS-LE-HOMONT, Maine.
1199.     September 18.