[Python-Dev] PEP 277 (unicode filenames): please review

Martin v. Loewis martin@v.loewis.de
14 Aug 2002 20:35:41 +0200


Jack Jansen <Jack.Jansen@oratrix.com> writes:

> Why is this hard work? I would guess that a simple table lookup would
> suffice, after all there are only a finite number of unicode
> characters that can be split up, and each one can be split up in only
> a small number of ways.

Canonical decomposition requires more than that: you not only need to
apply the canonical decomposition mapping, but also need to put the
resulting characters into canonical order (if more than one combining
character applies to a base character).
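
For illustration, a minimal sketch of that reordering step (this is an
assumption-laden sketch in present-day Python, not code from the PEP: it
presumes the decomposition mapping has already been applied and only
reorders runs of combining marks by their canonical combining class,
using the unicodedata module):

    import unicodedata

    def canonical_order(chars):
        # chars: a list of already-decomposed characters
        out = []
        i = 0
        while i < len(chars):
            if unicodedata.combining(chars[i]) == 0:
                out.append(chars[i])        # base character, copy through
                i += 1
            else:
                # collect the run of combining marks after the base and
                # stable-sort it by canonical combining class
                j = i
                while j < len(chars) and unicodedata.combining(chars[j]):
                    j += 1
                out.extend(sorted(chars[i:j], key=unicodedata.combining))
                i = j
        return out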

In addition, a naïve implementation will consume large amounts of
memory. Hangul decomposition is better done algorithmically, as we
are talking about 11172 precombined characters for Hangul alone.
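
The decomposition of a precomposed Hangul syllable can be computed
arithmetically from its code point, so none of those 11172 characters
needs a table entry. A sketch of the standard arithmetic from the
Unicode specification (present-day Python, written here purely as an
illustration):

    def decompose_hangul(ch):
        SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
        VCount, TCount = 21, 28
        NCount = VCount * TCount        # 588
        SCount = 19 * NCount            # 11172 precomposed syllables
        s = ord(ch) - SBase
        if not 0 <= s < SCount:
            return ch                   # not a precomposed Hangul syllable
        L = LBase + s // NCount                 # leading consonant jamo
        V = VBase + (s % NCount) // TCount      # vowel jamo
        T = TBase + s % TCount                  # trailing consonant, if any
        parts = chr(L) + chr(V)
        if T != TBase:                  # trailing index 0 means none
            parts += chr(T)
        return parts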

> Wouldn't something like
> for c in input:
> 	if not canbestartofcombiningsequence.has_key(c):
> 		output.append(c)
>       nlookahead = MAXCHARSTOCOMBINE
>       while nlookahead > 1:
> 		attempt = lookahead next nlookahead bytes from input
> 		if combine.has_key(attempt):
> 			output.append(combine[attempt])
> 			skip the lookahead in input
> 			break
> 	else:
> 		output.append(c)
> do the trick, if the two dictionaries are initialized intelligently?

No, that doesn't do canonical ordering. There is a lot more to
normalization; the hard work is really in understanding what has to be
done.
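
One way to see what the lookahead alone misses: canonically equivalent
text can arrive with its combining marks in different orders, and only
reordering by combining class makes the two spellings come out the
same. A small check, using the unicodedata.normalize() that is
available in current Python (offered here only as an illustration):

    import unicodedata

    # acute (combining class 230) and dot below (class 220), both orders
    a = "e\u0301\u0323"
    b = "e\u0323\u0301"

    # A longest-match table lookup on the raw order would turn `a` into
    # U+00E9 (e-acute) followed by the dot below, and `b` into U+1EB9
    # (e-dot-below) followed by the acute: two different results for
    # canonically equivalent input. Canonical ordering makes both
    # normalize identically:
    assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)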

Regards,
Martin