ascii to latin1

Richie Hindle richie at entrian.com
Tue May 9 09:48:09 EDT 2006


[Serge]
> I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by the Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.

I only have to look at the length of the document to understand it's not
so easy.  8-)  I'll take your two-line normalization function any day.

> IMHO It is perfectly acceptable to declare you don't interpret those
> symbols.  After all they are called *compatibility* code points. I
> tried the "quarter" symbol: Google and MSN don't interpret it. Yahoo
> doesn't support it at all. [...]
> if you have character "digit two" followed by "superscript
> digit two"; they look like 2 power 2, but NFKD will convert them into
> 22 (twenty two), which is wrong. So if you want to use NFKD for search
> you will have to preprocess your data, for example inserting a space
> between the twos.

I'm not sure it's obvious that it's wrong.  How might a user enter
"2<superscript digit 2>" into a search box?  They might enter a genuine
"<superscript digit 2>" in which case you're fine, or they might enter
"2^2" in which case it depends how you deal with punctuation.  They
probably won't enter "2 2".
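
For what it's worth, the collapse Serge describes is easy to reproduce
with the standard unicodedata module.  A minimal sketch (Python 3
syntax; the variable names are only for illustration):

```python
import unicodedata

# DIGIT TWO followed by SUPERSCRIPT TWO (U+00B2), i.e. "2 squared".
squared = "2\u00b2"

# NFKD applies the compatibility decomposition, so the superscript
# becomes a plain DIGIT TWO and the pair now reads as "22".
folded = unicodedata.normalize("NFKD", squared)

# A lone superscript two typed into a search box folds the same way,
# so it still matches inside the folded text.
query = unicodedata.normalize("NFKD", "\u00b2")
```

Whether "22" counts as a wrong answer depends on what you expect the
user to type, which is the point above.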

It's certainly not wrong in the case of ligatures like LATIN SMALL
LIGATURE FI - it's quite likely that the user will search for "fish"
rather than finding and (somehow) typing the ligature.
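
A rough sketch of the ligature case, again in Python 3 syntax.  The
fold() helper is hypothetical, and lowercasing is my own assumption
about what a search index would want, not something from this thread:

```python
import unicodedata

def fold(text):
    """Hypothetical helper: compatibility-decompose, then lowercase."""
    return unicodedata.normalize("NFKD", text).lower()

# "fish" spelled with U+FB01 LATIN SMALL LIGATURE FI in the indexed text.
indexed = fold("\ufb01sh")

# What the user is likely to type into the search box.
typed = fold("Fish")
```

Both sides fold to the same plain "fish", so the search matches without
the user ever having to type the ligature.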

Some superscripts are similar - I imagine there's a code point for the
"superscript st" in "1st" (though I can't find it offhand) and you'd
definitely want to convert that to "st".

NFKD normalization doesn't convert VULGAR FRACTION ONE QUARTER into a
plain ASCII "1/4" - I wonder whether there's some way to do that?

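One possible approach, as a sketch only: NFKD decomposes the vulgar
fractions into digit, FRACTION SLASH (U+2044), digit, so replacing that
slash with an ordinary "/" afterwards gets you the ASCII form.  The
fold_fractions() name is made up for illustration (Python 3 syntax):

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def fold_fractions(text):
    """Hypothetical helper: NFKD, then swap FRACTION SLASH for "/"."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.replace(FRACTION_SLASH, "/")

# NFKD on its own leaves U+2044 between the digits, not an ASCII slash.
quarter = unicodedata.normalize("NFKD", "\u00bc")

ascii_quarter = fold_fractions("\u00bc")
half_cup = fold_fractions("\u00bd cup")
```

This has the same adjacency problem as the two twos: a "1" written
directly before the quarter sign would come out as "11/4", so some
preprocessing might still be needed.
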
> After all they are called *compatibility* code points.

Yes, compatible with what the user types.  8-)

-- 
Richie Hindle
richie at entrian.com


