[I18n-sig] Autoguessing charset for Unicode strings?

Tim Peters tim.one@home.com
Tue, 19 Jun 2001 20:32:19 -0400


[Machin, John]
> maybe not so expensive, depending on (a) what's in C and what's in
> Python and (b) function call overhead and (c) what proportion of text
> needs which character set ...
>
> loop once through your Unicode;
> 	if there were any chars with ordinal > 255, then use UTF-8
> 	elif there were any > 127, then use iso-8859-1
> 	else use ASCII

I don't know whether that algorithm makes sense, but it's efficient enough
in Python:

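    biggest = max(map(ord, some_unicode_string))
    if biggest > 255:
        charset = "utf-8"        # some char doesn't fit in 8 bits
    elif biggest > 127:
        charset = "iso-8859-1"   # everything fits in Latin-1
    else:
        charset = "ascii"        # pure 7-bit text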

So the bulk of the work (the map and the max) runs at C speed.
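
For anyone who wants to try it, here is a minimal sketch of the same idea wrapped in a function. It assumes Python 3 (where every str is Unicode and max() accepts a default), the name guess_charset is made up for illustration, and treating the empty string as ASCII is my own assumption, since max() of an empty sequence would otherwise raise ValueError.

```python
def guess_charset(text):
    """Return the narrowest of 'ascii', 'iso-8859-1', 'utf-8' that fits text.

    guess_charset is a hypothetical helper name, not something from the
    standard library.
    """
    # map() and max() both run in C; default=0 covers the empty string
    # (an assumption: empty text is treated as ASCII).
    biggest = max(map(ord, text), default=0)
    if biggest > 255:
        return "utf-8"
    elif biggest > 127:
        return "iso-8859-1"
    else:
        return "ascii"


for sample in ["plain", "caf\u00e9", "\u20ac5", ""]:
    charset = guess_charset(sample)
    # Sanity check: the chosen codec must encode the sample without error.
    sample.encode(charset)
    print(repr(sample), charset)
```

Whether picking the narrowest charset is the right policy is a separate question; this only shows that the scan itself is cheap.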