[Python-Dev] PEP 393 Summer of Code Project

Antoine Pitrou solipsis at pitrou.net
Tue Aug 23 13:39:02 CEST 2011


> >> Is it really extremely common to have strings that are mostly-ASCII but
> >> not completely ASCII? I would agree that pure ASCII strings are
> >> extremely common.
> > Mostly ascii is pretty common for western-european languages (French, for
> > instance, is probably 90 to 95% ascii). It's also a risk in english, when
> > the writer "correctly" spells foreign words (résumé and the like).
> 
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case).

Well, it's:
- all natural languages based on a variant of the latin alphabet
- but also, XML, JSON, HTML documents...
- and log files...
- in short, any kind of parsable format which is structurally ASCII but
can contain arbitrary unicode

So I would say *most* unicode data out there is mostly-ASCII, even when
it has Japanese characters in it. The rationale is that most unicode
data processed by computers is structured.
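
To make the "mostly-ASCII" claim concrete, here is a minimal sketch that
measures the share of ASCII code points in a string. The helper name
ascii_fraction and the sample document are made up for illustration; they
are not taken from CPython or from any measurement in this thread.

```python
def ascii_fraction(text: str) -> float:
    """Return the share of code points in text that are ASCII (< 128)."""
    if not text:
        # Treat the empty string as trivially pure ASCII.
        return 1.0
    return sum(1 for ch in text if ord(ch) < 128) / len(text)


# Hypothetical sample: a small JSON document carrying a Japanese value.
sample = '{"id": 1, "lang": "ja", "title": "日本語"}'

if __name__ == "__main__":
    print(f"{ascii_fraction(sample):.2f}")
```

Even though the payload is Japanese, the structural characters (braces,
quotes, keys) keep the ASCII share of this sample above 90%.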

This optimization was done when trying to improve the speed of text I/O.
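
As a rough illustration only, and assuming the optimization in question
is an ASCII fast path in the UTF-8 decoder, the idea can be sketched in
Python as below. The function decode_mostly_ascii is hypothetical; the
real decoder is written in C and is structured differently.

```python
def decode_mostly_ascii(data: bytes) -> str:
    """Decode UTF-8, handling runs of ASCII bytes in bulk."""
    parts = []
    i, n = 0, len(data)
    while i < n:
        # Fast path: locate the end of the current run of ASCII bytes
        # and decode the whole run in one step.
        j = i
        while j < n and data[j] < 0x80:
            j += 1
        if j > i:
            parts.append(data[i:j].decode("ascii"))
            i = j
            continue
        # Slow path: a single multi-byte sequence whose length is
        # derived from the lead byte. Invalid or truncated input makes
        # the decode call below raise UnicodeDecodeError.
        lead = data[i]
        if lead >= 0xF0:
            length = 4
        elif lead >= 0xE0:
            length = 3
        else:
            length = 2
        parts.append(data[i:i + length].decode("utf-8"))
        i += length
    return "".join(parts)
```

On mostly-ASCII input nearly all of the work happens in the fast path,
which is why such a special case can pay off.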

> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.
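
For reference, the three target sizes mentioned in the quoted text follow
from the highest code point in the string. A minimal sketch of that
selection rule, using a hypothetical helper name (storage_width) rather
than anything from the actual implementation:

```python
def storage_width(text: str) -> int:
    """Bytes per character for PEP 393 style storage: 1, 2 or 4."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1  # fits in one byte per character (Latin-1 range)
    if highest < 0x10000:
        return 2  # fits in two bytes per character (BMP)
    return 4      # needs four bytes per character
```

A single non-ASCII character outside the one-byte range is enough to
widen the whole string, so a decoder writing into such a buffer has to
target one of the three widths.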

Do you have three copies of the UTF-8 decoder already, or do you use a
stringlib-like approach?

Regards

Antoine.



