RE Module Performance

Thu Jul 25 06:14:46 EDT 2013

On Thu, Jul 25, 2013 at 7:27 PM,  <wxjmfauth at gmail.com> wrote:
> A coding scheme works with a unique set of characters (the repertoire),
> and the implementation (the programming) works with a unique set
> of encoded code points. The critical step is the path
> {unique set of characters} <--> {unique set of encoded code points}

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

> In the byte string world, this step is a no-op.
>
> In Unicode, it is exactly the purpose of a "utf" to achieve this
> step. "utf": a confusing name covering at the same time the
> process and the result of the process.
> A "utf chunk", a series of bits (not bytes), holds intrinsically
> the information about the character it is representing.

No, now you're looking at another level: how to store code points in
memory. That demands that they be stored as bits and bytes, because
that's how computer memory works.
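
To make the two levels concrete, here's a rough sketch (Python 3
assumed). The code point is a single integer; each UTF is just one
possible way of laying that integer out as bytes. The helper name
`layouts` is made up for this example, not anything from the stdlib.

```python
def layouts(ch):
    """Return the byte layout of one character under three UTFs.

    The explicit big-endian codec names are used so that no byte
    order mark gets prepended to the output.
    """
    return {name: ch.encode(name)
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

for ch in ("A", chr(123456)):
    for name, data in layouts(ch).items():
        # Same code point every time; only the stored bytes differ.
        print("U+%04X" % ord(ch), name, data.hex())
        # Decoding recovers the original code point from any layout.
        assert ord(data.decode(name)) == ord(ch)
```

Whichever layout you pick, the mapping between character and code
point (the Unicode part) is untouched.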

> utf32: as I pointed out many times. You are already using it (maybe
> without knowing it). Where? in fonts (OpenType technology),
> rendering engines, pdf files. Why? Because there is no other
> way to do it better.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.
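
A small sketch of that trade-off, assuming Python 3 and a sample
string of my own choosing: fixed-width storage makes indexing trivial
arithmetic, but plain ASCII text costs four times what it would in
UTF-8.

```python
text = "RE Module Performance"   # sample text, all ASCII

# The "-le" codec name avoids the 4-byte byte order mark that the
# plain "utf-32" codec would prepend.
utf32 = text.encode("utf-32-le")
utf8 = text.encode("utf-8")

print(len(text), "characters:", len(utf8), "bytes as UTF-8,",
      len(utf32), "bytes as UTF-32")

# The upside of fixed width: character i always lives at byte offset 4*i.
i = 3
assert utf32[4 * i:4 * i + 4].decode("utf-32-le") == text[i]
```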

> See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

I refuse to click this link. Give us a link to the
python-list at python.org archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

> If you are not understanding my "editor" analogy. One other
> proposed exercise. Build/create a flexible iso-8859-X coding
> scheme. You will quickly understand where the bottleneck
> is.
> Two working ways:
> - stupidly with an editor and your fingers.
> - lazily with a sheet of paper and your head.

What has this to do with the editor?

> There is a clear difference between FSR and ucs-4/utf32.

Yes. Memory usage. PEP 393 strings use 1, 2, or 4 bytes per character,
chosen by the widest code point in the string, so they might take up
half or even a quarter of what they'd take up in fixed UTF-32. Other
than that, there's no difference.
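
If you want to see it for yourself, here's a quick sketch. It assumes
CPython 3.3 or later, since sys.getsizeof reports
implementation-specific sizes; `bytes_per_char` is a throwaway helper
written for this example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Marginal storage per character for a string made only of `ch`."""
    # Differencing two lengths cancels out the fixed per-object header,
    # leaving only the per-character cost.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# One character from each width class: ASCII, BMP above Latin-1, astral.
for ch in ("a", "\u0100", "\U0001e240"):
    print("U+%04X" % ord(ch), bytes_per_char(ch), "byte(s) per character")
```

Indexing and slicing behave the same in all three cases; only the
storage width changes.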

ChrisA