RE Module Performance
wxjmfauth at gmail.com
Thu Jul 25 15:07:45 EDT 2013
On Thursday, 25 July 2013 12:14:46 UTC+2, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 7:27 PM, <wxjmfauth at gmail.com> wrote:
> > A coding scheme works with a unique set of characters (the repertoire),
> > and the implementation (the programming) works with a unique set
> > of encoded code points. The critical step is the path
> > {unique set of characters} <--> {unique set of encoded code points}
>
> That's called Unicode. It maps the character 'A' to the code point
> U+0041 and so on. Code points are integers. In fact, they are very
> well represented in Python that way (also in Pike, fwiw):
>
> >>> ord('A')
> 65
> >>> chr(65)
> 'A'
> >>> chr(123456)
> '\U0001e240'
> >>> ord(_)
> 123456
>
> > In the byte string world, this step is a no-op.
> >
> > In Unicode, it is exactly the purpose of a "utf" to achieve this
> > step. "utf": a confusing name covering at the same time the
> > process and the result of the process.
> > A "utf chunk", a series of bits (not bytes), holds intrinsically
> > the information about the character it is representing.
>
> No, now you're looking at another level: how to store codepoints in
> memory. That demands that they be stored as bits and bytes, because PC
> memory works that way.
>
> > utf32: as I have pointed out many times, you are already using it (maybe
> > without knowing it). Where? In fonts (OpenType technology),
> > rendering engines, pdf files. Why? Because there is no
> > better way to do it.
>
> And UTF-32 is an excellent system... as long as you're okay with
> spending four bytes for every character.
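
The "four bytes for every character" cost can be checked directly by encoding a few characters. This is only a minimal sketch: the sample characters and the helper name encoded_sizes are illustrative choices, not taken from the thread, and the "-le" codec variants are used so that no byte order mark is counted.

```python
# Encoded size in bytes of a few sample characters under three UTFs.
# The "-le" codecs are used so that no byte order mark is prepended.
CODECS = ("utf-8", "utf-16-le", "utf-32-le")

# Illustrative samples: ASCII, Latin-1, BMP (en dash), astral code point.
SAMPLES = ["a", "\xe9", "\u2013", "\U0001e240"]


def encoded_sizes(ch):
    """Return the (utf-8, utf-16, utf-32) byte lengths of one character."""
    return tuple(len(ch.encode(codec)) for codec in CODECS)


for ch in SAMPLES:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

The last element of each tuple is the UTF-32 size, which stays the same whatever the code point; the other two vary with it.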
>
> > See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0
>
> I refuse to click this link. Give us a link to the
> python-list at python.org archive, or gmane, or something else more
> suited to the audience. I'm not going to Google Groups just to figure
> out what you're saying.
>
> > If you do not understand my "editor" analogy, here is another
> > proposed exercise: build a flexible iso-8859-X coding
> > scheme. You will quickly understand where the bottleneck
> > is.
> > Two working ways:
> > - stupidly, with an editor and your fingers.
> > - lazily, with a sheet of paper and your head.
>
> What has this to do with the editor?
>
> > There is a clear difference between FSR and ucs-4/utf32.
>
> Yes. Memory usage. PEP 393 strings might take up half or even a
> quarter of what they'd take up in fixed UTF-32. Other than that,
> there's no difference.
>
> ChrisA
--------
Let's start with a simple string, \textemdash or \textendash:
>>> import sys
>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26
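
A single-character string is dominated by the fixed object header, so the two figures above mix header size and character width. One way to separate them is to measure how much the size grows as the string gets longer. This is a rough sketch, assuming CPython 3.3 or later (PEP 393); the helper name bytes_per_char and the sample characters are hypothetical, and the absolute sizes printed will differ between builds and platforms.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string by n copies of ch.

    Subtracting the size of the one-character string cancels the fixed
    object header, leaving only the cost of the n extra characters.
    """
    short = ch
    long = ch * (n + 1)
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n


# Illustrative samples: ASCII, Latin-1, BMP (en dash), astral code point.
for label, ch in [("ASCII", "a"), ("Latin-1", "\xe9"),
                  ("BMP", "\u2013"), ("astral", "\U0001e240")]:
    # Print the size of the one-character string and the per-character growth.
    print(label, sys.getsizeof(ch), bytes_per_char(ch))
```

On an interpreter that implements PEP 393, the per-character figure should come out as 1, 1, 2 and 4 bytes respectively, while the one-character sizes mostly reflect the header.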
jmf