RE Module Performance

Thu Jul 25 15:07:45 EDT 2013

On Thursday, 25 July 2013 12:14:46 UTC+2, Chris Angelico wrote:
> On Thu, Jul 25, 2013 at 7:27 PM,  <wxjmfauth at gmail.com> wrote:
> 
> > A coding scheme works with a unique set of characters (the repertoire),
> > and the implementation (the programming) works with a unique set
> > of encoded code points. The critical step is the path
> > {unique set of characters} <--> {unique set of encoded code points}
>
> That's called Unicode. It maps the character 'A' to the code point
> U+0041 and so on. Code points are integers. In fact, they are very
> well represented in Python that way (also in Pike, fwiw):
>
> >>> ord('A')
> 65
> >>> chr(65)
> 'A'
> >>> chr(123456)
> '\U0001e240'
> >>> ord(_)
> 123456
>
> > In the byte string world, this step is a no-op.
> >
> > In Unicode, it is exactly the purpose of a "utf" to achieve this
> > step. "utf": a confusing name covering at the same time the
> > process and the result of the process.
> > A "utf chunk", a series of bits (not bytes), intrinsically holds
> > the information about the character it is representing.
>
> No, now you're looking at another level: how to store codepoints in
> memory. That demands that they be stored as bits and bytes, because PC
> memory works that way.
>
> > utf32: as I have pointed out many times, you are already using it
> > (maybe without knowing it). Where? In fonts (OpenType technology),
> > rendering engines, pdf files. Why? Because there is no other
> > way to do it better.
>
> And UTF-32 is an excellent system... as long as you're okay with
> spending four bytes for every character.
>
> > See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0
>
> I refuse to click this link. Give us a link to the
> python-list at python.org archive, or gmane, or something else more
> suited to the audience. I'm not going to Google Groups just to figure
> out what you're saying.
>
> > If you do not understand my "editor" analogy, here is one other
> > proposed exercise. Build/create a flexible iso-8859-X coding
> > scheme. You will quickly understand where the bottleneck
> > is.
> > Two working ways:
> > - stupidly with an editor and your fingers.
> > - lazily with a sheet of paper and your head.
>
> What has this to do with the editor?
>
> > There is a clear difference between FSR and ucs-4/utf32.
>
> Yes. Memory usage. PEP 393 strings might take up half or even a
> quarter of what they'd take up in fixed UTF-32. Other than that,
> there's no difference.
>
> ChrisA

--------


Let's start with a simple string, \textemdash or \textendash:

>>> import sys
>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26
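
A single-character measurement mixes the fixed per-object overhead with the per-character storage. As a rough sketch, assuming CPython 3.3 or later (PEP 393 strings), the two can be separated by comparing the sizes of two long strings made of the same character. The helper name bytes_per_char and the sample characters are illustrative, not from the thread, and absolute getsizeof values differ between builds and platforms.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Marginal storage per character, in bytes (hypothetical helper).

    Subtracting the size of an n-character string from the size of a
    2n-character string cancels the fixed header and leaves only the
    cost of n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

if __name__ == "__main__":
    # ASCII letter, Latin-1 letter, en dash (U+2013), and the
    # astral code point used earlier in the thread (U+1E240).
    for ch in ('a', '\u00e9', '\u2013', '\U0001e240'):
        print(ascii(ch), bytes_per_char(ch))
```

Under these assumptions the en dash costs two bytes per character and 'a' costs one; the rest of the gap between the two getsizeof results above is header overhead, which is paid once per string regardless of its length.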

jmf
