RE Module Performance

wxjmfauth at gmail.com
Fri Jul 26 09:19:42 EDT 2013


On Thursday, 25 July 2013 22:45:38 UTC+2, Ian wrote:
> On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
> > On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:
> >
> >> On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
> >> <steve+comp.lang.python at pearwood.info> wrote:
> >>> On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
> >>>> "To conserve memory, Emacs does not hold fixed-length 22-bit numbers
> >>>> that are codepoints of text characters within buffers and strings.
> >>>> Rather, Emacs uses a variable-length internal representation of
> >>>> characters, that stores each character as a sequence of 1 to 5 8-bit
> >>>> bytes, depending on the magnitude of its codepoint[1]. For example,
> >>>> any ASCII character takes up only 1 byte, a Latin-1 character takes up
> >>>> 2 bytes, etc. We call this representation of text multibyte.
> >>>
> >>> Well, you've just proven what Vim users have always suspected: Emacs
> >>> doesn't really exist.
> >>
> >> ... lolwut?
> >
> > JMF has explained that it is impossible, impossible I say!, to write an
> > editor using a flexible string representation. Since Emacs uses such a
> > flexible string representation, Emacs is impossible, and therefore Emacs
> > doesn't exist.
> >
> > QED.
>
> Except that the described representation used by Emacs is a variant of
> UTF-8, not an FSR.  It doesn't have three different possible encodings
> for the letter 'a' depending on what other characters happen to be in
> the string.
>
> As I understand it, jfm would be perfectly happy if Python used UTF-8
> (or presumably the Emacs variant) as its internal string
> representation.

------

And Emacs is probably working smoothly.

Your comment summarized all of this very accurately and
very concisely.

UTF-8/16/32? I do not care. They all work correctly,
smoothly and efficiently. In fact, these UTFs already
do correctly what this FSR does the wrong way.
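
For readers following the comparison, here is a minimal sketch of
the difference being argued about. It assumes CPython 3.3 or later
(where the FSR of PEP 393 is in use) and relies on sys.getsizeof,
which is implementation-specific; the helper names are hypothetical
and only serve the illustration. It measures how many bytes one
character costs in each UTF, and how many bytes the letter 'a' costs
inside a str depending on the widest character in that str.

```python
import sys

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # BOM-less variants


def encoded_sizes(ch):
    """Bytes needed to encode one character in each UTF."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def fsr_bytes_per_char(ch, n=1000):
    """Marginal storage of one more `ch` in a str made only of `ch`.

    Assumption: CPython 3.3+; other implementations may differ.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


def fsr_mixed_bytes_per_char(filler, wide, n=1000):
    """Marginal storage of `filler` when the str also contains `wide`."""
    longer = filler * (n + 1) + wide
    shorter = filler * n + wide
    return sys.getsizeof(longer) - sys.getsizeof(shorter)


if __name__ == "__main__":
    for ch in ("a", "\u20ac", "\U0001F600"):
        print(repr(ch), encoded_sizes(ch), fsr_bytes_per_char(ch))
    # Cost of 'a' next to a BMP character, then next to an astral one.
    print(fsr_mixed_bytes_per_char("a", "\u20ac"))
    print(fsr_mixed_bytes_per_char("a", "\U0001F600"))
```

In each UTF the size of 'a' is fixed by the encoding alone, whereas
under the FSR it follows the widest code point present in the string.
Whether that is a defect or a reasonable trade-off is the point in
dispute in this thread; the sketch only shows the mechanism.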

My preference? UTF-32. Why? It is the simplest and
consequently the best-performing choice. I'm not a narrow-minded
ASCII user. (I do not claim to belong to those who
are squaring the circle; I claim to belong to those
who know that squaring the circle is not possible.)

Note: text processing tools, or tools that have to process
characters, and the tools used to build these tools, are all
moving to UTF-32, if they have not done so already. There are
technical reasons behind this, which go beyond raw Unicode
itself. They are, however, still 100% Unicode compliant.

jmf


