Blog "about python 3"

wxjmfauth at gmail.com
Sat Jan 4 08:52:20 EST 2014


On Friday, 3 January 2014 12:14:41 UTC+1, Robin Becker wrote:
> On 02/01/2014 18:37, Terry Reedy wrote:
> > On 1/2/2014 12:36 PM, Robin Becker wrote:
> >
> >> I just spent a large amount of effort porting reportlab to a version
> >> which works with both python2.7 and python3.3. I have a large number of
> >> functions etc which handle the conversions that differ between the two
> >> pythons.
> >
> > I am imagine that this was not fun.
>
> indeed :)
>
> >> For fairly sensible reasons we changed the internal default to use
> >> unicode rather than bytes.
> >
> > Do you mean 'from __future__ import unicode_literals'?
>
> No, previously we had default of utf8 encoded strings in the lower levels of the
> code and we accepted either unicode or utf8 string literals as inputs to text
> functions. As part of the port process we made the decision to change from
> default utf8 str (bytes) to default unicode.
>
> > Am I correct in thinking that this change increases the capabilities of
> > reportlab? For instance, easily producing an article with abstracts in English,
> > Arabic, Russian, and Chinese?
>
> It's made no real difference to what we are able to produce or accept since utf8
> or unicode can encode anything in the input and what can be produced depends on
> fonts mainly.
>
> > > After doing all that and making the tests
> ...........
> >> I know some of these tests are fairly variable, but even for simple
> >> things like paragraph parsing 3.3 seems to be slower. Since both use
> >> unicode internally it can't be that can it, or is python 2.7's unicode
> >> faster?
> >
> > The new unicode implementation in 3.3 is faster for some operations and slower
> > for others. It is definitely more space efficient, especially compared to a wide
> > build system. It is definitely less buggy, especially compared to a narrow build
> > system.
> >
> > Do your tests use any astral (non-BMP) chars? If so, do they pass on narrow 2.7
> > builds (like on Windows)?
>
> I'm not sure if we have any non-bmp characters in the tests. Simple CJK etc etc
> for the most part. I'm fairly certain we don't have any ability to handle
> composed glyphs (multi-codepoint) etc etc
>
> ....
> > For one thing, indexing and slicing just works on all machines for all unicode
> > strings. Code for 2.7 and 3.3 either a) does not index or slice, b) does not
> > work for all text on 2.7 narrow builds, or c) has extra conditional code only
> > for 2.7.
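
A minimal illustration of the narrow-build issue mentioned above, assuming
a narrow 2.7 build (e.g. the default Windows build) on one side and 3.3 on
the other:

>>> s = u'\U0001F600'    # a non-BMP ("astral") code point
>>> len(s)               # narrow 2.7 build: stored as a surrogate pair
2
>>> s[0]                 # indexing returns a lone surrogate half
u'\ud83d'

>>> s = '\U0001F600'     # same code point on 3.3
>>> len(s)               # indexing and slicing now work per code point
1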

----

To Robin Becker

I know nothing about ReportLab except that it exists.
Your story is very interesting. As I said, I know nothing
about the internals of ReportLab or its technical aspects
(the "Python part", the API used for the PDF creation).
I do, however, have some experience with the unicode TeX
engine, XeTeX, so I understand a little of what is
happening behind the scenes.

The very interesting aspect is the way you are handling
unicode (strings). By comparing Python 2 with Python 3.3,
you are really comparing utf-8 with the internal "representation"
of Python 3.3 (the flexible string representation, which stores
a string with 1, 2 or 4 bytes per code point, depending on the
widest code point it contains). In one sense, that is more than
comparing Py2 with Py3.

It would be much more interesting to compare utf-8 and the
Python internals in the light of Python 3.2 versus Python 3.3.
Python 3.2 has decent unicode handling; Python 3.3 has an
absurd (in the mathematical sense) unicode handling. This
really shows with utf-8, where the flexible string
representation does just the opposite of what a correct
unicode implementation does!

On the memory side, this is easy to see:

>>> import sys
>>> sys.getsizeof('a'*10000 + 'z')
10026
>>> sys.getsizeof('a'*10000 + '€')    # one non-latin-1 char widens the whole string to 2 bytes per code point
20040
>>> sys.getsizeof(('a'*10000 + 'z').encode('utf-8'))
10018
>>> sys.getsizeof(('a'*10000 + '€').encode('utf-8'))    # utf-8 only pays for the '€' itself (3 bytes)
10020

On the performance side, it is much more complex, but
qualitatively one may expect the same kind of result.
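
For instance, here is a minimal sketch of the kind of micro-benchmark
one can run (illustrative only; absolute timings depend on the build
and the machine):

>>> import timeit
>>> timeit.timeit("'a'*1000 + 'z'", number=100000)   # result stays at 1 byte per code point
>>> timeit.timeit("'a'*1000 + '€'", number=100000)   # result has to be widened to 2 bytes per code point

On 3.3 the second call is typically the slower one, because the 1-byte
buffer has to be converted rather than simply copied.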


The funny aspect is that by working with utf-8 in that case,
you are (or one is) forcing Python to work properly, but one
pays for it in performance. And if one wishes to save memory,
one again has to pay for it in performance.

In other words, attempting to do what Python is not able to
do natively is just impossible!


I'm skipping the very interesting subject of composed glyphs
(unicode normalization, ...), but I wish to point out that
with the flexible string representation one reaches the top
level of surrealism, for a tool which is supposed to handle
these very specific unicode tasks...
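
Just to make the composed-glyph point concrete, here is a tiny example
with the standard unicodedata module (nothing ReportLab-specific):

>>> import unicodedata
>>> s = 'e\u0301'    # 'e' followed by U+0301 COMBINING ACUTE ACCENT: one glyph, two code points
>>> len(s)
2
>>> len(unicodedata.normalize('NFC', s))    # NFC composes the pair into the single code point 'é'
1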

jmf


