Micro Python -- a lean and efficient implementation of Python 3

wxjmfauth at gmail.com
Wed Jun 11 03:40:23 EDT 2014


On Tuesday, June 10, 2014 at 21:43:13 UTC+2, alister wrote:
> On Tue, 10 Jun 2014 12:27:26 -0700, wxjmfauth wrote:
> 
> > On Saturday, June 7, 2014 at 04:20:22 UTC+2, Tim Chase wrote:
> >> On 2014-06-06 09:59, Travis Griggs wrote:
> >> > On Jun 4, 2014, at 4:01 AM, Tim Chase wrote:
> >> > > If you use UTF-8 for everything
> >> > 
> >> > It seems to me that, increasingly, other libraries (C, etc.) use
> >> > utf8 as the preferred string interchange format.
> >> 
> >> I definitely advocate UTF-8 for any streaming scenario, as you're
> >> iterating unidirectionally over the data anyway, so why use/transmit
> >> more bytes than needed.  The only failing of UTF-8 that I've found in
> >> the real world(*) is when you have the requirement of constant-time
> >> indexing into strings.
> >> 
> >> -tkc
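
A minimal sketch of that failing, assuming well-formed UTF-8: to reach
the n-th character you have to walk every sequence before it, so
"constant-time indexing" degrades to a linear scan (the helper name is
mine, purely illustrative):

def utf8_char_at(data, n):
    # Skip the n preceding characters; each lead byte encodes how
    # many bytes (1-4) its sequence occupies.
    i = 0
    for _ in range(n):
        b = data[i]
        i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    b = data[i]
    size = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[i:i + size].decode('utf-8')

# e.g. utf8_char_at('a\u20acb'.encode('utf-8'), 1) == '\u20ac'
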
> > 
> > And once again, just an illustration,
> > 
> >>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = 'z'")
> > [0.9457552436453511, 0.9190932610143818, 0.9322044912393039]
> >>>> timeit.repeat("(x*1000 + y)", setup="x = 'abc'; y = '\u0fce'")
> > [2.5541921791045183, 2.52434366066052, 2.5337417948967413]
> >>>> timeit.repeat("(x*1000 + y)",
> >>>>               setup="x = 'abc'.encode('utf-8'); y = 'z'.encode('utf-8')")
> > [0.9168235779232532, 0.8989583403075017, 0.8964204541650247]
> >>>> timeit.repeat("(x*1000 + y)",
> >>>>               setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
> > [0.9320969737165115, 0.9086006535332558, 0.9051715140790861]
> >>>> sys.getsizeof('abc'*1000 + '\u0fce')
> > 6040
> >>>> sys.getsizeof(('abc'*1000 + '\u0fce').encode('utf-8'))
> > 3020
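
A plausible reading of those numbers, under CPython 3.3's PEP 393
flexible string representation: 'abc'*1000 is stored one byte per
character, and appending '\u0fce' (outside Latin-1) forces the whole
3001-character result into two bytes per character, hence roughly 2.5x
the time and twice the size, while the bytes version just concatenates
buffers. A minimal sketch (exact sizes vary by build):

import sys

base = 'abc' * 1000
print(sys.getsizeof(base))              # stored 1 byte per char
print(sys.getsizeof(base + '\u0fce'))   # result widened to 2 bytes per char
print(sys.getsizeof(base.encode('utf-8')
                    + '\u0fce'.encode('utf-8')))  # bytes stay bytes
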
> > 
> > But you know, that's not the problem.
> > 
> > When I see a core developer discussing benchmarking,
> > and the same application using non-ascii chars becomes 1, 2, 5, 10, 20
> > or more times slower compared to pure ascii, I'm wondering if there is
> > not a serious problem somewhere.
> > 
> > (and also becoming slower than Py3.2)
> > 
> > BTW, very easy to explain.
> > 
> > I do not understand why the "free, open, what-you-wish-here, ..."
> > software is so often pushing toward the adoption of serious corporate
> > products.
> > 
> > jmf
> 
> Your error reports always seem to revolve around benchmarks, despite
> speed not being one of Python's prime objectives.
> 
> Computers store data in bytes.
> ASCII characters can be stored in a single byte.
> Unicode code points cannot all be stored in a single byte;
> therefore Unicode will always be inherently slower than ASCII.
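
A quick illustration of the byte-count point (any Python 3; the sample
characters are arbitrary):

>>> [len(c.encode('utf-8')) for c in 'a\u00e9\u20ac\U0001f40d']
[1, 2, 3, 4]
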
> 
> Implementation details mean that some Unicode characters may be handled
> more efficiently than others; why is this wrong?
> Why should all Unicode operations be equally slow?
> 
> -- 
> There isn't any problem

%%%%%%%%

The point is elsewhere.

1) In Unicode, "ascii" does not exist as such; "ascii" is only a
reference to the characters of the ASCII coding scheme.
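
One version-independent illustration: the 128 ascii characters are
exactly Unicode's code points U+0000..U+007F, and UTF-8 encodes them
to the same single bytes:

>>> 'abc'.encode('ascii') == 'abc'.encode('utf-8')
True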

2) Python 3.2, 3.3, 3.4.

Python 3.3 optimizes ascii compared to py3.2 (memcpy fast paths, ...),
but if the performance gain is, let's say, a factor
n, the loss of performance in the non-ascii range is, let's say,
(m * n) with m >> 1.

Comparing 3.3 and 3.4 is very interesting. A lot of work
has been done, but what has been gained in some "methods"
has been lost, counterbalanced, on other sides. What is wrong
by design will always stay wrong by design. (I patiently
waited for py3.4, and what I expected just happened!)

Again, an *illustration* (with an example from the BDFL, who is not
happy about Python's performance!).
The following summarizes the situation a little bit.

py32:

>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.09113662222722835
>>> timeit.timeit("a = 'hundre EURO'; 'x' in a")
0.1029297261915687

py33:

>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12081905832395669
>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.2453480765512026

Ditto for py34
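
For anyone who wants to reproduce the illustration, a minimal sketch
(absolute timings vary by machine and build; number/repeat are just
reasonable defaults):

import timeit

for text in ('hundred', 'hundre\u20ac'):
    t = min(timeit.repeat("'x' in a", setup='a = %r' % text,
                          number=10**6, repeat=3))
    print('%r: %.3f s' % (text, t))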

Not only is py3.3+ worse than pre-3.3; the situation
is even worse with non-ascii chars.

The memory situation is not better.

py33:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('\U00010000')
44

This is very easy to explain with a sheet of paper
and a pencil (or should I say a blackboard and
a piece of chalk?).
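
The paper-and-pencil version is PEP 393, CPython 3.3's flexible string
representation: a str is stored with 1, 2 or 4 bytes per character,
chosen by the widest code point it contains, so a single wide character
widens the whole string. A sketch (exact sizes vary between 32- and
64-bit builds):

import sys

narrow = 'x' * 1000                   # all code points < 256: 1 byte/char
for ch in ('a', '\u20ac', '\U00010000'):
    # the widest character dictates the storage width of the result
    print('U+%04X' % ord(ch), sys.getsizeof(narrow + ch))
# the three sizes reflect 1, 2 and 4 bytes per character plus a header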

----

I'm still using Python; it has just become the best tool
to illustrate unicode! (when it is not failing or
crashing).

Do not blame me if I do not recommend Python and
if I'm using Python as a demonstration tool; it is
impossible to find something worse when it comes
to unicode handling.

There are plenty of other very bad side effects
(prerequisite: a good understanding of the
coding of characters and of unicode).

When I see all these links pointing to wikipedia
or other sites, and practically never to
unicode.org ...

I'm optimistic: py devs will never put their
fingers into a TeX unicode engine; how could
they? ;-)

I'm observing all this stuff from a unicode perspective.
Nothing wrong with that. Hobbyist tools will always stay
hobbyist tools.

jmf



