Flexible string representation, unicode, typography, ...

wxjmfauth at gmail.com
Wed Aug 29 07:40:46 EDT 2012


On Wednesday, August 29, 2012 at 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:
> > In summary:
> > 1. The problem is not on jmf's computer
> > 2. It is not windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation that is complexity O(n), in more
> > detail it is going to be O(a + bn).
> > If only a is worse going 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
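The O(a + bn) model above can be made concrete: given two timings at different input sizes, the fixed overhead a and the per-character cost b can both be recovered. The numbers below are invented for illustration, not measured:

```python
# Illustrative only: solve t = a + b*n from two hypothetical timings.
# The timings (in microseconds) are made up, not measured.
n1, t1 = 1000, 12.0     # small input
n2, t2 = 10000, 57.0    # large input

b = (t2 - t1) / (n2 - n1)   # per-character cost
a = t1 - b * n1             # fixed overhead

print(a, b)  # overhead of about 7.0 usec, about 0.005 usec per character
```

A regression in a only shifts every timing by a constant; a regression in b grows with input size, which is why the two cases matter so differently.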
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with.  In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
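The pastebin above is the authoritative source; as a rough sketch of what is being timed, the three functions presumably look something like this (my reconstruction, not Ian's actual code):

```python
import io
from collections import Counter

def wc_str(text):
    # Feed the text through a StringIO so it is consumed line by
    # line, the same way iterating a file object would.
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts

def wc_split(text):
    # Split the whole text at once; this allocates one new string
    # object per word up front.
    return Counter(text.split())

def wc(filename):
    # File-reading variant: decode the UTF-8 document, then count.
    with open(filename, 'r', encoding='utf-8') as f:
        return wc_str(f.read())
```

The point of the three variants is to vary how many intermediate string objects get created, which is exactly what the benchmarks below probe.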
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc" "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web.  Even though this
> program is still mostly string processing, Python 3.3 wins.  Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing.  Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins.  And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t = open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and difficulty
> > involved):
> >
> > Clearly there are 3 string-engines in the python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
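For what it's worth, which of the three engines a given interpreter uses can be detected at run time; `sys.maxunicode` is the standard indicator (a small sketch):

```python
import sys

# On a 3.2 "narrow" build sys.maxunicode is 0xFFFF and astral
# characters are stored as surrogate pairs (so their len() is 2);
# on a "wide" 3.2 build, and on every 3.3+ build after PEP 393,
# it is 0x10FFFF and len() of an astral character is 1.
if sys.maxunicode == 0xFFFF:
    print("narrow build; astral char length:", len('\U0001F600'))
else:
    print("wide/flexible build; astral char length:", len('\U0001F600'))
```

Note that this only tells the two 3.2 builds apart from each other; a 3.2 wide build and a 3.3 flexible build both report 0x10FFFF.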
>
> Quite difficult.  Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs.  It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option.  Choice isn't always a good thing.

Forget Python and all these benchmarks. The problem
is on another level: coding schemes, typography,
usage of characters, ...

For a given coding scheme, all code points/characters are
equivalent. Expecting to handle a sub-range of a coding
scheme without disturbing the scheme as a whole is impossible.

If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme: cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!

Unicode (utf***), being just another coding scheme, does
not escape this rule.

This "Flexible String Representation" fails. Not only
is it unable to stick with a single coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.
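This "mixing" is easy to observe from within CPython 3.3: under PEP 393, each string is stored in 1, 2, or 4 bytes per character depending on the widest character it contains, so memory use jumps at each threshold (exact byte counts vary by version and platform):

```python
import sys

# Three 1000-character strings whose widest characters force the
# 1-byte (Latin-1), 2-byte (UCS-2), and 4-byte (UCS-4) layouts.
ascii_s  = 'a' * 1000
bmp_s    = '\u20ac' * 1000        # EURO SIGN, needs 2 bytes/char
astral_s = '\U0001F600' * 1000    # emoji, needs 4 bytes/char

for s in (ascii_s, bmp_s, astral_s):
    # Same length, very different footprints.
    print(len(s), sys.getsizeof(s))
```

So a string's internal representation, and hence the cost of operations on it, depends on its content, not only on its length.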

jmf



More information about the Python-list mailing list