Flexible string representation, unicode, typography, ...
wxjmfauth at gmail.com
Wed Aug 29 07:40:46 EDT 2012
On Wednesday, August 29, 2012 06:16:05 UTC+2, Ian wrote:
> On Tue, Aug 28, 2012 at 8:42 PM, rusi <rustompmody at gmail.com> wrote:
> > In summary:
> >
> > 1. The problem is not on jmf's computer
> > 2. It is not Windows-only
> > 3. It is not directly related to latin-1 encodable or not
> >
> > The only question which is not yet clear is this:
> > Given a typical string operation whose complexity is O(n), in more
> > detail it is going to be O(a + bn).
> > If only a is worse going from 3.2 to 3.3, it may be a small issue.
> > If b is worse by even a tiny amount, it is likely to be a significant
> > regression for some use-cases.
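The a-versus-b distinction quoted above can be probed directly: time one operation at two input lengths and fit the linear model t(n) = a + b*n. A minimal sketch (my own, not from the thread; str.upper is just an arbitrary O(n) stand-in):

```python
# Estimate the fixed per-call overhead "a" and the per-character cost "b"
# of a string operation by timing it at two lengths and solving
# t(n) = a + b*n. Results are noisy micro-measurements, not gospel.
import timeit

def cost_model(op, n1=1_000, n2=100_000, number=2_000):
    """Return (a, b) in seconds for the model t(n) = a + b*n."""
    s1, s2 = 'x' * n1, 'x' * n2          # build inputs outside the timed code
    t1 = min(timeit.repeat(lambda: op(s1), number=number)) / number
    t2 = min(timeit.repeat(lambda: op(s2), number=number)) / number
    b = (t2 - t1) / (n2 - n1)            # marginal cost per character
    a = t1 - b * n1                      # fixed per-call overhead
    return a, b

a, b = cost_model(str.upper)
print(f"fixed overhead a = {a:.2e} s, per-char cost b = {b:.2e} s")
```

Running this under 3.2 and 3.3 separately would show which of the two terms actually regressed.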
>
> As has been pointed out repeatedly already, this is a microbenchmark.
> jmf is focusing on one particular area (string construction) where
> Python 3.3 happens to be slower than Python 3.2, ignoring the fact
> that real code usually does lots of things other than building
> strings, many of which are slower to begin with. In the real-world
> benchmarks that I've seen, 3.3 is as fast as or faster than 3.2.
>
> Here's a much more realistic benchmark that nonetheless still focuses
> on strings: word counting.
>
> Source: http://pastebin.com/RDeDsgPd
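The pastebin contents are not reproduced in the archive. As a hedged reconstruction of what the three entry points might look like (only the names wc, wc_str, and wc_split come from the thread; everything else is an assumption), a word counter in the style described below could be:

```python
# Hypothetical reconstruction of the benchmarked helpers -- the actual
# pastebin code is not shown in the thread, so all details beyond the
# names wc / wc_str / wc_split are guesses.
import io
from collections import Counter

def wc_str(text):
    """Count words by iterating the lines of a StringIO wrapper."""
    counts = Counter()
    for line in io.StringIO(text):
        counts.update(line.split())
    return counts

def wc_split(text):
    """Count words by splitting the whole text at once
    (creates more temporary strings than the StringIO variant)."""
    return Counter(text.split())

def wc(filename):
    """Read a UTF-8 file and count its words."""
    with open(filename, 'r', encoding='utf-8') as f:
        return wc_str(f.read())
```

Both pure-string variants should produce identical counts; they differ only in how many intermediate string objects they allocate, which is exactly what the benchmarks below exercise.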
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc"
> "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 310 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc"
> "wc.wc('unilang8.htm')"
> 1000 loops, best of 3: 302 usec per loop
>
> "unilang8.htm" is an arbitrary UTF-8 document containing a broad swath
> of Unicode characters that I pulled off the web. Even though this
> program is still mostly string processing, Python 3.3 wins. Of
> course, that's not really a very good test -- since it reads the file
> on every pass, it probably spends more time in I/O than it does in
> actual processing. Let's try it again with prepared string data:
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
> open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 87.3 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
> open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_str(t)"
> 10000 loops, best of 3: 84.6 usec per loop
>
> Nope, 3.3 still wins. And just for the sake of my own curiosity, I
> decided to try it again using str.split() instead of a StringIO.
> Since str.split() creates more strings, I expect Python 3.2 might
> actually win this time.
>
> C:\Users\Ian\Desktop>c:\python32\python -m timeit -s "import wc; t =
> open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 88 usec per loop
>
> C:\Users\Ian\Desktop>c:\python33\python -m timeit -s "import wc; t =
> open('unilang8.htm', 'r', encoding='utf-8').read()" "wc.wc_split(t)"
> 10000 loops, best of 3: 76.5 usec per loop
>
> Interestingly, although Python 3.2 performs the splits in about the
> same time as the StringIO operation, Python 3.3 is significantly
> *faster* using str.split(), at least on this data set.
>
> > So doing some arm-chair thinking (I don't know the code and
> > difficulty involved):
> >
> > Clearly there are 3 string engines in the Python 3 world:
> > - 3.2 narrow
> > - 3.2 wide
> > - 3.3 (flexible)
> >
> > How difficult would it be to give the choice of string engine as a
> > command-line flag?
> > This would avoid the nuisance of having two binaries -- narrow and
> > wide.
>
> Quite difficult. Even if we avoid having two or three separate
> binaries, we would still have separate binary representations of the
> string structs. It makes the maintainability of the software go down
> instead of up.
>
> > And it would give the Python programmer a choice of efficiency
> > profiles.
>
> So instead of having just one test for my Unicode-handling code, I'll
> now have to run that same test *three times* -- once for each possible
> string engine option. Choice isn't always a good thing.
>
>
Forget Python and all these benchmarks. The problem
is on another level: coding schemes, typography,
usage of characters, ...
Within a given coding scheme, all code points/characters are
equivalent. Expecting to handle only a sub-range of a coding
scheme without breaking that coding scheme is impossible.
If a coding scheme does not give satisfaction, the only
valid solution is to create a new coding scheme: cp1252,
mac-roman, EBCDIC, ... or the interesting "TeX" case, where
the "internal" coding depends on the fonts!
Unicode (utf***), as just another coding scheme, does
not escape this rule.
This "Flexible String Representation" fails. Not only
is it unable to stick to one coding scheme, it is
a mixture of coding schemes, the worst of all possible
implementations.
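Whatever one makes of that argument, the "mixture" is easy to observe: under PEP 393 (the Flexible String Representation), CPython 3.3+ stores a string in 1, 2, or 4 bytes per code point depending on the widest character it contains. A small sketch (exact byte counts are CPython- and version-specific):

```python
# PEP 393 in action: same length, three different per-character widths.
# sys.getsizeof values are CPython implementation details and vary by
# version and platform; only the relative ordering is the point here.
import sys

n = 1000
ascii_s  = 'a' * n            # all chars fit latin-1  -> 1 byte/char
bmp_s    = '\u20ac' * n       # euro sign, BMP         -> 2 bytes/char
astral_s = '\U0001F600' * n   # emoji outside the BMP  -> 4 bytes/char

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))
```

A single non-latin-1 character is enough to widen the whole string, which is why benchmarks that repeatedly build mixed-width strings can regress while pure-ASCII workloads often get faster.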
jmf
More information about the Python-list mailing list