[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
hubo
hubo at jiedaibao.com
Mon Mar 7 07:49:24 EST 2016
Thanks for the link!
It is interesting that Python 3.5 still gives:
>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09' == u'\U00011409'
False
I think that in Python 3.x, u'\ud805\udc09' is not an alternative spelling of u'\U00011409'; it is simply an ill-formed Unicode string made of two surrogate code points. Trying to encode it to UTF-8 raises UnicodeEncodeError. The problem is that it is still legal to define and use such strings. If PyPy uses UTF-8 or UTF-16 as the internal storage format, I don't think it can match CPython exactly on these details, but that should be acceptable.
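For reference, here is a small sketch of the behaviour described above. It assumes CPython 3.4 or later (the 'surrogatepass' error handler only works with the UTF-16 codecs from 3.4 on); the variable names and the UTF-16 round trip at the end are illustrative additions, not something taken from the thread.

```python
# Python 3.4+; the escapes are the ones from the session above.
pair = '\ud805\udc09'      # two surrogate code points stored separately
single = '\U00011409'      # one supplementary-plane code point

pair_len = len(pair)       # 2, as in the session above
same = (pair == single)    # False, as in the session above

# The default ("strict") error handler refuses to encode surrogates to UTF-8.
try:
    pair.encode('utf-8')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# 'surrogatepass' encodes each surrogate on its own, 3 bytes apiece.
raw = pair.encode('utf-8', 'surrogatepass')

# Written out as UTF-16 code units and decoded again, the two surrogates
# are read back as a pair, i.e. as the single code point U+11409.
joined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
```

So the two strings only become equal after an explicit round trip through UTF-16; Python 3 itself never merges them.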
Thanks again for the discussion. Unicode is really complicated.
2016-03-07
hubo
From: Steven D'Aprano <steve at pearwood.info>
Sent: 2016-03-07 19:45
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "pypy-dev" <pypy-dev at python.org>
Cc:
On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote:
> I think you're misunderstanding what we're proposing.
>
> We're proposing utf8 representation completely hidden from the user,
> where everything behaves just like cpython unicode (the len() example
> you're showing is a narrow unicode build I presume?)
Yes, CPython narrow builds don't handle Unicode code points in the
supplementary planes well: len() wrongly returns 2 for code points
with a 4-byte UTF-16 representation:
steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')" # wide build
1
steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')" # narrow build
2
That is no longer the case since Python 3.3, when the "flexible
string representation" was introduced.
https://www.python.org/dev/peps/pep-0393/
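A quick way to check this on Python 3.3 or later is sketched below. The size comparison at the end relies on sys.getsizeof, which reports CPython-specific numbers, so treat that part as an assumption about CPython rather than a language guarantee.

```python
import sys

ch = '\U0010FFFF'

n = len(ch)                                   # 1 on every Python 3.3+ build
utf16_bytes = len(ch.encode('utf-16-le'))     # 4: a surrogate pair on the wire
utf8_bytes = len(ch.encode('utf-8'))          # 4
max_cp = sys.maxunicode                       # 0x10FFFF since 3.3

# CPython-specific: PEP 393 picks 1, 2 or 4 bytes per character depending
# on the widest code point in the string, so these sizes should grow.
size_latin1 = sys.getsizeof('a' * 100)
size_ucs2 = sys.getsizeof('\u0100' * 100)
size_ucs4 = sys.getsizeof('\U00010000' * 100)
```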
I think that it would be a very valuable experiment for PyPy to
investigate moving to a UTF-8 internal representation.
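To make the idea concrete, here is a minimal sketch of how len() and indexing could be answered in code points on top of UTF-8 bytes. This is not PyPy's design; utf8_len and utf8_char_at are hypothetical helper names, the input is assumed to be well-formed UTF-8, and the linear scan is the naive approach (a real implementation would presumably need some kind of index to avoid it).

```python
def utf8_len(data):
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_char_at(data, index):
    """Return code point number `index` (0-based) by scanning from the start."""
    if index < 0:
        raise IndexError(index)
    seen = -1        # index of the most recent lead byte
    start = None     # byte offset where the wanted code point begins
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # lead byte
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:     # next code point starts here
                return data[start:pos].decode('utf-8')
    if start is None:
        raise IndexError(index)
    return data[start:].decode('utf-8')  # wanted code point was the last one


s = 'a\U0010FFFFb'
data = s.encode('utf-8')   # 1 + 4 + 1 = 6 bytes
```

With this, utf8_len(data) agrees with len(s) and utf8_char_at(data, 1) returns the supplementary-plane character, so the user-visible behaviour matches a wide build even though the storage is UTF-8.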
--
Steve