[pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

hubo hubo at jiedaibao.com
Mon Mar 7 07:49:24 EST 2016


Thanks for the link!

It is interesting that this is still the case in Python 3.5:

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09' == u'\U00011409'
False 

I think that in Python 3.x, u'\ud805\udc09' is not an alternative spelling of u'\U00011409'; it is simply an ill-formed Unicode string made of two lone surrogates. Encoding it to UTF-8 raises UnicodeEncodeError. The problem is that it is still legal to create and use such strings. If PyPy uses UTF-8 or UTF-16 as the internal storage format, I don't think it can match CPython on these details exactly, but that should be acceptable.
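
To make this concrete, here is a small sketch, assuming Python 3.4 or later (the 'surrogatepass' handler for UTF-16 needs 3.4). The variable names are only for illustration:

```python
# Two lone surrogates versus the single supplementary-plane code point.
pair = '\ud805\udc09'
astral = '\U00011409'

print(len(pair), len(astral))  # 2 1
print(pair == astral)          # False

# Strict UTF-8 encoding rejects lone surrogates.
try:
    pair.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)               # False

# The 'surrogatepass' error handler encodes each surrogate on its own,
# which gives 6 bytes that are not well-formed UTF-8.
raw = pair.encode('utf-8', 'surrogatepass')
print(raw)                     # b'\xed\xa0\x85\xed\xb0\x89'

# Going through UTF-16 shows that the two surrogates are exactly the
# UTF-16 code units of the supplementary-plane code point.
roundtrip = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
print(roundtrip == astral)     # True
```

So the two strings differ as sequences of code points, even though the first one looks like the UTF-16 form of the second.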

Thanks again for the discussion. Unicode is really complicated.

2016-03-07 

hubo 



From: Steven D'Aprano <steve at pearwood.info>
Sent: 2016-03-07 19:45
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "pypy-dev" <pypy-dev at python.org>
Cc:

On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: 
> I think you're misunderstanding what we're proposing. 
>  
> We're proposing utf8 representation completely hidden from the user, 
> where everything behaves just like cpython unicode (the len() example 
> you're showing is a narrow unicode build I presume?) 

Yes, CPython narrow builds don't handle Unicode code points in the
supplementary planes well: len() wrongly returns 2 for code points
that need a surrogate pair (4 bytes) in UTF-16:

steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')"  # wide build 
1 
steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')"  # narrow build 
2 


That is no longer the case since Python 3.3, when the "flexible  
string representation" was introduced. 

https://www.python.org/dev/peps/pep-0393/ 
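
As a rough illustration of the difference, assuming CPython 3.3 or later (where there is no longer a narrow/wide build distinction), the same check counts code points rather than UTF-16 code units:

```python
import sys

s = '\U0010FFFF'

# len() counts code points, whatever the internal storage width is.
print(len(s))                      # 1

# The encoded forms still take 4 bytes each for this code point.
print(len(s.encode('utf-16-le')))  # 4
print(len(s.encode('utf-8')))      # 4

# sys.maxunicode is always 0x10ffff from Python 3.3 onwards.
print(hex(sys.maxunicode))         # 0x10ffff
```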

I think that it would be a very valuable experiment for PyPy to  
investigate moving to a UTF-8 internal representation. 


--  
Steve 