[IPython-dev] Buffers

Brian Granger ellisonbg at gmail.com
Tue Jul 27 18:22:44 EDT 2010


Do you guys want to chat about this on IRC?


On Tue, Jul 27, 2010 at 3:16 PM, MinRK <benjaminrk at gmail.com> wrote:

> Okay, so it sounds like we should never interpret unicode objects as simple
> strings, if I am understanding the arguments correctly.
>
> I certainly don't think that sending anything providing the buffer
> interface should raise an exception, though. It should be up to the user to
> know whether the buffer will be legible on the other side.
>
> The situation I'm concerned about is that json gives you unicode strings,
> whether that was the input or not.
> s1 = 'word'
> j = json.dumps(s1)
> s2 = json.loads(j)
> # u'word'
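[Editor's note: the session above is Python 2, where json hands back unicode even for plain str input. The same boundary problem is still visible in modern Python, where str is already text, as this minimal sketch shows — JSON simply has no bytes type:]

```python
import json

# Sketch of the boundary problem in modern Python, where str is already
# text: JSON round-trips text fine, but JSON has no bytes type, so
# byte-ness can never survive a dumps/loads round trip.
s2 = json.loads(json.dumps('word'))   # comes back as text

try:
    json.dumps(b'word')               # bytes are not JSON-serializable
    survived = True
except TypeError:
    survived = False                  # must encode/decode explicitly
```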
>
> Now, if you have that logic internally, and you are sending messages based
> on messages you received, unless you wrap _every single thing_ you pass to
> send in str(), then you are calling things like send(u'word').  I really
> don't think that should raise an error, but trunk surely does.
>
> The other option is to always interpret unicode objects like everything
> else, sending their buffer and trusting that the receiving end will call
> decode (which may require that the message be copied at least one extra
> time). This would also mean that if A sends something packed by json to B,
> B unpacks it, and it included a str to be sent to C, then B has a
> unicode-wrapped version of it (not a str). If B then sends it on to C, C
> will get a string that will _not_ be the same as the one A packed and sent
> to B. I think this is terrible, since the fix in zmq is so obvious (and
> already done).
>
> I think that the vast majority of the time you are faced with unicode
> strings, they are in fact simple str instances that got wrapped, and we
> should expect that and deal with it.
>
> I decided to run some tests, since I currently have a UCS2 (OS X 10.6.4)
> machine and a UCS4 (Ubuntu 10.04) machine.
> They are both running my `patches` zmq branch right now, and I'm having no
> problems.
>
> case 1: sys.defaultencoding = utf8 on mac, ascii on ubuntu.
> a.send(u'who') # valid ascii, valid utf-8, ascii string sent
> b.recv()
> # 'who'
>
> u=u'whoπ'
> # u'who\xcf\x80'
>
> a.send(u'whoπ') # invalid ascii, valid utf-8, utf-8 string sent
> b.recv().decode('utf-8')
> # u'who\xcf\x80'
>
> case 2: sys.defaultencoding = ascii,ascii
> a.send(u'who') # valid ascii, string sent
> b.recv()
> # 'who'
>
> u=u'whoπ'
> u
> # u'who\xcf\x80'
>
> a.send(u'whoπ') # invalid ascii, buffer sent
> s = b.recv()
> # 'w\x00h\x00o\x00\xcf\x00\x80\x00'
> s.decode('utf-8')
> # UnicodeError (invalid utf-8)
> s.decode('utf16')
> # u'who\xcf\x80'
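[Editor's note: the interleaved null bytes seen above are exactly what the little-endian 16-bit (UCS2/UTF-16-LE) buffer of that string looks like. A sketch in modern Python reproduces the byte pattern:]

```python
# The string from the session above: 'who' followed by U+00CF and U+0080
u = 'who\xcf\x80'

# Its little-endian 16-bit buffer, two bytes per character -- exactly the
# interleaved-null pattern seen on the receiving end
raw = u.encode('utf-16-le')

# Decoding with the same codec recovers the original string
roundtrip = raw.decode('utf-16-le')
```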
>
>
> It seems that the _buffer_ of a unicode object is always utf16 (at least
> on a UCS2 sender; presumably a UCS4 sender would expose a UCS4 buffer)
>
> I also did it with utf-8 on both sides, and threw in some latin-1, and
> there was no difference between those and case 1.
>
> I can't find the problem here.
>
> As far as I can tell, a unicode object is:
> a) a valid string for the sender, and the string is sent in the sender's
> default encoding
> on the receiver:
>     sock.recv().decode(sender.defaultcodec)
>     gets the object back
> b) not a valid string for the sender, and the utf16 buffer is sent
> on the receiver:
>     sock.recv().decode('utf16')
>     always seems to work
>
> I even tried various instances of specifying the encoding as latin, etc.
> and sending math symbols (√,∫) in various directions, and invariably the
> only thing I needed to know on the receiver was the default encoding on the
> sender. Everything was reconstructed properly with either
> s.decode(sender.defaultcodec) or s.decode(utf16), depending solely on
> whether str(u) would raise on the sender.
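[Editor's note: the decode rule described above could be sketched as a receiver-side helper. `recover` and `sender_codec` are made-up names, and `utf-16-le` stands in for the utf16-buffer case observed in the sessions:]

```python
def recover(data, sender_codec='ascii'):
    """Hypothetical helper: try the sender's default codec first;
    if that fails, assume we were handed the raw utf16 buffer."""
    try:
        return data.decode(sender_codec)
    except UnicodeDecodeError:
        return data.decode('utf-16-le')

a = recover(b'who')                                  # plain ascii path
b = recover(b'w\x00h\x00o\x00\xcf\x00\x80\x00')      # utf16-buffer path
```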
>
> Are there specific symbols and/or directions where I should see a problem?
> Based on my reading, I figured that math symbols would be the likeliest to
> break, but they certainly don't in either direction.
>
> -MinRK
>
>
> On Tue, Jul 27, 2010 at 13:13, Fernando Perez <fperez.net at gmail.com>wrote:
>
>> On Tue, Jul 27, 2010 at 12:23 PM, Brian Granger <ellisonbg at gmail.com>
>> wrote:
>> > This is definitely an issue.  Also, someone could set their own custom
>> > unicode encoding by hand and that would mess this up as well.
>> >
>> >>
>> >> If it is a problem, then there are some options:
>> >>
>> >> - disallow communication between ucs 2/4 pythons.
>> >
>> > But this doesn't account for other encoding/decoding setups.
>>
>> Note that when I mention ucs2/4, that refers to the *internal* python
>> storage of all unicode objects.  That is: ucs2/4 is how the buffer of a
>> unicode string is laid out in memory, under the hood.  There are
>> no other encoding/decoding setups for Python, this is strictly a
>> compile-time flag and can only be either ucs2 or ucs4.
>>
>> You can see the value by typing:
>>
>> In [1]: sys.maxunicode
>> Out[1]: 1114111
>>
>> That's ucs-4, and that number is the whole of the current unicode
>> standard.  If you instead get 65535 (2**16 - 1), you have a ucs2 build,
>> and python can only encode strings in the BMP (basic multilingual
>> plane, where all living languages are stored but not math symbols,
>> musical symbols and some extended Asian characters).
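[Editor's note: a small sketch of that distinction. `in_bmp` is a made-up helper; π (U+03C0) sits inside the BMP, while the mathematical-italic pi (U+1D70B) sits outside it and thus needs a wide build, or surrogate pairs on ucs2:]

```python
def in_bmp(s):
    # True if every code point fits in 16 bits, i.e. inside the BMP
    return all(ord(ch) <= 0xFFFF for ch in s)

inside = in_bmp('whoπ')         # GREEK SMALL LETTER PI is U+03C0: in the BMP
outside = in_bmp('\U0001D70B')  # MATHEMATICAL ITALIC SMALL PI: outside it
```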
>>
>> Does that make sense?
>>
>> Note that additionally, it's exceedingly rare for anyone to set up a
>> custom encoding for unicode.  It's hard to do right, requires plumbing
>> in the codecs module, and I think Python supports out of the box
>> enough encodings that I can't imagine why anyone would write a new
>> encoding.  But regardless, if a string has been encoded then it's OK:
>> now it's bytes, and there's no problem.
>>
>> >> - detect a mismatch and encode/decode all unicode strings to utf-8 on
>> >> send/receive, but allow raw buffer sending if there's no mismatch.
>> >
>> > This will be tough though if users set their own encoding.
>>
>> No, the issue with users having something other than utf-8 is
>> orthogonal to this.  The idea would be: if both ends of the
>> transmission have conflicting ucs internals, then all unicode strings
>> are sent as utf-8.  If a user sends an encoded string, then that's
>> just a bunch of bytes and it doesn't matter how they encoded it, since
>> they will be responsible for decoding it on the other end.
>>
>> But I still don't like this approach because the ucs2/4 mismatch is a
>> pair-wise problem, and for a multi-node setup managing this pair-wise
>> switching of protocols can be a nightmare.  And let's not even get
>> started on what pub/sub sockets would do with this...
>>
>> >> - *always* encode/decode.
>> >>
>> >
>> > I think this is the option that I prefer (having users do this in their
>> > application code).
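[Editor's note: that convention — always encode before send, always decode after recv — could look something like this sketch. `to_wire`/`from_wire` are hypothetical names, and utf-8 is the assumed wire codec:]

```python
def to_wire(obj):
    # text is always encoded before hitting the socket;
    # bytes pass through untouched
    return obj.encode('utf-8') if isinstance(obj, str) else obj

def from_wire(data, codec='utf-8'):
    # the receiver decides when (and whether) to decode
    return data.decode(codec)

assert from_wire(to_wire('whoπ')) == 'whoπ'
assert to_wire(b'raw bytes') == b'raw bytes'
```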
>>
>> Yes, now that I think of pub/sub sockets, I don't think we have a
>> choice.  It's a bit unfortunate that Python recently decided *not* to
>> standardize on a storage scheme:
>>
>> http://mail.python.org/pipermail/python-dev/2008-July/080886.html
>>
>> because it means forever paying the price of encoding/decoding in this
>> context.
>>
>> Cheers,
>>
>> f
>>
>> ps - as you can tell, I've been finally doing my homework on unicode,
>> in preparation for an eventual 3.x transition :)
>>
>
>


-- 
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com