[IronPython] x = unicode(someExtendedUnicodeString) fails.

Dino Viehland dinov at microsoft.com
Thu Dec 17 20:27:28 CET 2009


This is one of those bugs where it's simply not clear that it can be fixed at all. The problem is that we have four different calls to try to be compatible with:

unicode(some_unicode_string)
unicode(some_ascii_string)
str(some_unicode_string)
str(some_ascii_string)

But in IronPython we don't know whether some_*_string is Unicode or ASCII, because strings are always Unicode. We also don't know whether we're calling unicode or str, because they're the same type as well. So there are 4 possible behaviors in CPython but there can be only 1 behavior in IronPython. Ultimately we need to pick which behaviors we want to be incompatible with :(  Maybe now that we have bytes we should look at changing which one we picked, so that if you replace str with bytes we could match CPython. But most likely this problem, and other subtle Unicode issues like it, won't be completely solvable until IronPython 3k.
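
To make the four cases concrete, here is a rough model of them written for Python 3, where bytes stands in for CPython 2's 8-bit str and str stands in for unicode. The names py2_unicode and py2_str are made up for this sketch, and it assumes the default codec is ASCII, as it is on a stock CPython 2. It only covers string arguments and is not how either implementation actually does the work.

```python
# Python 3 model of the four CPython 2 cases (hypothetical helper names).
# bytes plays the role of the 8-bit str type, str plays the role of unicode.

def py2_unicode(obj):
    """Model of CPython 2 unicode(obj) for string arguments."""
    if isinstance(obj, str):
        return obj                  # already Unicode: returned without decoding
    return obj.decode("ascii")      # 8-bit string: decoded with the assumed default codec


def py2_str(obj):
    """Model of CPython 2 str(obj) for string arguments."""
    if isinstance(obj, bytes):
        return obj                  # already 8-bit: returned unchanged
    return obj.encode("ascii")      # Unicode: encoded with the assumed default codec


roman_twelve = "\u217b"             # SMALL ROMAN NUMERAL TWELVE, a single character

same = py2_unicode(roman_twelve)    # unicode(some_unicode_string)
decoded = py2_unicode(b"xii")       # unicode(some_ascii_string)
unchanged = py2_str(b"xii")         # str(some_ascii_string)

try:
    py2_str(roman_twelve)           # str(some_unicode_string)
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True            # non-ASCII text cannot be encoded as ASCII
```

Three of the four calls succeed and one raises. When both argument types and both callables collapse into a single type, as described above, a single code path has to stand in for all four branches.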

From: users-bounces at lists.ironpython.com [mailto:users-bounces at lists.ironpython.com] On Behalf Of Vernon Cole
Sent: Thursday, December 17, 2009 11:06 AM
To: Discussion of IronPython
Subject: [IronPython] x = unicode(someExtendedUnicodeString) fails.

I just tripped over this one and it took some time to figure out what in blazes was going on. You may want to watch for it when porting CPython code.

I was cleaning up an input argument using
     s = unicode(S.strip().upper())
where S is the argument supplying the value I need to convert.

When I handed the function a genuine unicode string, such as in:
     assert Roman(u'\u217b') == 12 #unicode Roman number 'xii' as a single character
IronPython complains with:
    UnicodeEncodeError: ('unknown', '\x00', 0, 1, '')

The Python manual says:
If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

It turns out that this was already reported on codeplex as:
http://ironpython.codeplex.com/WorkItem/View.aspx?WorkItemId=15372
but the reporting party did not catch the fact that he had located an incompatibility with documented behavior.
It has been sitting on a back burner for some time.

Others may want to join me in voting this up.  Meanwhile I will add an unneeded exception handler to my own code.
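
For anyone else who hits this, one possible shape for such a guard is sketched below. It uses a type check instead of an exception handler: if the value is already text, the conversion is skipped entirely. The names text_type, to_text and clean are invented for this sketch, and the NameError fallback is only there so the same file also loads on Python 3, where unicode no longer exists.

```python
try:
    text_type = unicode             # Python 2 and IronPython 2.x
except NameError:
    text_type = str                 # Python 3: str is already the text type


def to_text(value):
    """Return value as text, skipping the conversion when it is already text."""
    if isinstance(value, text_type):
        return value                # never hand an existing text string to unicode()
    return text_type(value)         # non-string objects still get converted


def clean(S):
    """Hypothetical stand-in for the argument cleanup described above."""
    return to_text(S.strip().upper())
```

With this in place, clean(u' xii ') gives u'XII', and a string containing u'\u217b' is passed through without going near the conversion that raises.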
--
Vernon Cole