[Tutor] simple unicode question

Albert-Jan Roskam fomcl at yahoo.com
Tue Aug 26 12:58:17 CEST 2014


----- Original Message -----
> From: Danny Yoo <dyoo at hashcollision.org>
> To: Albert-Jan Roskam <fomcl at yahoo.com>
> Cc: Python Tutor Mailing List <tutor at python.org>
> Sent: Saturday, August 23, 2014 2:53 AM
> Subject: Re: [Tutor] simple unicode question
> 
> On Fri, Aug 22, 2014 at 2:10 PM, Albert-Jan Roskam
> <fomcl at yahoo.com.dmarc.invalid> wrote:
>>  Hi,
>> 
>>  I have data that is either floats or byte strings in utf-8. I need to cast
>>  both to unicode strings.
> 
> 
> Just to be sure, I'm parsing the problem statement above as:
> 
>     data :== float
>                 | utf-8-encoded-byte-string


Yep, that's how I meant it :-)

> because the alternative way to parse the statement in English:
> 
>     data :== float-in-utf-8
>                 | byte-string-in-utf-8
> 
> doesn't make any technical sense.  :P
> 
> 
> 
> 
>>  I am probably missing something simple, but.. in the code below, under
>>  "float", why does [B] throw an error but [A] does not?
>> 
> 
>>  # float: cannot explicitly give encoding, even if it's the default
>>  >>> value = 1.0
>>  >>> unicode(value)      # [A]
>>  u'1.0'
>>  >>> unicode(value, sys.getdefaultencoding())  # [B]
>> 
>>  Traceback (most recent call last):
>>    File "<pyshell#22>", line 1, in <module>
>>      unicode(value, sys.getdefaultencoding())
>>  TypeError: coercing to Unicode: need string or buffer, float found
> 
> 
> Yeah.  Unfortunately, you're right: this doesn't make too much sense.
>
> What's happening is that the standard library overloads two
> _different_ behaviors to the same function unicode(). It's conditioned
> on whether we're passing in a single value, or if we're passing in
> two.  I would not try to reconcile a single, same behavior for both
> uses: treat them as two distinct behaviors.
> 
> Reference: https://docs.python.org/2/library/functions.html#unicode

Hi,
 
First, THANKS for all your replies!

Aahh, the first two lines in the link clarify things:
unicode(object='')
unicode(object[, encoding[, errors]])

I would find it clearer if the docstring also started with these two lines or, alternatively, with unicode(*args).
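
As a rough illustration of the two call forms, here is a minimal sketch. It assumes Python 3, where str takes over the role of Python 2's unicode() and offers an analogous pair of call forms; it is not code from this thread, and the sample byte string is made up.

```python
# Python 3 sketch (assumption): str stands in for Python 2's unicode().
value = 1.0

# One-argument form: any object is converted to its string representation.
one_arg = str(value)

# Two-argument form: bytes are decoded with the given encoding.
decoded = str(b"caf\xc3\xa9", "utf-8")

# Passing a float to the decoding form fails, much like case [B] above.
try:
    str(value, "utf-8")
except TypeError as exc:
    error = exc
else:
    error = None

print(repr(one_arg), repr(decoded), type(error).__name__)
```

The point is only that the second argument switches the call from "convert to text" to "decode these bytes", so a float is rejected there.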

I have tried to find these two functions in the source code, but I can't find them (then again, I don't speak C). Somewhere near line 1200, perhaps? http://hg.python.org/cpython/file/74236c8bf064/Objects/unicodeobject.c


> Specifically, the two arg case is meant where you've got an
> uninterpreted source of bytes that should be decoded to Unicode using
> the provided encoding.
> 
> 
> So for your problem statement, the function should look something like:
> 
> ###############################
> def convert(data):
>     if isinstance(data, float):
>         return unicode(data)
>     if isinstance(data, bytes):
>         return unicode(data, "utf-8")
>     raise ValueError("Unexpected data", data)
> ###############################
> 
> where you must use unicode with either the 1-arg or 2-arg variant
> based on your input data.


Interesting, you follow a "look before you leap" (LBYL) approach here, whereas `they` always say that in Python it is "easier to ask forgiveness than permission" (EAFP). But in the timings below LBYL is about twice as fast, which is relevant because the function could be called millions and millions of times. I have noticed before that try-except is quite an expensive structure to initialize (for instance, membership testing with 'in' is cheaper than try-except-KeyError when getting items from a dictionary).
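
The dictionary case can be checked with a small timing sketch like the one below. It is only an illustration: the table, the function names and the bench() helper are hypothetical, and the numbers will vary by machine and Python version. In CPython the expensive part is usually raising and catching the exception rather than entering the try block, so it is worth timing hits and misses separately.

```python
import timeit

# Hypothetical lookup table, used only for this comparison.
table = {i: str(i) for i in range(1000)}

def get_lbyl(key):
    # Look before you leap: test membership first.
    if key in table:
        return table[key]
    return None

def get_eafp(key):
    # Ask forgiveness: attempt the lookup and handle the failure.
    try:
        return table[key]
    except KeyError:
        return None

def bench(func, key, number=100000):
    # Total seconds for `number` calls of func(key).
    return timeit.timeit(lambda: func(key), number=number)

if __name__ == "__main__":
    for label, key in (("hit", 5), ("miss", -1)):
        print(label, bench(get_lbyl, key), bench(get_eafp, key))
```

Both functions return the same values; only the cost per call differs, and which one wins tends to depend on how often the key is missing.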

In [1]: def convert1(data):
...:         if isinstance(data, float):
...:                 return unicode(data)
...:         if isinstance(data, bytes):
...:                 return unicode(data, "utf-8")
...:         raise ValueError("Unexpected data", data)
...:

In [2]: %timeit map(convert1, map(float, range(10)) + list("abcdefghij"))
10000 loops, best of 3: 19 us per loop

In [3]: def convert2(data):
....:         try:
....:                 return unicode(data, encoding="utf-8")
....:         except TypeError:
....:                 return unicode(data)
....:         raise ValueError("Unexpected data", data)
....:

In [4]: %timeit map(convert2, map(float, range(10)) + list("abcdefghij"))
10000 loops, best of 3: 40.4 us per loop   
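
The benchmark above mixes floats and byte strings in one list, so it does not show which inputs make convert2 slow. A possible way to separate them is sketched below. It assumes Python 3 (str instead of unicode, timeit instead of the IPython magic), and the bench() helper and sample data are hypothetical; no timings are claimed here.

```python
import timeit

def convert1(data):
    # LBYL: dispatch on the type before converting.
    if isinstance(data, float):
        return str(data)
    if isinstance(data, bytes):
        return str(data, "utf-8")
    raise ValueError("Unexpected data", data)

def convert2(data):
    # EAFP: try to decode first, fall back to plain conversion.
    try:
        return str(data, encoding="utf-8")
    except TypeError:
        return str(data)

floats = [float(i) for i in range(10)]
byte_strings = [bytes([c]) for c in b"abcdefghij"]

def bench(func, items, number=10000):
    # Total seconds for `number` passes over the items.
    return timeit.timeit(lambda: [func(x) for x in items], number=number)

if __name__ == "__main__":
    for label, items in (("floats", floats), ("bytes", byte_strings)):
        print(label, bench(convert1, items), bench(convert2, items))
```

In this sketch every float sent through convert2 raises and catches a TypeError, while byte strings take the direct path, so the two groups may behave quite differently. Note also that convert2 accepts any object through its fallback, whereas convert1 rejects unexpected types with ValueError.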

