possible unicode bug in implicit string concatenation?

Sat Sep 11 03:09:21 EDT 2004

Fahd Khan wrote:
> >>> u'\u12345' u'foo'.encode('ascii')
> 
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in
> position 0: ordinal not in range(128)
> 
> 
> Is this a bug, or is my understanding of how Python works flawed? 

Yes :-) Your understanding is flawed.

> I
> tried tracing it within the interpreter itself but got lost after a
> little while... I'm familiar with the interpreter loop, but not the
> parser, and I suspect this is something to do with implicit string
> concatenation being parsed differently from the explicit version, i.e.
> the explicit version uses the + operator slot, while the implicit
> version does something else. Any ideas?

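During parsing, adjacent string literals are concatenated, and the
result is the same string that + would produce. The method call
then applies to the whole concatenated literal, so the expression
at the top of this message is the same as
u'\u12345foo'.encode('ascii'). That fails because the character
\u1234 cannot be represented in ASCII. Now,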

u'\u12345'+u'foo'.encode('ascii')

is something completely different: concatenation does not happen
during parsing, but only at execution, and .encode('ascii') applies
to u'foo' alone because the method call binds more tightly than +.
The computation of this expression is as follows:

1. u'foo'.encode('ascii') gives 'foo'
2. u'\u12345'+'foo' finds that Unicode and byte strings are to be
   added. This causes the byte string to be coerced to Unicode,
   computing 'foo'.decode(sys.getdefaultencoding())
3. sys.getdefaultencoding() gives 'ascii'
4. 'foo'.decode('ascii') gives u'foo'
5. u'\u12345'+u'foo' gives u'\u12345foo'
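
The difference between the two forms can be seen by asking the
parser for the syntax tree of each expression. The following is only
a sketch, written for Python 3 rather than the Python 2 interpreter
discussed in this thread (an assumption, as are all the variable
names). Python 3 raises TypeError when str and bytes are added
instead of coercing, so the decode step from the list above is
written out explicitly.

```python
import ast

# Parse both spellings without executing them.
implicit = ast.parse(r"u'\u12345' u'foo'.encode('ascii')", mode="eval").body
explicit = ast.parse(r"u'\u12345' + u'foo'.encode('ascii')", mode="eval").body

# Implicit form: the top-level node is the method call, and its
# receiver is a single literal that the parser has already merged.
implicit_kind = type(implicit).__name__
merged = ast.literal_eval(implicit.func.value)

# Explicit form: the top-level node is the + operation, and only its
# right operand is the .encode() call.
explicit_kind = type(explicit).__name__
right_kind = type(explicit.right).__name__

print(implicit_kind, explicit_kind, right_kind)

# The execution steps listed above, with the coercion made explicit.
encoded = u'foo'.encode('ascii')    # byte string
decoded = encoded.decode('ascii')   # back to Unicode
result = u'\u12345' + decoded       # Unicode + Unicode
```

In the implicit form, encode is therefore called on all five
characters, including \u1234, which is why it fails.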

Regards,
Martin