[Python-Dev] Unicode: When Things Get Hairy

M.-A. Lemburg mal@lemburg.com
Sat, 11 Mar 2000 14:32:57 +0100


Guido van Rossum wrote:
> 
> [Moshe discovers that u"a" in "bbba" raises TypeError]
> 
> [Marc-Andre]
> > > Hmm, this must have been introduced by your contains code...
> > > it did work before.
> >
> > Nope: the string "in" semantics were forever special-cased. Guido beat me
> > soundly for trying to change the semantics...
> 
> But I believe that Marc-Andre added a special case for Unicode in
> PySequence_Contains.  I looked for evidence, but the last snapshot that
> I actually saved and built before Moshe's code was checked in is from
> 2/18 and it isn't in there.  Yet I believe Marc-Andre.  The special
> case needs to be added back to string_contains in stringobject.c.

Moshe was right: I had probably not checked the code because
the obvious combinations worked out of the box... the
only combination which doesn't work is "unicode in string".
I'll fix it next week.

BTW, there's a good chance that the string/Unicode integration
is not complete yet: just keep looking for them.

> > > The normal action taken by the Unicode and the string
> > > code in these mixed type situations is to first
> > > convert everything to Unicode and then retry the operation.
> > > Strings are interpreted as UTF-8 during this conversion.
> >
> > Hmmm....PySeqeunce_Contains doesn't do any conversion of the arguments.
> > Should it? (Again, it didn't before). If it does, then the order of
> > testing for seq_contains and seq_getitem and conversions
> 
> Or it could be done this way.
> 
> > > Perhaps I should also add a tp_contains slot to the
> > > Unicode object which then uses the above API as well.
> 
> Yes.
> 
> > But that wouldn't help at all for
> >
> > u"a" in "abbbb"
> 
> It could if PySeqeunce_Contains would first look for a string and a
> unicode argument (in either order) and in that case convert the string
> to unicode.

I think the right way to do
this is to add a special case to seq_contains in the
string implementation. That's how most other auto-coercions
work too.

Instead of raising an error, the implementation would then
delegate the work to PyUnicode_Contains().
 
> > PySequence_Contains only dispatches on the container argument :-(
> >
> > (BTW: I discovered it while contemplating adding a seq_contains (not
> > tp_contains) to unicode objects to optimize the searching for a bit.)
> 
> You may beat Marc-Andre to it, but I'll have to let him look at the
> code anyway -- I'm not sufficiently familiar with the Unicode stuff
> myself yet.

I'll add that one too.
 
BTW, Happy Birthday, Moshe :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/