[Python-Dev] Unicode: When Things Get Hairy
Guido van Rossum
guido@python.org
Sat, 11 Mar 2000 07:16:06 -0500
[Moshe discovers that u"a" in "bbba" raises TypeError]
[Marc-Andre]
> > Hmm, this must have been introduced by your contains code...
> > it did work before.
>
> Nope: the string "in" semantics were forever special-cased. Guido beat me
> soundly for trying to change the semantics...
But I believe that Marc-Andre added a special case for Unicode in
PySequence_Contains. I looked for evidence, but the last snapshot that
I actually saved and built before Moshe's code was checked in is from
2/18 and it isn't in there. Yet I believe Marc-Andre. The special
case needs to be added back to string_contains in stringobject.c.
> > The normal action taken by the Unicode and the string
> > code in these mixed type situations is to first
> > convert everything to Unicode and then retry the operation.
> > Strings are interpreted as UTF-8 during this conversion.
>
> Hmmm... PySequence_Contains doesn't do any conversion of the arguments.
> Should it? (Again, it didn't before). If it does, then the order of
> testing for seq_contains and seq_getitem and conversions
Or it could be done this way.
> > Perhaps I should also add a tp_contains slot to the
> > Unicode object which then uses the above API as well.
Yes.
> But that wouldn't help at all for
>
> u"a" in "abbbb"
It could if PySequence_Contains first looked for a string and a
Unicode argument (in either order) and, in that case, converted the
string to Unicode.
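To make that concrete, here is a rough sketch of the check, written in
modern Python rather than as the actual C change. It assumes bytes
stands in for the 8-bit string type and str for Unicode, and it
decodes as UTF-8 as Marc-Andre describes above. The function name
mixed_contains is made up for illustration; it is not an existing API.

```python
def mixed_contains(container, item):
    """Containment test that coerces a mixed bytes/str pair to str first.

    Hypothetical helper: bytes models the 8-bit string type, str models
    Unicode. The bytes operand is decoded as UTF-8, in either order.
    """
    if isinstance(container, bytes) and isinstance(item, str):
        # u"a" in "bbba": the container is the 8-bit string
        container = container.decode("utf-8")
    elif isinstance(container, str) and isinstance(item, bytes):
        # "a" in u"bbba": the item is the 8-bit string
        item = item.decode("utf-8")
    # Same-type operands fall through to the normal "in" semantics.
    return item in container


print(mixed_contains(b"bbba", "a"))
print(mixed_contains("abbbb", b"a"))
```

Without the coercion step, the plain expression "a" in b"bbba" raises
TypeError in today's Python, which is much the same failure Moshe ran
into.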
> PySequence_Contains only dispatches on the container argument :-(
>
> (BTW: I discovered it while contemplating adding a seq_contains (not
> tp_contains) to unicode objects to optimize the searching for a bit.)
You may beat Marc-Andre to it, but I'll have to let him look at the
code anyway -- I'm not sufficiently familiar with the Unicode stuff
myself yet.
BTW, I added a tag "pre-unicode" to the CVS tree, marking the revisions
from before the Unicode changes were made.
--Guido van Rossum (home page: http://www.python.org/~guido/)