Problem with sets and Unicode strings

Tue Jun 27 17:59:23 EDT 2006

On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
> Serge Orlov wrote:
> > On 6/27/06, Dennis Benzinger <Dennis.Benzinger at gmx.net> wrote:
> >> Hi!
> >>
> >> The following program in a UTF-8 encoded file:
> >>
> >>
> >> # -*- coding: UTF-8 -*-
> >>
> >> FIELDS = ("Fächer", )
> >> FROZEN_FIELDS = frozenset(FIELDS)
> >> FIELDS_SET = set(FIELDS)
> >>
> >> print u"Fächer" in FROZEN_FIELDS
> >> print u"Fächer" in FIELDS_SET
> >> print u"Fächer" in FIELDS
> >>
> >>
> >> gives this output
> >>
> >>
> >> False
> >> False
> >> Traceback (most recent call last):
> >>    File "test.py", line 9, in ?
> >>      print u"Fächer" in FIELDS
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> >> ordinal not in range(128)
> >>
> >>
> >> Why do the first two print statements succeed and the third one fail
> >> with an exception?
> >
> > Actually, all three statements fail to produce the correct result.
>
> So this is a bug in Python?

No.

> > frozenset remove the exception?
> >
> > Because sets use a hash algorithm to find matches, whereas the last
> > statement directly compares a unicode string with a byte string. Byte
> > strings can only contain ASCII characters; that's why Python raises an
> > exception. The problem is very easy to fix: use unicode strings for
> > all non-ASCII strings.
>
> No, byte strings contain characters which are at least 8 bits wide
> <http://docs.python.org/ref/types.html>.

Yes, but later it says that non-ASCII characters do not have a
universal meaning assigned to them. In other words, if you put byte
0xE4 into a byte string, all Python knows about it is that it's *some*
character. If you put character U+00E4 into a unicode string, Python
knows it's a "latin small letter a with diaeresis". Trying to compare
*some* character with a specific character is obviously undefined.
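
A minimal sketch of the point about byte 0xE4, written for Python 3
(an assumption on my part; this thread is about Python 2, where the
types are named str and unicode instead of bytes and str). The
encodings picked here are only examples:

```python
import unicodedata

raw = b"\xe4"  # a single byte with no encoding attached to it

# The same byte is a different character under different encodings.
as_latin1 = raw.decode("latin-1")
as_cp1251 = raw.decode("cp1251")
print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_cp1251))  # CYRILLIC SMALL LETTER DE

# On its own the byte is not even valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```

Only once an encoding is named does the byte become one specific
character.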

> But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

Python is trying to decode your byte string so that it can compare it
with the unicode string. Because byte strings can come from many
different sources (network, files, etc.), not only from your program's
source code, Python cannot assume all of them are UTF-8. It assumes
they are ASCII, because most widespread text encodings are ASCII-based.
Actually it's a guess, since there are UTF-16, UTF-32 and other
encodings that are not ASCII-based. If you want to experience life
without guesses, put sys.setdefaultencoding("undefined") into site.py.
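
A rough sketch of the failure and the fix, again written for Python 3
as an assumption (Python 2 performs the ASCII decode implicitly during
the comparison, while here it is spelled out). The names raw_fields,
fields and fields_set are made up for the illustration:

```python
raw_fields = (b"F\xc3\xa4cher",)  # the UTF-8 bytes of "Fächer"

# Decoding with the ASCII codec reproduces the error from the traceback.
try:
    raw_fields[0].decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
print(ascii_error)

# Decoding explicitly with the real encoding gives text strings,
# and every membership test then behaves as expected.
fields = tuple(s.decode("utf-8") for s in raw_fields)
fields_set = frozenset(fields)
print("F\u00e4cher" in fields_set)  # True
print("F\u00e4cher" in fields)      # True
```

In the original Python 2 program the equivalent fix is to write the
literal as u"Fächer" in FIELDS, so that no byte string takes part in
the comparison at all.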