[Mailman-i18n] Unicode in headers

Barry A. Warsaw barry@zope.com
Sun, 22 Sep 2002 13:30:20 -0400


>>>>> "MvL" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:

    MvL> You need this argument to specify the encoding of the string
    MvL> *you are passing*, not (primarily) of the resulting
    MvL> Header. Since the argument is a Unicode string and not a byte
    MvL> string, the encoding argument is superfluous.

D'oh, of course you're right, Martin.

    >> My proposal is to do a type check in Header.__str__() so that
    >> if the value of self.encode() returns a unicode string, we will
    >> coerce it to an 8-bit string like so:

    MvL> This is evil. You are losing data without any need.

    MvL> Instead, I propose the following procedure:
    MvL> - if a Unicode argument is passed to Header.__init__ or
    MvL>   Header.append, take the encoding only as a hint. As an
    MvL>   argument to __init__, also record it as the default for
    MvL>   later .append calls.
    MvL> - when encoding the header, encode all Unicode strings with
    MvL>   the hint.  If that fails, encode them as UTF-8.
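
A rough sketch of the procedure Martin describes, written for Python 3
(where str is the Unicode type).  The function name encode_chunk and
the (bytes, charset) return value are hypothetical, chosen only for
illustration; they are not part of the email package:

```python
def encode_chunk(text, hint="us-ascii"):
    """Encode text using the hinted charset, falling back to UTF-8.

    Returns a (encoded_bytes, charset_used) pair so the caller can
    label the header chunk with the charset that was actually used.
    """
    try:
        return text.encode(hint), hint
    except UnicodeError:
        # The hinted charset cannot represent this text, so fall back
        # to UTF-8 rather than dropping or mangling characters.
        return text.encode("utf-8"), "utf-8"
```

With this, encode_chunk("abc") keeps the us-ascii hint, while a string
containing non-ASCII characters under the same hint comes back as UTF-8
with no data lost.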

Alternatively, we could try to provoke a UnicodeError early, at the
__init__ or .append call by doing something like:

    def append(self, s, charset=None):
        # ...
        # Encoding check.  Better to know now whether we'll have an encoding
        # error than when we try to str'ify the header.  Let UnicodeErrors
        # percolate to the caller.
        if _isunicode(s):
            s.encode(str(charset))
        else:
            unicode(s, str(charset))
        self._chunks.append((s, charset))

In other words, the caller is claiming that the string being passed in
is encoded with the given character set (or the default if None is
used).  Fine, let's check that here since it will be easier to debug
if the UnicodeError is raised now, rather than when the Generator
tries to print the message header.
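
To make the early check concrete, here is a minimal self-contained
sketch in Python 3 terms (str is Unicode, bytes is the 8-bit string).
The class SketchHeader and its default_charset parameter are
assumptions made for illustration, not the real email.Header.Header:

```python
class SketchHeader:
    """Toy stand-in for a header that validates chunks on append."""

    def __init__(self, default_charset="us-ascii"):
        self._default = default_charset
        self._chunks = []

    def append(self, s, charset=None):
        if charset is None:
            charset = self._default
        # Encoding check: the result is discarded, we only want any
        # UnicodeError to reach the caller now rather than at output time.
        if isinstance(s, str):
            s.encode(charset)
        else:
            s.decode(charset)
        self._chunks.append((s, charset))
```

Appending "abc" succeeds, whereas appending a string with non-ASCII
characters under the us-ascii default raises UnicodeError at the call
site and leaves the chunk list unchanged.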

I think I could live with that, and will work out a different
algorithm in Mailman.

-Barry