[Mailman-i18n] Unicode in headers

Sat, 21 Sep 2002 13:00:43 -0400

I've been trying to fix the outstanding problems with "funny"
characters in real names in Mailman[*] and along the way I ran into a
situation that I /think/ needs to be addressed in the email package.
I'm not sure this is a good fix, let alone the right fix so I wanted
to get some feedback from these two mailing lists.

Say I create a Header instance like so:

    from email.Header import Header
    h = Header(u'[P\xf6stal]', 'us-ascii')
    s = str(h)

what would you expect the value of s to be?

It's a bit of a trick question because in the current version, the
str(h) will raise a UnicodeError since the h.encode() will be a
unicode string containing non-ascii characters.

But I think this may not be the right thing to do.  For one thing,
we're saying we want the header to be in the us-ascii character set.
For another, the RFCs state that headers need to be ascii characters
and we should encode them if necessary.  OTOH, what we're doing /is/ a
bit bogus since the value is clearly not in the requested character
set.  But OTOOH, I don't think we should have to check the value
and do a bunch of coercion before we create the Header instance.

My proposal is to do a type check in Header.__str__() so that if the
value of self.encode() returns a unicode string, we will coerce it to
an 8-bit string like so:

    def __str__(self):
        """A synonym for self.encode().
        Guarantees that the return value contains only ASCII characters.
        """
        s = self.encode()
        if isinstance(s, type(u'')):
            return s.encode(str(self._charset), 'replace')
        return s

Here's a new test case that fails without this change, but succeeds
with it (with no regressions).

    def test_unicode_value(self):
        eq = self.assertEqual
        v = u'[P\xf6stal]'
        h = Header(v, 'us-ascii')
        eq(str(h), '[P?stal]')

In the view of doing what's most useful, I'd like to make this
change, but I still don't trust my judgement about things unicode, so
I'd like to get some other opinions.

If we don't do this, then we'll probably have to add some defense in
Generator._write_headers(), which wants to do

            text = '%s: %s' % (h, v)

That'll raise the UnicodeError in this situation, and because this can
be fairly widely removed from what might be considered the real error,
it's difficult to debug.

-Barry

[*] BTW, Martin, Ben, Tokio and others have been very helpful here.
Thanks!  And I hope to have fixes in place soon.