[Mailman-i18n] Unicode in headers
Martin von Loewis
loewis@informatik.hu-berlin.de
Sat, 21 Sep 2002 23:08:36 +0200 (CEST)
> from email.Header import Header
> h = Header(u'[P\xf6stal]', 'us-ascii')
> s = str(h)
[...]
> But I think this may not be the right thing to do. For one thing,
> we're saying we want the header to be in the us-ascii character set.
I think you are confusing two issues here: You are *not* saying that you
want the header to be in us-ascii. Instead, to quote the docstring:
    Specify both s's character set, and the default character set by
    setting the charset argument to a Charset object
You need this argument to specify the encoding of the string *you are
passing*, not (primarily) of the resulting Header. Since the argument
here is a Unicode string and not a byte string, there are no bytes to
decode, so the encoding argument is superfluous.
Now, the documentation also says that it uses the argument as the "default
character set". By that, it does *not* mean that the entire header is going
to be encoded in that encoding. Instead, it means that this value is used
if later append calls do not declare an encoding.
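A minimal sketch of the default-charset behaviour follows. It is written against the Python 3 spelling of the module (email.header rather than email.Header), so details may differ from the release under discussion, and the sample strings are made up for illustration. The second chunk is appended without a charset and so picks up the default given to the constructor; the third declares its own.

```python
from email.header import Header, decode_header

# The constructor argument doubles as the default for later append() calls.
h = Header('Hello', 'iso-8859-1')

# No charset given: the default (iso-8859-1) is used for this chunk.
h.append('W\xf6rld')

# Explicit charset: applies to this chunk only, not to the whole header.
h.append('\u4e16\u754c', 'utf-8')

encoded = h.encode()
print(encoded)

# Collect the charset names that ended up in the encoded header.
charsets = [charset for _, charset in decode_header(encoded)]
print(charsets)
```

The charset names recovered by decode_header should include both iso-8859-1 and utf-8, which shows the constructor argument acting as a per-chunk default rather than as the encoding of the entire header.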
> My proposal is to do a type check in Header.__str__() so that if the
> value of self.encode() returns a unicode string, we will coerce it to
> an 8-bit string like so:
This is evil. You are losing data without any need.
Instead, I propose the following procedure:
- if a Unicode argument is passed to Header.__init__ or Header.append,
  take the encoding only as a hint. As an argument to __init__, also
  record it as the default for later .append calls.
- when encoding the header, encode all Unicode strings with the hint.
  If that fails, encode them as UTF-8.
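The procedure above can be sketched roughly as follows. This is an illustration only, not the email package's code: HintedHeader, encode_with_hint and encoded_chunks are hypothetical names, and the sketch stops at producing bytes plus a charset name, leaving out the RFC 2047 wrapping a real header would need.

```python
def encode_with_hint(text, hint='us-ascii'):
    """Encode text with the hinted charset; fall back to UTF-8 on failure."""
    try:
        return text.encode(hint), hint
    except UnicodeEncodeError:
        # The hint cannot represent the text, so use UTF-8 and lose nothing.
        return text.encode('utf-8'), 'utf-8'


class HintedHeader:
    """Hypothetical stand-in for Header, treating the charset as a hint."""

    def __init__(self, text=None, charset='us-ascii'):
        # Record the constructor's charset as the default for later appends.
        self.default_charset = charset
        self.chunks = []
        if text is not None:
            self.append(text)

    def append(self, text, charset=None):
        # Store the Unicode text untouched, together with its hint.
        self.chunks.append((text, charset or self.default_charset))

    def encoded_chunks(self):
        # Encoding happens only here, one chunk at a time.
        return [encode_with_hint(text, hint) for text, hint in self.chunks]


h = HintedHeader('[P\xf6stal]', 'us-ascii')
h.append('list')  # no charset given, so the us-ascii default applies
result = h.encoded_chunks()
print(result)
```

With these inputs the first chunk cannot be represented in us-ascii and is emitted as UTF-8, while the second stays in us-ascii; in neither case is any character dropped or replaced.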
Regards,
Martin