email header decoding fails

Fri Apr 11 01:56:07 EDT 2008

On Apr 10, 5:18 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> En Thu, 10 Apr 2008 05:45:41 -0300, ZeeGeek <ZeeG... at gmail.com> escribió:
>
> > On Apr 10, 4:31 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> > wrote:
> >> En Wed, 09 Apr 2008 23:12:00 -0300, ZeeGeek <ZeeG... at gmail.com>
> >> escribió:
>
> >> > It seems that the decode_header function in email.Header fails when
> >> > the string is in the following form,
>
> >> > '=?gb2312?Q?=D0=C7=C8=FC?=(revised)'
> >>      An 'encoded-word' that appears within a
> >>      'phrase' MUST be separated from any adjacent 'word', 'text' or
> >>      'special' by 'linear-white-space'.
>
> > Thank you very much, Gabriel.
>
> The above just says "why" decode_header refuses to decode it, and why it's
> not a bug. But if you actually have to deal with those malformed headers,
> some heuristics may help. For example, if you *know* your mails typically
> specify gb2312 encoding, or iso-8859-1, you may look for things that look
> like the example above and "fix" it.

Right now I'm using

    re.sub(r'(=\?([^\?]*\?){3}=)', r' \1 ', orig_string)

to find every encoded-word and put an extra space before and after it.
The whole string then complies with the standard, and decode_header can
decode it properly.
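
For anyone who finds this thread later, here is a minimal sketch of that
workaround as a self-contained script. It assumes Python 3's email.header
module (the thread itself is about the older email.Header), and the helper
names pad_encoded_words and decode_lenient are made up for illustration.
More recent Python versions may already tolerate an encoded-word that
touches adjacent text, in which case the padding is harmless but not needed.

```python
import re
from email.header import decode_header

# Same pattern as in the post: "=?", then three "?"-terminated fields
# (charset, encoding, encoded text), then the closing "=".
ENCODED_WORD = re.compile(r'(=\?([^?]*\?){3}=)')


def pad_encoded_words(raw):
    """Surround every encoded-word with spaces so it stands alone."""
    return ENCODED_WORD.sub(r' \1 ', raw)


def decode_lenient(raw):
    """Decode a header value after padding its encoded-words.

    Hypothetical helper: a heuristic for malformed headers, not a
    replacement for a standards-compliant parser.
    """
    parts = []
    for chunk, charset in decode_header(pad_encoded_words(raw)):
        if isinstance(chunk, bytes):
            # Undeclared charset: assume ASCII and replace anything else.
            chunk = chunk.decode(charset or 'ascii', errors='replace')
        parts.append(chunk)
    return ''.join(parts)


if __name__ == '__main__':
    print(decode_lenient('=?gb2312?Q?=D0=C7=C8=FC?=(revised)'))
```

Note that the padding can leave extra spaces in the decoded result, so you
may want to normalize whitespace afterwards if the exact text matters.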