[Mailman-Users] Scrubbing charset-unspecified text

Tue May 2 17:46:26 CEST 2006

Roger Lynn wrote:
>
>I'm running Mailman 2.1.7, packaged for Debian (although I don't think
>that's relevant to this question). A list that I administer has non-digest
>scrubbing enabled. An email was recently sent to it with the following headers:
>
>Content-Type: text/plain
>Content-Disposition: inline
>MIME-Version: 1.0
>X-Mailer: MIME-tools 5.411 (Entity 5.404)
>Date: Mon, 01 May 2006 18:47:30 +0100
>Subject: [...]
>To: [...]
>From: [...]
>X-Mailer: SINA Webmail 6.00.
>Reply-To: [...]
>X-Sina-Mail-Agent: sinadeliver-6.00-1.97
>Message-Id: [...]
>X-Virus-Scanned: by myinternet myAV on ngflrtr1
>Content-Transfer-Encoding: quoted-printable

This looks like a malformed message. The issue is the

Content-Disposition: inline

which should only appear in sub-part headers, not in the message
headers.

>This resulted in the contents of the email being replaced with:
>
>An embedded and charset-unspecified text was scrubbed...
>Name: not available
>Url: http://[...]/attachments/20060501/aad799ed/attachment.ksh
>
>Why is it necessary to scrub plain text in this instance, when no character
>set is specified? Couldn't it just be assumed that it is us-ascii?

It is a bug, or at least insufficiently robust code. We shouldn't
rely on the Content-Disposition: header to decide that something is a
sub-part.

>If I were to comment out the following code from process() in Scrubber.py,
>would there be any consequences other than allowing messages like the above
>through to the list?

Yes. The consequence is that you could get a message which contained an
actual "charset-unspecified text" attachment with an actual character
set different from that of the first text/plain part and then these
two parts with perhaps incompatible character sets would be
'flattened' together into one part.
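
To illustrate why flattening is risky, here is a minimal sketch using
plain byte strings rather than Mailman's own code. The sample words
and variable names are made up for the example, and it is written for
a current Python 3 rather than the Python that Mailman 2.1 runs on.

```python
# Two text parts whose bytes are in different character sets.
part_one = 'caf\u00e9\n'.encode('iso-8859-1')   # "cafe" with e-acute, Latin-1
part_two = 'na\u00efve\n'.encode('utf-8')       # "naive" with i-diaeresis, UTF-8

# "Flattening" just joins the bytes into one part with one charset.
flattened = part_one + part_two

# Treating the result as UTF-8 fails on the Latin-1 byte.
try:
    flattened.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Treating it as Latin-1 "works" but garbles the UTF-8 half.
as_latin1 = flattened.decode('iso-8859-1')
```

Neither single charset label is right for the joined part, which is
why the scrubber sets such a part aside instead of merging it.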

Here is a suggested change to the code you quoted.

Replace

            if part.get('content-disposition') and \
               not part.get_content_charset():
                omask = os.umask(002)

with

            if part.get('content-disposition') and \
               msg.is_multipart() and \
               not part.get_content_charset():
                omask = os.umask(002)

This is not really a proper fix, but I think it will avoid the problem
in your case.
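
The effect of the extra test can be checked outside Mailman with the
standard email package. This is only a sketch: the helper names and
the two sample messages are invented for the illustration, and it
targets a current Python 3 rather than the Python 2 that Mailman 2.1
uses.

```python
import email

def old_test(msg, part):
    # The original condition quoted above.
    return bool(part.get('content-disposition')
                and not part.get_content_charset())

def new_test(msg, part):
    # The suggested condition: only applies inside a multipart message.
    return bool(part.get('content-disposition')
                and msg.is_multipart()
                and not part.get_content_charset())

# A single-part message like the one reported (headers abbreviated).
single = email.message_from_string(
    "Content-Type: text/plain\n"
    "Content-Disposition: inline\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Hello list\n")

# A multipart message with a real charset-unspecified attachment.
multi = email.message_from_string(
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "body\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "Content-Disposition: attachment\n"
    "\n"
    "attached text\n"
    "--XX--\n")

single_old = [old_test(single, p) for p in single.walk()]
single_new = [new_test(single, p) for p in single.walk()]
multi_new = [new_test(multi, p) for p in multi.walk()]
```

With the old test the single-part message is treated as an attachment;
with the new test it is not, while the real attachment in the
multipart message is still caught.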

>
>Incidentally, why does the attachment have the suffix ".ksh"? It seems
>rather unusual. I'm using the following settings:
>
>SCRUBBER_DONT_USE_ATTACHMENT_FILENAME = False
>SCRUBBER_USE_ATTACHMENT_FILENAME_EXTENSION = True

There is no 'filename' in what we mistakenly think is an attachment, so
we guess the extension based on the Content-Type: which is text/plain.

We effectively use the Python library call

mimetypes.guess_all_extensions('text/plain', strict=False)

which returns this list

['.ksh', '.asc', '.h', '.c', '.txt']

and we pick the first one.
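
The call is easy to try on its own. The list it returns depends on the
Python version and on any system mime.types files, so the order shown
above should not be assumed elsewhere. The sketch below also includes
a hypothetical helper, pick_extension, which is not part of Mailman
and only shows one way to avoid depending on list order.

```python
import mimetypes

def pick_extension(candidates, preferred=('.txt',), default='.bin'):
    # Hypothetical helper: use a preferred extension when it is offered,
    # otherwise fall back to the first candidate, or a default if none.
    for ext in preferred:
        if ext in candidates:
            return ext
    return candidates[0] if candidates else default

# The list reported in this message.
reported = ['.ksh', '.asc', '.h', '.c', '.txt']

first_choice = reported[0]                   # what "pick the first one" gives
preferred_choice = pick_extension(reported)  # order-independent choice

# What this particular Python installation returns; contents may differ.
live = mimetypes.guess_all_extensions('text/plain', strict=False)
```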

-- 
Mark Sapiro <msapiro at value.net>       The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan