From web11.forest at tibit.com  Fri Mar 15 21:16:32 2013
From: web11.forest at tibit.com (Forest)
Date: Fri, 15 Mar 2013 13:16:32 -0700
Subject: [Email-SIG] BytesFeedParser.close().get_boundary() returns string;
	I want bytes
Message-ID: <lts6k89br7f0hihk4ce4juoeehak6srm5m@4ax.com>

Hi there.

I'm planning to write an email stream indexer that locates the byte offsets
of each MIME body-part, sub-part, preamble, epilogue, etc. and avoids
pulling an entire message into memory.  (The existing email package doesn't
seem to offer this functionality.)  I will most likely use BytesFeedParser
to parse message headers.

I just discovered that the Message object produced by BytesFeedParser
returns a string from get_boundary().  I expected it to return bytes,
because my input is bytes and I will therefore have to compare each boundary
with bytes while indexing.  I can convert the string to bytes using the
ASCII codec, but I thought I'd raise the issue here in case the current
behavior is a bug.  Considering the restrictions that RFC 2046 places on
boundary characters and its requirement to respect ancestor boundary markers
when parsing nested messages, I'm struggling to think of a situation where
the current behavior is useful.  Shouldn't get_boundary() return something
that can be found within the input data?
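
A minimal sketch of the workaround described above: parse with
BytesFeedParser, take the str boundary, and encode it with the ASCII
codec so it can be searched for in the raw input.  The sample message
and the names raw, delimiter, and first_offset are made up for
illustration; a real indexer would feed data incrementally rather than
hold the whole message in memory.

```python
from email.parser import BytesFeedParser

# Hypothetical sample message; the boundary "frontier" is made up.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="frontier"\r\n'
    b"\r\n"
    b"preamble\r\n"
    b"--frontier\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--frontier--\r\n"
)

parser = BytesFeedParser()
parser.feed(raw)
msg = parser.close()

# get_boundary() returns str even though the input was bytes.
boundary = msg.get_boundary()

# RFC 2046 restricts boundary characters to a subset of ASCII,
# so encoding with the ASCII codec is expected to be lossless.
boundary_bytes = boundary.encode("ascii")

# A boundary delimiter line starts with "--" followed by the boundary.
delimiter = b"--" + boundary_bytes
first_offset = raw.find(delimiter)
```

Here first_offset is the byte offset of the first delimiter line in
the raw input, which is the kind of value the indexer would record.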

From rdmurray at bitdance.com  Fri Mar 15 23:34:54 2013
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 15 Mar 2013 18:34:54 -0400
Subject: [Email-SIG] BytesFeedParser.close().get_boundary() returns
	string; I want bytes
In-Reply-To: <lts6k89br7f0hihk4ce4juoeehak6srm5m@4ax.com>
References: <lts6k89br7f0hihk4ce4juoeehak6srm5m@4ax.com>
Message-ID: <20130315223454.BFE59250BE2@webabinitio.net>

On Fri, 15 Mar 2013 13:16:32 -0700, Forest <web11.forest at tibit.com> wrote:
> ASCII codec, but I thought I'd raise the issue here in case the current
> behavior is a bug.  Considering the restrictions that RFC 2046 places on
> boundary characters and its requirement to respect ancestor boundary markers
> when parsing nested messages, I'm struggling to think of a situation where
> the current behavior is useful.  Shouldn't get_boundary() return something
> that can be found within the input data?

Well, you have to understand that the email package was written
when Python didn't make any distinction between bytes and strings.
What email in Python 3 does is transform the input into a
string (unicode) right away, carrying any non-ASCII
bytes along until it has parsed enough information from the message
to recover them and convert them into real unicode.
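
For illustration, a minimal sketch of how undecodable bytes can ride
along inside a str.  It assumes the surrogateescape error handler is
the mechanism, which matches what CPython's BytesFeedParser.feed does
but is not stated in this message; the header line is a made-up
sample.

```python
# Hypothetical header line containing one non-ASCII byte (0xE9).
raw_line = b"Subject: caf\xe9\r\n"

# Decode to str without losing the byte: 0xE9 becomes the lone
# surrogate U+DCE9, which stands in for the original byte.
text = raw_line.decode("ascii", "surrogateescape")

# The ASCII part is ordinary text and can be parsed as usual.
name, _, value = text.partition(":")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", "surrogateescape")
```

Once the charset is known, the recovered bytes can be decoded
properly instead of being left as surrogates.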

BytesParser parses bytes input and *turns it into
unicode*.  The model is the same regardless of whether
the input is bytes or already a string.  get_boundary
is a method on the model (the Message) and thus
retrieves a string from the model and returns it.

That said, we have discussed adding methods for accessing the
binary form in various contexts.  We have also discussed
providing a stream version of message parsing and generation,
and at a minimum a way to store message bodies externally
(e.g., in a file).  I've got these as development goals and
welcome help in achieving them.

--David