From web11.forest at tibit.com  Fri Mar 15 21:16:32 2013
From: web11.forest at tibit.com (Forest)
Date: Fri, 15 Mar 2013 13:16:32 -0700
Subject: [Email-SIG] BytesFeedParser.close().get_boundary() returns string; I want bytes
Message-ID:

Hi there.

I'm planning to write an email stream indexer that locates the byte offsets
of each MIME body-part, sub-part, preamble, epilogue, etc. and avoids pulling
an entire message into memory.  (The existing email package doesn't seem to
offer this functionality.)  I will most likely use BytesFeedParser to parse
message headers.

I just discovered that the Message object produced by BytesFeedParser returns
a string from get_boundary().  I expected it to return bytes, because my input
is bytes and I will therefore have to compare each boundary with bytes while
indexing.

I can convert the string to bytes using the ascii codec, but I thought I'd
raise the issue here in case the current behavior is a bug.  Considering the
restrictions that rfc 2046 places on boundary characters and its requirement
to respect ancestor boundary markers when parsing nested messages, I'm
struggling to think of a situation where the current behavior is useful.
Shouldn't get_boundary() return something that can be found within the input
data?


From rdmurray at bitdance.com  Fri Mar 15 23:34:54 2013
From: rdmurray at bitdance.com (R. David Murray)
Date: Fri, 15 Mar 2013 18:34:54 -0400
Subject: [Email-SIG] BytesFeedParser.close().get_boundary() returns string; I want bytes
In-Reply-To:
References:
Message-ID: <20130315223454.BFE59250BE2@webabinitio.net>

On Fri, 15 Mar 2013 13:16:32 -0700, Forest wrote:
> ascii codec, but I thought I'd raise the issue here in case the current
> behavior is a bug.  Considering the restrictions that rfc 2046 places on
> boundary characters and its requirement to respect ancestor boundary markers
> when parsing nested messages, I'm struggling to think of a situation where
> the current behavior is useful.  Shouldn't get_boundary() return something
> that can be found within the input data?

Well, you have to understand that the email package was written when
Python didn't make any distinction between bytes and strings.  What
email in Python3 is doing is transforming the input into string (unicode)
right away, and carrying any non-ascii bytes along until it has parsed
enough information from the message to recover them and convert them
into real unicode.

BytesParser is parsing bytes input and *turning it into unicode*.  The
model is the same regardless of whether the input is bytes or already
string.  get_boundary is a method on the model (the Message) and is thus
retrieving a string from the model and returning it.

That said, we have discussed adding methods for accessing the binary
form in various contexts.  We have also discussed providing a stream
version of message parsing and generation, and at a minimum a way to
store message bodies externally (eg in a file).  I've got these as
development goals and welcome help in doing so.

--David
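
For reference, the ascii-codec workaround mentioned in the first message
might look roughly like the sketch below.  It is only an illustration, not
something either poster wrote: the sample message, the boundary value
"frontier", and the variable names are made up, and it assumes the boundary
contains only the ASCII characters that rfc 2046 permits.

```python
from email.feedparser import BytesFeedParser

# Hypothetical sample message; the boundary "frontier" is invented.
raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="frontier"\r\n'
    b"\r\n"
    b"preamble\r\n"
    b"--frontier\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"first part\r\n"
    b"--frontier--\r\n"
)

# Bytes go in, but the resulting Message model holds unicode strings.
parser = BytesFeedParser()
parser.feed(raw)
msg = parser.close()

boundary = msg.get_boundary()
print(type(boundary))  # a str, even though the input was bytes

# Assumption: the boundary is pure ASCII, so this encode cannot fail.
delimiter = b"--" + boundary.encode("ascii")

# Byte offset of the first delimiter line in the original input.
offset = raw.find(delimiter)
print(offset, raw[offset:offset + len(delimiter)])
```

An indexer that wants to avoid loading whole messages would feed only the
header block to the parser and then scan the rest of the stream for
`delimiter` itself; the whole message is fed here just to keep the sketch
short.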