[Python-Dev] Mailbox module - timings and functionality changes

Tue Jun 29 21:26:31 CEST 2010

The mailbox file should probably be opened in binary mode. Binary
files do have a .readline() method (returning a bytes object), and
bytes objects have a .startswith() method. The tell positions computed
this way are even compatible with those used by the text file. So you
could do it this way:

- open binary stream
- compute TOC by reading through it using .readline() and .tell()
- rewind (don't close)
- wrap the binary stream in a text stream
- use that for the rest of the code
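
A minimal sketch of those steps, not the actual mailbox.py code. The
names build_toc and open_indexed are hypothetical, the latin-1 default
is only an assumption (the right encoding is the open question in this
thread), and for simplicity each message is taken to end where the next
'From ' line starts rather than before the preceding blank line:

```python
import io


def build_toc(binary_file):
    """Map message index -> (start, stop) byte offsets in an mbox stream."""
    starts, stops = [], []
    binary_file.seek(0)
    while True:
        line_pos = binary_file.tell()   # offset of the line about to be read
        line = binary_file.readline()   # a bytes object; nothing is decoded
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                # The previous message ends where this one starts.
                stops.append(line_pos)
            starts.append(line_pos)
        elif not line:                  # empty bytes means end of file
            stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))


def open_indexed(binary_file, encoding="latin-1"):
    """Build the TOC on the binary stream, then wrap it as text."""
    toc = build_toc(binary_file)
    binary_file.seek(0)                 # rewind (don't close)
    # newline="" disables newline translation on the text layer.
    text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    return toc, text_file
```

With a single-byte encoding such as latin-1, the byte offsets in the TOC
can be passed to the text stream's seek() directly; other encodings may
need more care.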

--Guido

On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <steve at holdenweb.com> wrote:
> A.M. Kuchling wrote:
>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>> I will leave the profiler output to speak for itself, since I can find
>>> nothing much to say about it except that there's a hell of a lot of
>>> decoding going on inside mailbox.iterkeys().
>>
>> The problem is actually in _generate_toc(), which is reading through
>> the entire file to figure out where all the 'From' lines that start
>> messages are located.  TextIOWrapper()'s tell() method seems to be
>> very slow, so one improvement is to call tell() only when necessary; patch:
>>
>> -> svn diff Lib/
>> Index: Lib/mailbox.py
>> ===================================================================
>> --- Lib/mailbox.py    (revision 82346)
>> +++ Lib/mailbox.py    (working copy)
>> @@ -775,13 +775,14 @@
>>          starts, stops = [], []
>>          self._file.seek(0)
>>          while True:
>> -            line_pos = self._file.tell()
>>              line = self._file.readline()
>>              if line.startswith('From '):
>> +                line_pos = self._file.tell()
>>                  if len(stops) < len(starts):
>>                      stops.append(line_pos - len(os.linesep))
>>                  starts.append(line_pos)
>>              elif not line:
>> +                line_pos = self._file.tell()
>>                  stops.append(line_pos)
>>                  break
>>          self._toc = dict(enumerate(zip(starts, stops)))
>>
>> But should mailboxes really be opened in a UTF-8 encoding, or should
>> they be treated as 7-bit text?  I'll have to think about this.
>
> Neither! You can't open them as 7-bit text, because real-world email
> does contain bytes whose ordinal value exceeds 127. You can't open them
> using a text encoding because theoretically there might be ASCII headers
> that indicate that parts of the content are in specific character sets
> or encodings.
>
> If only we had a data structure that easily allowed us to manipulate
> 8-bit characters ...
>
> regards
>  Steve
>

-- 
--Guido van Rossum (python.org/~guido)