[Python-Dev] Mailbox module - timings and functionality changes

Steve Holden steve at holdenweb.com
Tue Jun 29 19:54:09 CEST 2010


A.M. Kuchling wrote:
> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>> I will leave the profiler output to speak for itself, since I can find
>> nothing much to say about it except that there's a hell of a lot of
>> decoding going on inside mailbox.iterkeys().
> 
> The problem is actually in _generate_toc(), which is reading through
> the entire file to figure out where all the 'From' lines that start
> messages are located.  TextIOWrapper()'s tell() method seems to be
> very slow, so one help is to only call tell() when necessary; patch:
> 
> -> svn diff Lib/
> Index: Lib/mailbox.py
> ===================================================================
> --- Lib/mailbox.py	(revision 82346)
> +++ Lib/mailbox.py	(working copy)
> @@ -775,13 +775,14 @@
>          starts, stops = [], []
>          self._file.seek(0)
>          while True:
> -            line_pos = self._file.tell()
>              line = self._file.readline()
>              if line.startswith('From '):
> +                line_pos = self._file.tell()
>                  if len(stops) < len(starts):
>                      stops.append(line_pos - len(os.linesep))
>                  starts.append(line_pos)
>              elif not line:
> +                line_pos = self._file.tell()
>                  stops.append(line_pos)
>                  break
>          self._toc = dict(enumerate(zip(starts, stops)))
> 
> But should mailboxes really be opened in a UTF-8 encoding, or should
> they be treated as 7-bit text?  I'll have to think about this.

Neither! You can't open them as 7-bit text, because real-world email
does contain bytes whose ordinal value exceeds 127. You can't open them
with a single text encoding either, because ASCII headers may declare
that different parts of the content use different character sets or
encodings.
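
To make that concrete, here is a small sketch. The header line is an
invented sample, not taken from any real mailbox: a single byte above
127 is enough to defeat both a strict ASCII decode and a UTF-8 decode,
even though the bytes are perfectly good Latin-1.

```python
# Hypothetical header line containing one Latin-1 byte (0xE9, "e acute").
raw = b"Subject: caf\xe9\n"

# Strict 7-bit decoding fails on any byte above 127.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# UTF-8 also fails: 0xE9 starts a multi-byte sequence that never completes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The same bytes decode cleanly once the right charset is known.
latin1_text = raw.decode("latin-1")

print(ascii_ok, utf8_ok, repr(latin1_text))
```

Which charset is "right" can differ from one message part to the next,
so no single encoding chosen at open() time can be correct for the
whole file.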

If only we had a data structure that easily allowed us to manipulate
8-bit characters ...
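
A minimal sketch of what that could look like, assuming the mailbox is
opened in binary mode. The function name generate_toc and its linesep
parameter are illustrative only, not the actual mailbox.py code; the
loop simply mirrors the logic of the quoted _generate_toc() while
working on bytes and keeping its own running offset instead of calling
tell().

```python
import io
import os


def generate_toc(binary_file, linesep=os.linesep.encode("ascii")):
    """Map message index -> (start, stop) byte offsets in an mbox-style file.

    binary_file must be opened in binary mode, so len(line) is an exact
    byte count and no decoding takes place.
    """
    starts, stops = [], []
    binary_file.seek(0)
    pos = 0  # running byte offset, maintained by hand
    while True:
        line_pos = pos  # offset of the line about to be read
        line = binary_file.readline()
        pos += len(line)
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                # The previous message ends before the separator line.
                stops.append(line_pos - len(linesep))
            starts.append(line_pos)
        elif not line:
            # End of file closes the last message.
            stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))


# Usage with an in-memory sample that includes a byte above 127.
sample = (
    b"From a@example.com Tue Jun 29 19:54:09 2010\n"
    b"Subject: caf\xe9\n\nfirst body\n\n"
    b"From b@example.com Tue Jun 29 19:55:00 2010\n"
    b"\nsecond body\n"
)
toc = generate_toc(io.BytesIO(sample), linesep=b"\n")
print(toc)
```

Because nothing is decoded, the non-ASCII byte in the sample is simply
carried along, and the offsets are plain integers that can be handed
straight back to seek().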

regards
 Steve