[Python-Dev] RE: xreadline speed vs readlines_sizehint

Tim Peters tim.one@home.com
Mon, 15 Jan 2001 05:52:09 -0500


[Mark Favas]
> ...
> The lines range in length from 96 to 747 characters, with
> 11% @ 233, 17% @ 252 and 52% @ 254 characters, so #1 [a vendor
> who actually optimized fgets()] looks promising - most lines are
> long enough to trigger a realloc.

Plus as soon as you spill over the stack buffer, I make you pay for filling
1024 new bytes with newlines before the next fgets() call, and almost all of
those are irrelevant to you.  It doesn't degrade gracefully.  Alas, I tried
several "adaptive" schemes (adjusting how much of the initial segment of a
larger stack buffer they would use, based on the actual line lengths seen in
the past), but the costs always exceeded the savings on my box.
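
A rough cost model of that spill, as a sketch only. It assumes the
initial buffer is also prefilled with newlines and that every overflow adds
one freshly filled 1024-byte extension. The 200 and 260 are the INITBUFSIZE
values quoted below. The function name and the 255-byte line (254 chars plus
a newline) are made up for illustration.

```python
def bytes_prefilled(line_len, init_buf=200, grow=1024):
    """Bytes set to newlines while reading one line of line_len bytes.

    line_len counts the trailing newline.  Assumption: the initial buffer
    is prefilled, and each overflow adds `grow` freshly prefilled bytes.
    """
    filled = init_buf
    capacity = init_buf - 1      # fgets() reserves one byte for the NUL
    while line_len > capacity:
        filled += grow           # pay for a whole new extension
        capacity += grow
    return filled


for size in (200, 260):
    print(size, bytes_prefilled(255, init_buf=size))
```

Under those assumptions a 255-byte line costs 1224 filled bytes with a
200-byte initial buffer and only 260 with a 260-byte one, which is the cliff
described above.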

> Cranking up INITBUFSIZE in ms_getline_hack to 260 from 200
> improves things again, by another 25%:
> total 131426612 chars and 514216 lines
> count_chars_lines     5.081  5.066
> readlines_sizehint    3.743  3.717
> using_fileinput      11.113 11.100
> while_readline        6.100  6.083
> for_xreadlines        3.027  3.033

Well, I couldn't let you forgo *all* of 25%.  The current fileobject.c has
a stack buffer of 300 bytes, but only uses 100 of them on the first fgets()
call.  On a very quiet machine, that saved 3-4% of the runtime on *my* test
case, whose line lengths are typical of the text files I crunch over, so I'm
happy for me.  If 100 bytes aren't enough, it must call fgets() again, but
that call just appends into the rest of the same 300-byte buffer.  So it
saves the realloc for lines under 300 chars.
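
The two-stage scheme can be modeled in Python, purely as an illustration
and not as the code in fileobject.c.  Here readline(n - 1) on a binary
stream stands in for fgets(buf, n, fp), since both stop right after a
newline or after n - 1 bytes.  The names getline_model, FIRST_CHUNK and
HEAP_CHUNK are hypothetical, and the 1024-byte heap step is an assumption.

```python
import io

STACK_BUF = 300    # total stack buffer size mentioned in the post
FIRST_CHUNK = 100  # portion of it offered to the first fgets() call
HEAP_CHUNK = 1024  # assumed growth step once the stack buffer overflows


def getline_model(stream):
    """Return (line, fgets_calls, used_heap) for the next line of stream."""
    # Like fgets(buf, n, fp), readline(n - 1) returns at most n - 1 bytes
    # and stops right after a newline.
    line = stream.readline(FIRST_CHUNK - 1)
    calls = 1
    if line.endswith(b"\n") or len(line) < FIRST_CHUNK - 1:
        return line, calls, False        # short line, or end of file

    # The second call appends into the rest of the same stack buffer.
    room = STACK_BUF - len(line) - 1     # keep one byte for the NUL
    more = stream.readline(room)
    calls += 1
    line += more
    if more.endswith(b"\n") or len(more) < room:
        return line, calls, False        # still no realloc needed

    # Only now would the C code have to move to a heap buffer.
    while True:
        more = stream.readline(HEAP_CHUNK)
        calls += 1
        line += more
        if more.endswith(b"\n") or len(more) < HEAP_CHUNK:
            return line, calls, True


data = b"a" * 50 + b"\n" + b"b" * 254 + b"\n" + b"c" * 747 + b"\n"
stream = io.BytesIO(data)
for _ in range(3):
    line, calls, used_heap = getline_model(stream)
    print(len(line), calls, used_heap)
```

In this model a 51-byte line needs one call, a 255-byte line needs two calls
and still stays on the stack, and only the 748-byte line reaches the heap.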

> Apart from the name <grin>, I like ms_getline_hack...

Ya, it's now the non-pejorative getline_via_fgets().  I hate that I became a
grown-up <0.9 wink>.

time-to-pick-wings-off-of-flies-ly y'rs  - tim