[Python-3000] encoding hell

Talin talin at acm.org
Mon Sep 4 01:04:34 CEST 2006


Anders J. Munch wrote:

> Watch out!  There's an essential difference between files and
> bidirectional communications channels that you need to take into
> account.  For a TCP connection, input and output can be seen as
> isolated from one another, with each their own stream position, and
> each their own contents.  For read/write files, it's a whole different
> ballgame, because stream position and data are shared.
> 
> That means you cannot use the same buffering code for both cases.  For
> files, whenever you write something, you need to take into account
> that that may overlap your read buffer or change read position.  You
> should take another look at layer.BufferingLayer with that in mind.
> 
> regards, Anders

This is a better explanation of some of the comments I was raising 
earlier: The choice of buffering strategy depends on a number of factors 
related to how the stream is going to be used, as well as the internal 
implementation of the stream. A buffering strategy that works well for a 
socket won't work very well for a DBMS.

When I stated earlier that 'the OS can do a better job of buffering than 
we can', I actually meant something broader: each layer is, in many 
cases, a better judge of what *kind* of buffering it needs than the 
person assembling the layers.

This doesn't mean that each layer has to implement its own buffering 
algorithm. The common buffering algorithms can be factored out into 
their own objects -- but what I'd suggest is that the choice of 
buffering algorithm not *normally* be exposed to the person 
constructing the I/O stack.

Thus, when creating a standard "line reader", instead of having the user 
call:

	fh = TextReader( Buffer( File( ... ) ) )

let the TextReader choose the kind of buffer it wants and supply that 
part itself (a rough sketch follows the list below). There are several 
reasons why I think this would work better:

1) You can't stick just any buffer object in the middle there and 
expect it to work. Different buffering strategies have different 
interfaces, and trying to meld them all into one uber-interface would 
leave you with something very complex.

2) The TextReader knows perfectly well what kind of buffer it needs. 
Depending on how TextReader is implemented, it might want a serial, 
read-only buffer that allows a limited degree of look-ahead buffering so 
that it can find the line breaks. Or it might want a pair of buffers - 
one decoded, one encoded. There's no way that the user can know what 
kind of buffer to use without knowing the implementation details of 
TextReader.

3) TextReader can be optimized even more if it is allowed to 'peek' 
inside the internals of the buffer - something that would not be 
allowed if it had to conform to calling the buffer through a standard 
interface.
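
To make that concrete, here is a rough sketch of what I have in mind. 
The names (TextReader, LookAheadBuffer) are just illustrative, not a 
proposed API - the only point is that the reader builds its own buffer 
from the raw byte stream it is handed:

    class LookAheadBuffer:
        """Serial, read-only buffer with look-ahead, chosen by the
        reader itself rather than by the caller (illustrative only)."""

        def __init__(self, raw, chunk_size=8192):
            self._raw = raw              # any object with a read(n) method
            self._pending = b""
            self._chunk_size = chunk_size
            self._eof = False

        def peek_until(self, delimiter):
            # Grow the buffer until it contains `delimiter` or the
            # stream ends, without committing to a read position.
            while delimiter not in self._pending and not self._eof:
                chunk = self._raw.read(self._chunk_size)
                if not chunk:
                    self._eof = True
                self._pending += chunk
            return self._pending

        def consume(self, n):
            data, self._pending = self._pending[:n], self._pending[n:]
            return data


    class TextReader:
        """Line reader that supplies its own buffering layer."""

        def __init__(self, raw, encoding="utf-8"):
            # The caller hands over only the raw byte stream; TextReader
            # decides that a serial look-ahead buffer is the right fit.
            self._buf = LookAheadBuffer(raw)
            self._encoding = encoding

        def readline(self):
            pending = self._buf.peek_until(b"\n")
            if not pending:
                return ""                 # end of stream
            i = pending.find(b"\n")
            n = i + 1 if i >= 0 else len(pending)
            return self._buf.consume(n).decode(self._encoding)

With that in place, the user-facing call collapses to something like 
fh = TextReader(open('somefile', 'rb')), and the buffering is chosen 
entirely behind the scenes.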


More generally, the choice of buffer depends on the usage pattern for 
reading / writing to the file - and that usage pattern is embodied in 
the definition of "TextReader". By creating a "TextReader" object, the 
user is stating their intention to read the file a certain way, in a 
certain order, with certain performance characteristics. The choice of 
buffering derives directly from those usage patterns. So the two go hand 
in hand.
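
As a contrast (again, names made up purely for illustration), a 
hypothetical reader with a random-access usage pattern would want no 
look-ahead buffer at all:

    class RecordReader:
        """Hypothetical fixed-size-record reader: its usage pattern is
        random access, so it chooses *no* look-ahead buffer at all
        (or perhaps a small block cache instead)."""

        def __init__(self, raw, record_size):
            self._raw = raw              # needs seek() and read()
            self._record_size = record_size

        def read_record(self, index):
            # Seek-and-read: the serial look-ahead buffer that suits
            # TextReader would be useless (or actively harmful) here.
            self._raw.seek(index * self._record_size)
            return self._raw.read(self._record_size)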

Now, I'm not saying that you can't stick additional layers in between 
TextReader and FileStream if you want to. An example might be the 
"resync" layer that you mentioned, or a journaling layer that ensures 
that all writes are recoverable. I'm merely saying that for the 
specific issue of buffering, the choice of buffer type is complicated, 
and requires knowledge that might not be accessible to the person 
assembling the stack.
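
For example (made-up names again), such an intermediate layer only has 
to present the same raw read/write surface it wraps; TextReader still 
supplies its own buffering on top of whatever stack it is given:

    class JournalingLayer:
        """Hypothetical pass-through layer: records every write to a
        journal so it can be replayed, and otherwise exposes the same
        raw read/write surface that it wraps."""

        def __init__(self, raw, journal):
            self._raw = raw
            self._journal = journal

        def read(self, n=-1):
            return self._raw.read(n)

        def write(self, data):
            self._journal.write(data)    # record first, for recovery
            self._journal.flush()
            return self._raw.write(data)

    # TextReader neither knows nor cares that the extra layer is there;
    # it still supplies its own buffering on top of the stack:
    #
    #     fh = TextReader(JournalingLayer(FileStream( ... ), journal))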

-- Talin

