Fwd: Understand workflow about reading and writing files in Python

Windson Yang wiwindson at gmail.com
Sun Jun 23 22:12:18 EDT 2019


Thank you so much for your review, DL Neil, it really helps :D. However,
some parts still confuse me, so I have replied inline below.

DL Neil <PythonList at danceswithmice.info> wrote on Wed, Jun 19, 2019 at 2:03 PM:

> I've not gone 'back' to refer to any ComSc theory on buffer-management.
> Perhaps you might benefit from such?
>
I just took a crash course on it, so I want to know whether I understand the
details correctly :D


> I like your use of the word "shift", so I'll continue to use it.
>
> There are three separate units of data to consider - each of which could
> be called a "buffer". To avoid confusing (myself) I'll only call the
> 'middle one' that:
> 1 the unit of data 'coming' from the data-source
> 2 the "buffer" you are implementing
> 3 the unit of data 'going' out to a data-destination.
>
Just to make it clear: when we use `f.write('abc')` in Python, (1) means
'abc', (2) means the buffer handled by Python (8 KB by default), and (3) means
the file *f* we are writing to, right?
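
A minimal sketch of the `f.write('abc')` case, assuming CPython's default buffered binary writer. The file name `demo.bin` and the temporary directory are made up for illustration. The three bytes should sit in Python's buffer until `flush()` hands them to the file:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file, used only for this illustration.
    path = os.path.join(tmp_dir, "demo.bin")

    f = open(path, "wb")                  # BufferedWriter on top of a raw FileIO
    f.write(b"abc")                       # the data handed to write()
    size_before = os.path.getsize(path)   # size on disk while data is buffered
    f.flush()                             # empty Python's buffer into the file
    size_after = os.path.getsize(path)    # size on disk after the flush
    f.close()

print(io.DEFAULT_BUFFER_SIZE)             # fallback buffer size, in bytes
print(size_before, size_after)
```

Note that `open()` may pick a buffer size from the device's block size rather than `io.DEFAULT_BUFFER_SIZE`, so the printed default is only the fallback value.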

> 1 and 3 may be dictated to you, eg hardware or file specifications, code
> requirements, etc.
>
> So, data is shifted into the (2) buffer in a unit-size decided by (1) -
> in most use-cases each incoming unit will be the same size, but remember
> that the last 'unit' may/not be full-size. Similarly, data shifted out
> from the (2) buffer to (3).
>
> The size of (1) is likely not that of (3) - otherwise why use a
> "buffer"? The size of (2) must be larger than (1) and larger than (2) -
> for reasons already illustrated.
>

Is this a typo? (2) larger than (1) larger than (2)?

>
> I recall learning how to use buffers with a series of hand-drawn block
> diagrams. Recommend you try similarly!
>
>
> Now, let's add a few critiques, as requested (interposed below):-
>
>
> On 19/06/19 3:53 PM, Windson Yang wrote:
> > I'm trying to understand the workflow of how Python reads/writes data with
> > a buffer. I would appreciate it if someone could review it.
> >
> > ### Read n data
>
> - may need more than one read operation if the size of (3) "demands"
> more data than the size of (1)/one "read".
>

It looks like the length of one read() depends on
https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1655
?
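
One way to observe this from Python, rather than reading the C source, is to put a logging raw stream under `io.BufferedReader`. This is only a sketch: `LoggingRaw` is a made-up helper class, and the exact request sizes it prints are an implementation detail of the interpreter in use.

```python
import io

class LoggingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of each raw read request."""

    def __init__(self, data):
        super().__init__()
        self._src = io.BytesIO(data)
        self.requests = []

    def readable(self):
        return True

    def readinto(self, b):
        self.requests.append(len(b))   # how many bytes the buffer layer asked for
        return self._src.readinto(b)

raw = LoggingRaw(b"x" * 100)
reader = io.BufferedReader(raw, buffer_size=16)

first = reader.read(4)                 # buffer is empty, so a raw read is needed
calls_after_first = len(raw.requests)
second = reader.read(4)                # can be served from the buffer
calls_after_second = len(raw.requests)

print(raw.requests)                    # sizes requested from the raw stream
```

The second `read(4)` should not add an entry to `raw.requests`, because the bytes it needs were already fetched by the first raw read.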

>
> > 1. If the data is already in the buffer, return the data
>
> - this is a data-transfer of size (3)
>
> For extra credit/an unnecessary complication (but probable speed-up!):
> * if the data-remaining is less than size (3) consider a read-ahead
> mechanism
>
> > 2. If the data is not in the buffer:
>
> - if buffer's data-len < size (3)
>
> >      1. copy all the current data from the buffer
>
> * if "buffer" is my (2), then no-op
>

I don't understand your point here. When we read data, we copy some
data from Python's current buffer, right? (
https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1638)
We use `out` (which points to res) to store the data here.

>
> >      2. create a new buffer object, fill the new buffer with raw read,
> > which reads data from disk.
>
> * this becomes: perform read operation and append incoming data (size
> (1)) to "buffer" - hence why "buffer" is larger than (1), by definition.
> NB if size (1) is smaller than size (3), multiple read operations may be
> necessary. Thus a read-loop!?
>
Yes, you are right, there is a while loop here (
https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1652
)

>
> >      3. concat the data in the old buffer and new buffer.
>
> = now no-op. Hopefully the description of 'three buffers' removes this
> confusion of/between buffers.
>
I don't get it. When we call a function like seek(0) and then read(1000),
we can still use the data from Python's buffer, right?
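
This can be checked with a small experiment, sketched here under a few assumptions: `CountingFileIO` is a made-up subclass, `demo.bin` is a scratch file, and the result reflects CPython's buffered reader rather than a documented guarantee. If the seek target is still inside the buffer, no extra raw read should be needed.

```python
import io
import os
import tempfile

class CountingFileIO(io.FileIO):
    """Hypothetical FileIO subclass that counts raw reads."""

    raw_reads = 0

    def readinto(self, b):
        self.raw_reads += 1
        return super().readinto(b)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.bin")
    with open(path, "wb") as out:
        out.write(b"a" * 2000)           # small enough to fit in one buffer

    raw = CountingFileIO(path, "r")
    reader = io.BufferedReader(raw, buffer_size=8192)

    first = reader.read(1000)            # fills the buffer from the file
    reads_before = raw.raw_reads
    reader.seek(0)                       # target is still inside the buffer
    again = reader.read(1000)            # expected to come from the buffer
    reads_after = raw.raw_reads
    reader.close()

print(reads_before, reads_after)
```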

>
> >      4. return the data
>
> * make the above steps into a while-loop and there won't be a separate
> step here (it is the existing step 1!)
>
>
> * build all of the above into a function/method, so that the 'mainline'
> only has to say 'give me data'!
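
The read steps above, folded into a while-loop inside one method, might look like the following. This is an illustrative sketch only, not how CPython's io module is written; `SketchReader` and `unit_size` are names invented for the example.

```python
import io

class SketchReader:
    """Illustrative buffered reader; not how CPython's io module is written."""

    def __init__(self, raw, unit_size=16):
        self.raw = raw              # data source, delivers units of size (1)
        self.unit_size = unit_size  # hypothetical size of one raw read
        self.buffer = bytearray()   # the "buffer" (2)

    def read(self, n):
        # Keep reading units until the buffer can satisfy the request (3)
        # or the source is exhausted.
        while len(self.buffer) < n:
            unit = self.raw.read(self.unit_size)
            if not unit:            # end of input: return what is left
                break
            self.buffer += unit
        data = bytes(self.buffer[:n])
        del self.buffer[:n]         # retain the rest for the next call
        return data

reader = SketchReader(io.BytesIO(b"0123456789" * 5), unit_size=16)
chunk_a = reader.read(20)   # needs two raw reads of 16 bytes
chunk_b = reader.read(100)  # more than remains: returns the last 30 bytes
```

The caller only ever says "give me n bytes"; the loop hides how many raw reads that takes.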
>
>
> > ### Write n data
> > 1. If the data is small enough to fit into the buffer, write data to the buffer
>
> =yes, the data coming from source (1), which in this case is 'your' code
> may/not be sufficient to fill the output size (3). So, load it into the
> "buffer" (2).
>
> > 2. If the data can't fit into the buffer
> >      1. flush the data in the buffer
>
> =This statement seems to suggest that if there is already some data in
> the buffer, it will be wiped. Not recommended!
>
We check whether there is any data in the buffer; if there is, we flush it to the disk (
https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1948
)
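
A rough way to watch this from Python is to log what reaches the raw stream underneath `io.BufferedWriter`. `LoggingRaw` is a made-up helper, and the way the data is split across raw writes is an implementation detail, so the sketch prints the log instead of assuming it.

```python
import io

class LoggingRaw(io.RawIOBase):
    """Hypothetical raw stream that records every raw write."""

    def __init__(self):
        super().__init__()
        self.writes = []

    def writable(self):
        return True

    def write(self, b):
        self.writes.append(bytes(b))   # copy the bytes handed to the raw layer
        return len(b)

raw = LoggingRaw()
writer = io.BufferedWriter(raw, buffer_size=8)

writer.write(b"abc")                   # fits in the 8-byte buffer
writes_after_small = list(raw.writes)
writer.write(b"0123456789")            # does not fit: buffered data must go out
writes_after_large = list(raw.writes)
writer.flush()                         # push out anything still buffered

print(raw.writes)
```

Nothing reaches the raw stream after the small write, and the earlier bytes are not lost when the large write arrives: they are written out, not wiped.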

> =Have replaced the next steps, see below for your consideration:-
>
> >          1. If succeed:
> >              1. create a new buffer object.
> >              2. fill the new buffer with data returned from raw write
> >          2. If failed:
> >              1. Shifting the buffer to make room for writing data to the
> > buffer
> >              2. Buffer as much writing data as possible (may raise
> > BlockingIOError)
> >      2. return the data
>
> After above transfer from data-source (1) to "buffer" (2):
>
> * if len( data in "buffer" ) >= size (3): output
>         else: keep going
>
> * output:
>         shift size(3) from "buffer" to output
>         retain 'the rest' in/as "buffer"
>
> NB if the size (2) of data in "buffer" is/could be multiples of size
> (3), then the "output" function should/could become a loop, ie keep
> emptying the "buffer" until size (2) < size (3).
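
The output loop described above can be sketched as follows. It is a toy model with invented names (`SketchWriter`, `block_size`, and a plain list standing in for the destination), not a description of CPython's BufferedWriter.

```python
class SketchWriter:
    """Illustrative block writer; the names here are made up for this sketch."""

    def __init__(self, sink, block_size=4):
        self.sink = sink              # list standing in for the destination
        self.block_size = block_size  # size (3) of one output unit
        self.buffer = bytearray()     # the "buffer" (2)

    def write(self, data):
        self.buffer += data           # shift incoming data (1) into the buffer
        # Keep emptying the buffer while it holds at least one full block.
        while len(self.buffer) >= self.block_size:
            self.sink.append(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def close(self):
        # End of output: the remainder must be written even if it is short.
        if self.buffer:
            self.sink.append(bytes(self.buffer))
            self.buffer.clear()

blocks = []
writer = SketchWriter(blocks, block_size=4)
writer.write(b"abcdefghij")   # emits b"abcd" and b"efgh", keeps b"ij"
writer.write(b"k")            # buffer now b"ijk", still short of a block
writer.close()                # emits the short final block b"ijk"
```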
>
>
> Finally, don't forget the special cases:
> What happens if we reach 'the end' (of 'input' or 'output' phase), and
> there is still data in (1) or (2)?
> Presumably, in "Read" we would discard (1), but in the case of "Write"
> we MUST empty "buffer" (2), even if it means the last write is of less
> than size (3).
>
Yes, you are right. When we are writing data to the buffer and the buffer
is full, we have to flush it.

> NB The 'rules' for the latter may vary between use-cases, eg add
> 'stuffing' if the output record MUST be x-bytes long.
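
For the fixed-length case, the "stuffing" could be as simple as padding the final short record. `pad_record` and the NUL fill byte are assumptions made for this sketch; a real file format would dictate its own fill value.

```python
def pad_record(data, record_size, fill=b"\x00"):
    """Pad a short final record to record_size bytes (hypothetical helper)."""
    if len(data) > record_size:
        raise ValueError("record is longer than record_size")
    return data.ljust(record_size, fill)   # bytes.ljust pads on the right

last = pad_record(b"ijk", 8)               # 3 data bytes plus 5 fill bytes
```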
>
>
> Hope this helps.
> Do you need to hand-code this stuff though, or is there a better way?
>
I'm trying to write an article about it :D

> --
> Regards =dn
> --
> https://mail.python.org/mailman/listinfo/python-list


Regards

Windson


