Understand workflow about reading and writing files in Python

Windson Yang wiwindson at gmail.com
Mon Jun 24 19:50:33 EDT 2019


On Mon, Jun 24, 2019 at 11:18 AM, DL Neil <PythonList at danceswithmice.info> wrote:

> Yes, better to reply to list - others may 'jump in'...
>
>
> On 20/06/19 5:37 PM, Windson Yang wrote:
> > Thank you so much for your review, DL Neil, it really helps :D. However,
> > some parts still confuse me; I have replied below.
>
> It's not a particularly easy topic...
>
>
> > On Wed, Jun 19, 2019 at 2:03 PM, DL Neil <PythonList at danceswithmice.info> wrote:
> >
> >     I've not gone 'back' to refer to any ComSc theory on
> >     buffer-management.
> >     Perhaps you might benefit from such?
> >
> > I just take a crash course on it so I want to know if I understand the
> > details correctly :D
>
> ...there are so many ways one can mess-up!
>
>
> >     I like your use of the word "shift", so I'll continue to use it.
> >
> >     There are three separate units of data to consider - each of which
> >     could
> >     be called a "buffer". To avoid confusing (myself) I'll only call the
> >     'middle one' that:
> >     1 the unit of data 'coming' from the data-source
> >     2 the "buffer" you are implementing
> >     3 the unit of data 'going' out to a data-destination.
> >
> > Just to make it clear, when we use `f.write('abc')` in python, (1) means
> > 'abc', (2) means the buffer handle by Python (by default 8kb), (2) means
> > the file *f* we are writing to, right?
>
Sorry, this is my typo: (3) means the file *f* we are writing to, right?


> No! (sorry) f.write() is an output operation, thus nr3.
>
> "f" is not a "buffer handle" but a "file handle" or more accurately a
> "file object".
>
> When we:
>
>         one_input = f.read( NRbytes )
>
> (ignoring EOF/short file and other exceptions) that many bytes will
> 'appear' in our program labelled as "one_input".
>
> However, the OpSys may have read considerably more data, depending upon
> the device(s) involved, the application, etc; eg if we ask for 2 bytes
> the operating system will read a much larger block (or applicable unit)
> of data from a disk drive.
>
> The same applies in reverse, with f.write( NRbytes/byte-object ), until
> we flush or close the file.
>
> Those situations account for nr1 and nr3. In the usual case, we have no
> control over the size of these buffers - and it is best not to meddle!
>
I agree with you.
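
A minimal sketch of the "until we flush or close the file" point, assuming a regular file on local disk and Python's default buffering. The temporary file and the variable names are only for illustration:

```python
import os
import tempfile

# Scratch file used only for this sketch.
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb") as f:              # default buffering
    f.write(b"abc")                      # lands in Python's buffer, not yet in the file
    size_before_flush = os.path.getsize(path)
    f.flush()                            # hand the buffered bytes to the OS
    size_after_flush = os.path.getsize(path)

os.remove(path)
print(size_before_flush, size_after_flush)   # 0 3
```

The three bytes only show up in the file's size once flush() (or close()) has run.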

> Hence:-
>
> >     1 and 3 may be dictated to you, eg hardware or file specifications,
> >     code
> >     requirements, etc.
> >
> >     So, data is shifted into the (2) buffer in a unit-size decided by
> >     (1) - in most use-cases each incoming unit will be the same size,
> >     but remember that the last 'unit' may/not be full-size. Similarly,
> >     data shifted out from the (2) buffer to (3).
> >
> >     The size of (1) is likely not that of (3) - otherwise why use a
> >     "buffer"? The size of (2) must be larger than (1) and larger than
> >     (2) - for reasons already illustrated.
> >
> > Is this a typo? (2) larger than (1) larger than (2)?
>
> Correct - well spotted! nr2 > nr1 and nr2 > nr3
>

When we write 100 bytes with f.write(), I understand why nr2 (by default
8 KB) > nr1 (100 bytes), but I'm not sure why nr2 > nr3 (the file object) here?

>
>
> >     I recall learning how to use buffers with a series of hand-drawn
> >     block diagrams. Recommend you try similarly!
>
> Try this!
>
>
> >     Now, let's add a few critiques, as requested (interposed below):-
> >
> >
> >     On 19/06/19 3:53 PM, Windson Yang wrote:
> >      > I'm trying to understand the workflow of how Python reads/writes
> >      > data with a buffer. I would appreciate it if someone could review it.
> >      >
> >      > ### Read n data
> >
> >     - may need more than one read operation if the size of (3) "demands"
> >     more data than the size of (1)/one "read".
> >
> >
> > Looks like the length of one read() depends on
> > https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1655 ?
>
>
> You decide how many bytes should be read. That's how much will be
> transferred from the OpSys' I/O into the Python program's space. There
> are two major exceptions: if there is no (more) data available, that is
> treated as the end-of-file (EOF) condition; and if there are fewer bytes
> of data than requested, you will be given only the number of bytes
> available.
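
A short sketch of those two cases, using an in-memory io.BytesIO as a stand-in for a real file (an assumption made to keep the example self-contained). Note that Python's read() signals end-of-file by returning an empty bytes object, not by raising:

```python
import io

stream = io.BytesIO(b"0123456789")   # 10 bytes standing in for a file

first = stream.read(4)    # full request satisfied
second = stream.read(4)   # full request satisfied
third = stream.read(4)    # only 2 bytes remain, so only 2 are returned
fourth = stream.read(4)   # nothing left: empty bytes marks end-of-file

print(first, second, third, fourth)   # b'0123' b'4567' b'89' b''
```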
>
>
> >      > 1. If the data already in the buffer, return data
> >
> >     - this a data-transfer of size (3)
> >
> >     For extra credit/an unnecessary complication (but probable
> >     speed-up!):
> >     * if the data-remaining is less than size (3) consider a read-ahead
> >     mechanism
> >
> >      > 2. If the data not in the buffer:
> >
> >     - if buffer's data-len < size (3)
> >
> >      >      1. copy all the current data from the buffer
> >
> >     * if "buffer" is my (2), then no-op
> >
> > I don't understand your point here, when we read data we would copy some
> > data from the current buffer from python, right?
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1638),
> > we use `out` (which points to res) to store the data here.
>
> We're becoming confused: the original heading 'here' was "### Read n
> data" which is inconsistent with "out" and "from python".
>
>
> If the read operation is set to transfer (say) 2KB into the program at a
> time, but the code processes it in 100B units, then it would seem that
> after the first read, twenty process loops will run before it is
> necessary to issue another input request.
>
> In that example, the buffer (nr2) is twenty-times the length of the
> input 'buffer' (nr1).
>
> So, from the second to the twentieth iteration of the process, your
> step-1 "1. If the data already in the buffer, return data" (and thus my
> "no-op") applies!
>
> This is a major advantage of having a buffer in the first place -
> transfers within RAM are significantly faster than I/O operations!
>
Yes, that is what I was trying to say. Looks like I should add more details
for the code.
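
The 2 KB / 100 B example can be sketched with io.BufferedReader wrapped around a raw source that counts how often it is actually asked for data. CountingRaw is a hypothetical helper written for this illustration, and "2 KB" is taken as 2000 bytes so that it divides evenly into twenty units:

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw source that counts real read requests."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.calls += 1                  # one 'real' input operation
        return self._inner.readinto(b)

raw = CountingRaw(bytes(2000))
buffered = io.BufferedReader(raw, buffer_size=2000)

units = 0
while True:
    chunk = buffered.read(100)           # the program consumes 100-byte units
    if not chunk:                        # empty bytes means end of input
        break
    units += 1

print(units, raw.calls)
```

The loop processes twenty units, while the raw source is touched far fewer times: most read(100) calls are served from memory.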

>
> >      >      2. create a new buffer object, fill the new buffer with a raw
> >      >      read, which reads data from disk.
> >
> >     * this becomes: perform read operation and append incoming data (size
> >     (1)) to "buffer" - hence why "buffer" is larger than (1), by
> >     definition.
> >     NB if size (1) is smaller than size (3), multiple read operations
> >     may be necessary. Thus a read-loop!?
> >
> > Yes, you are right, here is a while loop
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1652)
>
> >
> >
> >
> >      >      3. concat the data in the old buffer and new buffer.
> >
> >     = now no-op. Hopefully the description of 'three buffers' removes
> >     this confusion of/between buffers.
> >
> >   I don't get it. When we call functions like seek(0) and then
> > read(1000), we can still use the data from Python's buffer, right?
>
> I fear that we are having terminology issues - see the original
> description of three 'buffers'. Which "buffer" are you talking about?
> 1 the seek/read are carried-out against a file object, which will indeed
> have its own buffer, size unknown to Python. (buffer 1)
> 2 the read(1000) operation will (on its own) allow you to populate a
> buffer within your code, 1000-bytes in length. (buffer 2)
>
Does the file object have its own buffer? Does that only happen when we use
Standard I/O (FILE*)? I'm not sure it is used in CPython, or maybe I missed
something.
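
One way to see which "buffer" is which, at least on the Python side: a file opened in text mode is a stack of io objects, and the buffered layer is an attribute of the text layer. This is a small sketch assuming CPython 3 and a scratch temporary file. As far as I understand, these layers sit on top of an OS file descriptor and not on C stdio's FILE*:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # scratch file for the illustration
os.close(fd)

with open(path, "w") as text_file:
    buffered = text_file.buffer     # buffered binary layer (holds Python's buffer)
    raw = buffered.raw              # unbuffered layer that talks to the OS
    layer_names = [type(obj).__name__ for obj in (text_file, buffered, raw)]

os.remove(path)
print(layer_names)   # ['TextIOWrapper', 'BufferedWriter', 'FileIO']
```

Any caching the operating system does below the raw layer is separate from this and is not visible from Python.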

>
> >      >      4. return the data
> >
> >     * make the above steps into a while-loop and there won't be a
> >     separate step here (it is the existing step 1!)
> >
> >
> >     * build all of the above into a function/method, so that the
> >     'mainline' only has to say 'give me data'!
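
The read steps above, folded into one while-loop behind a single method, might look like the sketch below. SimpleReadBuffer, give_me_data and the sizes are all invented for illustration; this is not how CPython's bufferedio.c is organised:

```python
import io

class SimpleReadBuffer:
    """Illustrative only: a hand-rolled read buffer."""

    def __init__(self, source, unit_size=8):
        self._source = source          # (1) where raw units come from
        self._unit_size = unit_size    # size of one raw read
        self._buffer = b""             # (2) the buffer itself

    def give_me_data(self, n):
        # Keep reading raw units until the request can be met or input ends.
        while len(self._buffer) < n:
            unit = self._source.read(self._unit_size)
            if not unit:               # end of input: the last unit may be short
                break
            self._buffer += unit
        # Shift n bytes out to the caller (3); retain the rest for next time.
        out, self._buffer = self._buffer[:n], self._buffer[n:]
        return out

reader = SimpleReadBuffer(io.BytesIO(b"abcdefghijklmnopqrstuvwxyz"), unit_size=8)
pieces = [reader.give_me_data(10) for _ in range(4)]
print(pieces)   # [b'abcdefghij', b'klmnopqrst', b'uvwxyz', b'']
```

The 'mainline' only ever calls give_me_data(); whether the bytes came from the buffer or from a fresh raw read is hidden inside the loop.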
> >
> >
> >      > ### Write n data
> >      > 1. If the data is small enough to fit into the buffer, write data to
> >      > the buffer
> >
> >     =yes, the data coming from source (1), which in this case is 'your'
> >     code, may/not be sufficient to fill the output size (3). So, load it
> >     into the "buffer" (2).
> >
> >      > 2. If data can't fit into the buffer
> >      >      1. flush the data in the buffer
> >
> >     =This statement seems to suggest that if there is already some data
> >     in the buffer, it will be wiped. Not recommended!
> >
> > We check whether there is any data in the buffer; if there is, we flush
> > it to the disk
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1948)
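
A rough way to watch this "flush, don't wipe" behaviour from Python code, assuming a regular local file and a deliberately tiny buffer (buffering=16, chosen only to make the effect visible):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()   # scratch file, used only for this sketch
os.close(fd)

with open(path, "wb", buffering=16) as f:   # deliberately tiny buffer
    f.write(b"x" * 10)                      # fits: stays in Python's buffer
    size_small = os.path.getsize(path)
    f.write(b"y" * 100)                     # does not fit: buffered bytes are flushed first
    size_large = os.path.getsize(path)

size_closed = os.path.getsize(path)         # close() flushed whatever was left
os.remove(path)
print(size_small, size_large, size_closed)
```

After the small write the file is still empty; after the oversized write the earlier ten bytes have reached the file and were not discarded; after close() all 110 bytes are there.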
>
> >
> >
> >     =Have replaced the next steps, see below for your consideration:-
> >
> >      >          1. If succeed:
> >      >              1. create a new buffer object.
> >      >              2. fill the new buffer with data returned from raw write
> >      >          2. If failed:
> >      >              1. Shift the buffer to make room for writing data
> >      >                 to the buffer
> >      >              2. Buffer as much writing data as possible (may raise
> >      >                 BlockingIOError)
> >      >      2. return the data
> >
> >     After above transfer from data-source (1) to "buffer" (2):
> >
> >     * if len( data in "buffer" ) >= size (3): output
> >              else: keep going
> >
> >     * output:
> >              shift size(3) from "buffer" to output
> >              retain 'the rest' in/as "buffer"
> >
> >     NB if the size (2) of data in "buffer" is/could be multiples of size
> >     (3), then the "output" function should/could become a loop, ie keep
> >     emptying the "buffer" until size (2) < size (3).
> >
> >
> >     Finally, don't forget the special cases:
> >     What happens if we reach 'the end' (of 'input' or 'output' phase),
> >     and there is still data in (1) or (2)?
> >     Presumably, in "Read" we would discard (1), but in the case of
> >     "Write" we MUST empty "buffer" (2), even if it means the last write
> >     is of less than size (3).
> >
> > Yes, you are right, when we are writing data to the buffer and the
> > buffer is full, we have to flush it.
> >
> >     NB The 'rules' for the latter may vary between use-cases, eg add
> >     'stuffing' if the output record MUST be x-bytes long.
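
The write-side rules (shift size (3) out while enough data is buffered, keep the rest, and always empty the buffer at the end) could be sketched as follows. SimpleWriteBuffer and its parameters are hypothetical, the "destination" is just a list of records, and padding ('stuffing') of the final short record is left out:

```python
class SimpleWriteBuffer:
    """Illustrative only: a hand-rolled write buffer."""

    def __init__(self, sink, out_size=4):
        self._sink = sink              # (3) list collecting output records
        self._out_size = out_size      # size of one output unit
        self._buffer = b""             # (2) the buffer itself

    def write(self, data):
        self._buffer += data           # load incoming data (1) into the buffer
        # Keep emptying while at least one full output unit is available.
        while len(self._buffer) >= self._out_size:
            self._sink.append(self._buffer[:self._out_size])
            self._buffer = self._buffer[self._out_size:]

    def close(self):
        # Special case: the last write may be shorter than out_size.
        if self._buffer:
            self._sink.append(self._buffer)
            self._buffer = b""

records = []
writer = SimpleWriteBuffer(records, out_size=4)
writer.write(b"ab")          # too little for one output unit: buffered only
writer.write(b"cdefghij")    # now two full units can go out
writer.close()               # the leftover b"ij" must not be lost
print(records)               # [b'abcd', b'efgh', b'ij']
```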
> >
> >
> >     Hope this helps.
> >     Do you need to hand-code this stuff though, or is there a better way?
> >
> > I'm trying to write an article for it :D
>
>
> Perhaps it would help to discuss the use-case you will use as the
> article's example.
>
> "I take a crash course" cf "write an article"???
>
>
> Web-Refs:
>
> Wikipedia: https://en.wikipedia.org/wiki/Data_buffer
>
> The PSL's IO library (?the code you've been reading):
>
> https://docs.python.org/3.6/library/io.html?highlight=buffer#io.TextIOBase.buffer
>
> The PSL's Readline library (which may be easier to visualise for
> desktop-type users/coders - unless you're into IoT applications and
> similar)
> https://docs.python.org/3.6/library/readline.html?highlight=buffer
>
> PSL's Buffer protocol, in case you really want to 're-invent the wheel',
> but with some possibly-helpful explanation:
> https://docs.python.org/3.6/c-api/buffer.html?highlight=buffer
>
>
> --
> Regards =dn