Understand workflow about reading and writing files in Python

Tue Jun 25 16:34:51 EDT 2019

On 25/06/19 11:50 AM, Windson Yang wrote:
> DL Neil <PythonList at danceswithmice.info 
> <mailto:PythonList at danceswithmice.info>> wrote on Mon, 24 Jun 2019 at 11:18 AM:
>     Yes, better to reply to list - others may 'jump in'...
>     On 20/06/19 5:37 PM, Windson Yang wrote:
>      > Thank you so much for your review, DL Neil, it really helps :D.
>     However,
>      > there are some parts that still confuse me; I have replied below.
>     It's not a particularly easy topic...
>      > DL Neil <PythonList at danceswithmice.info
>     <mailto:PythonList at danceswithmice.info>
>      > <mailto:PythonList at danceswithmice.info
>     <mailto:PythonList at danceswithmice.info>>> wrote on Wed, 19 Jun 2019 at
>     2:03 PM:
...

>     ...there are so many ways one can mess-up!
>      >     I like your use of the word "shift", so I'll continue to use it.
>      >     There are three separate units of data to consider - each of
>     which
>      >     could
>      >     be called a "buffer". To avoid confusing (myself) I'll only
>     call the
>      >     'middle one' that:
>      >     1 the unit of data 'coming' from the data-source
>      >     2 the "buffer" you are implementing
>      >     3 the unit of data 'going' out to a data-destination.
>      >
>      > Just to make it clear, when we use `f.write('abc')` in python,
>     (1) means
>      > 'abc', (2) means the buffer handle by Python (by default 8kb),
>     (2) means
>      > the file *f* we are writing to, right?
> Sorry, this is my typo, (3) means the file *f* we are writing to, right?

Correct.
- to avoid exactly this sort of confusion it is best NOT to use short 
names but to (carefully) choose meaningful variable-names!

How about:
1 "input_file" (I'd prefer to add context, eg "expenses_file")
and
3 "output_file" (again... "expenses_summary")

>     No! (sorry) f.write() is an output operation, thus nr3.
> 
>     "f" is not a "buffer handle" but a "file handle" or more accurately a
>     "file object".
> 
>     When we:
> 
>              one_input = f.read( NRbytes )
> 
>     (ignoring EOF/short file and other exceptions) that many bytes will
>     'appear' in our program labelled as "one_input".
> 
>     However, the OpSys may have read considerably more data, depending upon
>     the device(s) involved, the application, etc; eg if we ask for 2 bytes
>     the operating system will read a much larger block (or applicable unit)
>     of data from a disk drive.
> 
>     The same applies in reverse, with f.write( NRbytes/byte-object ), until
>     we flush or close the file.
> 
>     Those situations account for nr1 and nr3. In the usual case, we have no
>     control over the size of these buffers - and it is best not to meddle!
> 
> I agreed with you.
> 
>     Hence:-
> 
>      >     1 and 3 may be dictated to you, eg hardware or file
>     specifications,
>      >     code
>      >     requirements, etc.
>      >
>      >     So, data is shifted into the (2) buffer in a unit-size
>     decided by (1) -
>      >     in most use-cases each incoming unit will be the same size, but
>      >     remember
>      >     that the last 'unit' may/not be full-size. Similarly, data
>     shifted out
>      >     from the (2) buffer to (3).
>      >
>      >     The size of (1) is likely not that of (3) - otherwise why use a
>      >     "buffer"? The size of (2) must be larger than (1) and larger
>     than (2) -
>      >     for reasons already illustrated.
>      >
>      > Is this a typo? (2) larger than (1) larger than (2)?
> 
>     Correct - well spotted! nr2 > nr1 and nr2 > nr3
> 
> 
> When we run 'f.write(100)', I understand why nr2 (by default 8kb) > nr1 
> (100), but I'm not sure why nr2 > nr3 (file object) here?

The program's internal string or data-structure MUST be as large as, or 
larger than, the output size.

If the output file requires three fields of four kilobytes each, that's 
12KB. If the internal buffer is only 1KB in length, how will it satisfy 
the 12KB specification?

That said, a classic 'out by one' error. The earlier text should have 
said "len(nr2) >= len(nr3)"!
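
To make the size relationship concrete, here is a minimal sketch (not the 
PSL's implementation). RecordBuffer and its method names are hypothetical, 
invented for illustration. It accumulates incoming pieces (nr1) in a 
buffer (nr2) and only emits complete output units (nr3), so the buffer has 
to be able to hold at least one full output unit:

```python
import io


class RecordBuffer:
    """Hypothetical 'nr2' buffer that emits fixed-size output units ('nr3')."""

    def __init__(self, sink, unit_size):
        self.sink = sink            # anything with a write(bytes) method
        self.unit_size = unit_size  # size of one output unit (nr3)
        self.pending = bytearray()  # the buffer itself (nr2)

    def write(self, data):
        # Data arrives in whatever size the source (nr1) provides.
        self.pending += data
        # Keep emitting while at least one full output unit is buffered.
        while len(self.pending) >= self.unit_size:
            self.sink.write(bytes(self.pending[:self.unit_size]))
            del self.pending[:self.unit_size]

    def close(self, pad=None):
        # The last unit may be short; optionally 'stuff' it to full size.
        if self.pending:
            tail = bytes(self.pending)
            if pad is not None:
                tail = tail.ljust(self.unit_size, pad)
            self.sink.write(tail)
            self.pending.clear()


sink = io.BytesIO()                 # stands in for the output file
buf = RecordBuffer(sink, unit_size=12)
for piece in (b"aaaa", b"bbbb", b"cccc", b"dd"):
    buf.write(piece)
buf.close(pad=b".")
result = sink.getvalue()
```

Here three 4-byte pieces fill one 12-byte unit, and the leftover two bytes 
are padded out when the buffer is closed.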

>      >     I recall learning how to use buffers with a series of
>     hand-drawn block
>      >     diagrams. Recommend you try similarly!
> 
>     Try this!
> 
> 
>      >     Now, let's add a few critiques, as requested (interposed below):-
>      >
>      >
>      >     On 19/06/19 3:53 PM, Windson Yang wrote:
>      >      > I'm trying to understand the workflow of how Python
>     reads/writes
>      >     data with
>      >      > a buffer. I would appreciate it if someone could review it.
>      >      >
>      >      > ### Read n data
>      >
>      >     - may need more than one read operation if the size of (3)
>     "demands"
>      >     more data than the size of (1)/one "read".
>      >
>      >
>      > Looks like the length of one read() depends on
>      >
>     https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1655 ?
> 
> 
>     You decide how many bytes should be read. That's how much will be
>     transferred from the OpSys' I/O into the Python program's space. The
>     major exceptions: if there is no (more) data available, that is
>     end-of-file (EOF), which Python's read() reports by returning an
>     empty result; and if there are fewer bytes of data than requested,
>     you will be given only the number of bytes available.
> 
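
A small sketch of that behaviour, using an in-memory io.BytesIO as a 
stand-in for a real file (the ten-byte payload and the request size of 4 
are arbitrary choices for illustration):

```python
import io

input_file = io.BytesIO(b"abcdefghij")  # ten bytes of pretend file data

first = input_file.read(4)    # a full-size unit
second = input_file.read(4)   # another full-size unit
third = input_file.read(4)    # only two bytes remain, so a short unit
fourth = input_file.read(4)   # nothing left: an empty bytes object
```

In Python's io module, running out of data is reported by the empty 
result, so a loop can simply stop when read() returns something falsy.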
> 
>      >      > 1. If the data already in the buffer, return data
>      >
>      >     - this is a data-transfer of size (3)
>      >
>      >     For extra credit/an unnecessary complication (but probable
>     speed-up!):
>      >     * if the data-remaining is less than size (3) consider a
>     read-ahead
>      >     mechanism
>      >
>      >      > 2. If the data not in the buffer:
>      >
>      >     - if buffer's data-len < size (3)
>      >
>      >      >      1. copy all the current data from the buffer
>      >
>      >     * if "buffer" is my (2), then no-op
>      >
>      > I don't understand your point here, when we read data we would
>     copy some
>      > data from the current buffer from python, right?
>      >
>     (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1638),
> 
>      > we use `out` (which points to res) to store the data here.
> 
>     We're becoming confused: the original heading 'here' was "### Read n
>     data" which is inconsistent with "out" and "from python".
> 
> 
>     If the read operation is set to transfer (say) 2KB into the program
>     at a
>     time, but the code processes it in 100B units, then it would seem that
>     after the first read, twenty process loops will run before it is
>     necessary to issue another input request.
> 
>     In that example, the buffer (nr2) is twenty-times the length of the
>     input 'buffer' (nr1).
> 
>     So, from the second to the twentieth iteration of the process, your
>     step-1 "1. If the data already in the buffer, return data" (and thus my
>     "no-op") applies!
> 
>     This is a major advantage of having a buffer in the first place -
>     transfers within RAM are significantly faster than I/O operations!
> 
> Yes, that is what I am trying to say. Looks like I should add more details 
> for the code.

Code? Didn't you talk about an article?

If you are coding, please see earlier comment (and web-refs) to save 
your time and the pitfalls inherent in 're-inventing the wheel'.
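
For the 2KB/100B example quoted above, a minimal sketch that leans on the 
PSL's own io.BufferedReader rather than a hand-written buffer. CountingRaw 
is a hypothetical stand-in for the data-source; all it adds is a count of 
how often it is asked for data. The sizes (2000 and 100 bytes, 4000 bytes 
of input) are assumptions chosen to keep the arithmetic round:

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw source that counts how often it is asked for data."""

    def __init__(self, payload):
        super().__init__()
        self._src = io.BytesIO(payload)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Called by the buffered layer whenever it needs more data.
        self.calls += 1
        return self._src.readinto(b)


raw = CountingRaw(bytes(4000))
reader = io.BufferedReader(raw, buffer_size=2000)

units = []
while True:
    unit = reader.read(100)   # the program processes 100 bytes at a time
    if not unit:              # empty result means no more data
        break
    units.append(unit)
```

The program receives forty 100-byte units, while raw.calls stays far 
smaller than that: most read() calls are satisfied from the buffer in RAM.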

>      >      >      2. create a new buffer object, fill the new buffer
>     with raw
>      >     read which
>      >      > read data from disk.
>      >
>      >     * this becomes: perform read operation and append incoming
>     data (size
>      >     (1)) to "buffer" - hence why "buffer" is larger than (1), by
>     definition.
>      >     NB if size (1) is smaller than size (3), multiple read operations
>      >     may be
>      >     necessary. Thus a read-loop!?
>      >
>      > Yes, you are right, here is a while loop
>      >
>     (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1652)
> 
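
A sketch of such a read-loop in Python (not a transcription of the C code 
linked above). Both read_exactly and the Trickle class are hypothetical 
names; Trickle merely simulates a source that hands over at most `step` 
bytes per read:

```python
import io


class Trickle:
    """Hypothetical source that returns at most `step` bytes per read."""

    def __init__(self, payload, step):
        self._src = io.BytesIO(payload)
        self._step = step

    def read(self, n):
        return self._src.read(min(n, self._step))


def read_exactly(source, wanted):
    """Keep reading until `wanted` bytes are collected or the source ends."""
    collected = bytearray()
    while len(collected) < wanted:
        piece = source.read(wanted - len(collected))
        if not piece:          # empty result: the source has no more data
            break
        collected += piece
    return bytes(collected)


full = read_exactly(Trickle(b"0123456789", step=3), 8)
short = read_exactly(Trickle(b"0123", step=3), 8)
```

The first call needs three reads to collect eight bytes; the second runs 
out of data and returns only the four bytes that exist.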
>      >
>      >
>      >
>      >      >      3. concat the data in the old buffer and new buffer.
>      >
>      >     = now no-op. Hopefully the description of 'three buffers'
>     removes this
>      >     confusion of/between buffers.
>      >
>      >   I don't get it. When we call the function like seek(0) then
>      > read(1000), we can still use the data from buffer from python, right?
> 
>     I fear that we are having terminology issues - see the original
>     description of three 'buffers'. Which "buffer" are you talking about?
>     1 the seek/read are carried-out against a file object, which will
>     indeed
>     have its own buffer, size unknown to Python. (buffer 1)
>     2 the read(1000) operation will (on its own) allow you to populate a
>     buffer within your code, 1000-bytes in length. (buffer 2)
> 
> Does the file object have its own buffer?  Does that only happen when we 
> use Standard I/O (FILE*)? I'm not sure I used it in CPython, or maybe I 
> missed something.

Yes, the standard Python facilities/APIs provide access to or otherwise 
supplement OpSys interfaces (and thus the hardware (or whatever) buffer).

NB the i/o buffer (nrs1 and 3) may not be 'addressable' from Python 
code, other than by a read/write operation (and similar). All we do is 
read/write (etc) and Python + OpSys do the rest: ie take care of 'the 
gory details'.

If you want to get 'down-and-dirty', I suspect you'll need to move to 
another (lower-level) language(?)
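
That said, a little can be observed from Python itself. A minimal sketch, 
assuming CPython's standard open() and a temporary file (the file name is 
made up for the example):

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "expenses_summary.bin")  # hypothetical name

    with open(path, "wb") as output_file:
        # In binary mode open() returns a buffered object wrapping a raw one.
        is_buffered = isinstance(output_file, io.BufferedWriter)
        raw_is_fileio = isinstance(output_file.raw, io.FileIO)

        output_file.write(b"abc")
        size_before_flush = os.path.getsize(path)  # data still held by Python

        output_file.flush()
        size_after_flush = os.path.getsize(path)   # now handed to the OpSys

    with open(path, "rb", buffering=0) as unbuffered:
        # buffering=0 (binary mode only) skips Python's own buffer.
        is_raw = isinstance(unbuffered, io.FileIO)
```

The three bytes only show up in the file's size after flush(); until then 
they sit in the Python-level buffer, whose default size is available as 
io.DEFAULT_BUFFER_SIZE.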

>      >      >      4. return the data
>      >
>      >     * make the above steps into a while-loop and there won't be a
>     separate
>      >     step here (it is the existing step 1!)
>      >
>      >
>      >     * build all of the above into a function/method, so that the
>     'mainline'
>      >     only has to say 'give me data'!
>      >
>      >
>      >      > ### Write n data
>      >      > 1. If the data is small enough to fit into the buffer, write data to
>      >     the buffer
>      >
>      >     =yes, the data coming from source (1), which in this case is
>     'your'
>      >     code
>      >     may/not be sufficient to fill the output size (3). So, load
>     it into the
>      >     "buffer" (2).
>      >
>      >      > 2. If the data can't fit into the buffer
>      >      >      1. flush the data in the buffer
>      >
>      >     =This statement seems to suggest that if there is already
>     some data in
>      >     the buffer, it will be wiped. Not recommended!
>      >
>      > We check whether there is any data in the buffer; if there is, we
>     flush it to the disk
>      >
>     (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1948)
> 
>      >
>      >
>      >     =Have replaced the next steps, see below for your consideration:-
>      >
>      >      >          1. If succeed:
>      >      >              1. create a new buffer object.
>      >      >              2. fill the new buffer with data return from
>     raw write
>      >      >          2. If failed:
>      >      >              1. Shifting the buffer to make room for
>     writing data
>      >     to the
>      >      > buffer
>      >      >              2. Buffer as much writing data as possible
>     (may raise
>      >      > BlockingIOError)
>      >      >      2. return the data
>      >
>      >     After above transfer from data-source (1) to "buffer" (2):
>      >
>      >     * if len( data in "buffer" ) >= size (3): output
>      >              else: keep going
>      >
>      >     * output:
>      >              shift size(3) from "buffer" to output
>      >              retain 'the rest' in/as "buffer"
>      >
>      >     NB if the size (2) of data in "buffer" is/could be multiples
>     of size
>      >     (3), then the "output" function should/could become a loop,
>     ie keep
>      >     emptying the "buffer" until size (2) < size (3).
>      >
>      >
>      >     Finally, don't forget the special cases:
>      >     What happens if we reach 'the end' (of 'input' or 'output'
>     phase), and
>      >     there is still data in (1) or (2)?
>      >     Presumably, in "Read" we would discard (1), but in the case
>     of "Write"
>      >     we MUST empty "buffer" (2), even if it means the last write
>     is of less
>      >     than size (3).
>      >
>      > Yes, you are right, when we are writing data to the buffer and the
>      > buffer is full, we have to flush it.
>      >
>      >     NB The 'rules' for the latter may vary between use-cases, eg add
>      >     'stuffing' if the output record MUST be x-bytes long.
>      >
>      >
>      >     Hope this helps.
>      >     Do you need to hand-code this stuff though, or is there a
>     better way?
>      >
>      > I'm trying to write an article for it :D
> 
> 
>     Perhaps it would help to discuss the use-case you will use as the
>     article's example.
> 
>     "I take a crash course" cf "write an article"???
> 
> 
>     Web-Refs:
> 
>     Wikipedia: https://en.wikipedia.org/wiki/Data_buffer
> 
>     The PSL's IO library (?the code you've been reading):
>     https://docs.python.org/3.6/library/io.html?highlight=buffer#io.TextIOBase.buffer
> 
>     The PSL's Readline library (which may be easier to visualise for
>     desktop-type users/coders - unless you're into IoT applications and
>     similar)
>     https://docs.python.org/3.6/library/readline.html?highlight=buffer
> 
>     PSL's Buffer protocol, in case you really want to 're-invent the
>     wheel',
>     but with some possibly-helpful explanation:
>     https://docs.python.org/3.6/c-api/buffer.html?highlight=buffer

-- 
Regards =dn