[Tutor] multiprocessing question

Tue Nov 25 06:41:55 CET 2014

On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <cs at zip.com.au> wrote:
>
> A remark about the create_lookup() function on pastebin: you go:
>
>  record_start += len(line)
>
> This presumes that a single text character on a line consumes a single byte
> of memory or file disc space. However, your data file is utf-8 encoded, and
> some characters may be more than one byte of storage. This means that your
> record_start values will not be useful because they are character counts,
> not byte counts, and you need byte counts to offset into a file if you are
> doing random access.

mmap.readline returns a byte string, so len(line) is a byte count.
That said, CsvIter._get_row_lookup shouldn't use the mmap
object. Limit its use to __getitem__.
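
To make the character/byte distinction concrete, here is a small sketch.
The sample string is made up for illustration; it is not taken from the
data file in question:

```python
# Hypothetical sample line containing two non-ASCII characters.
text = u"caf\u00e9,na\u00efve\n"

# Encoding to UTF-8 gives the bytes that would actually sit in the file.
encoded = text.encode("utf-8")

char_count = len(text)     # counts characters
byte_count = len(encoded)  # counts bytes; this is what a file offset needs

print(char_count, byte_count)
```

Each accented character takes two bytes in UTF-8, so the two counts
differ. len() on a byte string, which is what mmap.readline returns, gives
the second number.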

In CsvIter.__getitem__, I don't see the need to wrap the line in a
file-like object. It's clearly documented that csv.reader takes any
iterable that yields lines, such as a list. For example:

    # 2.x csv lacks unicode support
    line = self.data[start:end].strip()
    row = next(csv.reader([line]))
    return [cell.decode('utf-8') for cell in row]

    # 3.x csv requires unicode
    line = self.data[start:end].strip()
    row = next(csv.reader([line.decode('utf-8')]))
    return row
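
As a self-contained check of the 3.x variant, the following sketch parses
a single made-up line of bytes, standing in for self.data[start:end]. It
assumes Python 3; the sample record is hypothetical:

```python
import csv

# Hypothetical raw record as it might be sliced out of the mmap object.
raw = b'1,"Smith, John",caf\xc3\xa9\r\n'

# Strip the line ending, decode, and hand csv.reader a one-item list.
line = raw.strip().decode("utf-8")
row = next(csv.reader([line]))

print(row)
```

The quoted comma stays inside its cell, so the list-of-one-line approach
gives the same parsing as reading through a file-like wrapper.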

CsvIter._get_row_lookup should work on a regular file from built-in
open (not codecs.open), opened in binary mode. Blocking I/O on a regular
file releases the GIL, so the main thread can run in the meantime.
Reading from an mmap object doesn't do this.

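Binary mode ensures the offsets are byte offsets, and therefore valid for
use with the mmap object in __getitem__. Splitting lines on the newline
byte requires an ASCII compatible encoding such as UTF-8, in which that
byte never occurs inside a multi-byte character.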

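Also, iterate in a for loop instead of calling readline in a while loop.
2.x file.next uses a read-ahead buffer to improve performance.
To see this, check tell() in a for loop: it reports the position of the
read-ahead buffer rather than the end of the current line, so keep
computing the offsets from len(line).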
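
Putting these points together, here is a rough sketch of how the lookup
and the row access could be split. It assumes Python 3. The names
build_row_lookup and get_row are mine, standing in for
CsvIter._get_row_lookup and CsvIter.__getitem__, and the sample file
contents are made up:

```python
import csv
import mmap
import os
import tempfile


def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    start = 0
    with open(path, "rb") as f:   # binary mode, so len(line) counts bytes
        for line in f:            # for loop instead of repeated readline()
            end = start + len(line)
            lookup.append((start, end))
            start = end
    return lookup


def get_row(data, lookup, index):
    """Slice one line out of the mmap object and parse it as a CSV row."""
    start, end = lookup[index]
    line = data[start:end].strip()
    return next(csv.reader([line.decode("utf-8")]))


# Write a small hypothetical UTF-8 file to try this on.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(u"id,name\n1,caf\u00e9\n2,na\u00efve\n".encode("utf-8"))

lookup = build_row_lookup(path)

# The mmap object is only used for random access to individual rows.
with open(path, "rb") as f:
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        rows = [get_row(data, lookup, i) for i in range(len(lookup))]
    finally:
        data.close()

os.remove(path)
print(lookup)
print(rows)
```

The offsets come from the regular file and the row data from the mmap
object, and both agree because both are measured in bytes.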