[Tutor] multiprocessing question
eryksun
eryksun at gmail.com
Tue Nov 25 06:41:55 CET 2014
On Sun, Nov 23, 2014 at 7:20 PM, Cameron Simpson <cs at zip.com.au> wrote:
>
> A remark about the create_lookup() function on pastebin: you go:
>
> record_start += len(line)
>
> This presumes that a single text character on a line consumes a single byte
> of memory or file disc space. However, your data file is utf-8 encoded, and
> some characters may be more than one byte of storage. This means that your
> record_start values will not be useful because they are character counts,
> not byte counts, and you need byte counts to offset into a file if you are
> doing random access.
mmap.readline returns a byte string, so len(line) is a byte count.
That said, CsvIter._get_row_lookup shouldn't use the mmap
object. Limit its use to __getitem__.
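To check the point about mmap.readline for yourself, here is a small sketch written for Python 3. The helper name mmap_line_lengths and the sample data are made up for illustration; they are not from the pastebin code. It records the type and length of each line that mmap.readline returns.

```python
import mmap
import os
import tempfile

def mmap_line_lengths(path):
    """Return (type name, length) for each line read via mmap.readline."""
    results = []
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            while True:
                line = mm.readline()
                if not line:  # readline returns an empty byte string at EOF
                    break
                results.append((type(line).__name__, len(line)))
        finally:
            mm.close()
    return results

# Hypothetical sample: u'\u00e9' is one character but two bytes in UTF-8.
sample = u'caf\u00e9,1\nabc,2\n'
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-8'))

lengths = mmap_line_lengths(sample_path)
os.remove(sample_path)
```

The first line has 7 characters but its reported length is 8, because the length is counted in bytes.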
In CsvIter.__getitem__, I don't see the need to wrap the line in a
filelike object. It's clearly documented that csv.reader takes an
iterable object, such as a list. For example:
    # 2.x csv lacks unicode support
    line = self.data[start:end].strip()
    row = next(csv.reader([line]))
    return [cell.decode('utf-8') for cell in row]

    # 3.x csv requires unicode
    line = self.data[start:end].strip()
    row = next(csv.reader([line.decode('utf-8')]))
    return row
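As a standalone version of the 3.x variant, something like the following should work. It is only a sketch: parse_row is a hypothetical free function standing in for the body of __getitem__, and a plain bytes object stands in for the mmap (slicing either one yields bytes in 3.x).

```python
import csv

def parse_row(data, start, end):
    """Decode one UTF-8 CSV line from a bytes-like buffer and parse it."""
    line = bytes(data[start:end]).strip()
    # csv.reader accepts any iterable of strings, so a one-item list will do.
    return next(csv.reader([line.decode('utf-8')]))

# Hypothetical sample data: a header line plus one record with a quoted comma.
data = u'name,city\n"M\u00fcller, A",Z\u00fcrich\n'.encode('utf-8')
start = data.index(b'\n') + 1  # byte offset of the second line
row = parse_row(data, start, len(data))
```

No StringIO wrapper is involved, and the quoted comma is still handled by csv.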
CsvIter._get_row_lookup should work on a regular file from built-in
open (not codecs.open), opened in binary mode. I/O on a regular file
releases the GIL, which lets the main thread run in the meantime.
mmap objects don't do this.
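Binary mode ensures the offsets are byte offsets, so they are valid for
use with the mmap object in __getitem__. Splitting on the newline byte
requires an ASCII compatible encoding such as UTF-8, in which that byte
never occurs inside a multibyte character.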
Also, iterate in a for loop instead of calling readline in a while loop.
2.x file.next uses a read-ahead buffer to improve performance.
To see this, check tell() in a for loop: it reports a position past
the end of the current line.
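Putting the last few points together, the lookup could be built roughly like this. It is a sketch written for Python 3, not the pastebin code: build_row_lookup is a hypothetical stand-in for CsvIter._get_row_lookup, and the sample file is made up.

```python
import os
import tempfile

def build_row_lookup(path):
    """Return a list of (start, end) byte offsets, one pair per line."""
    lookup = []
    record_start = 0
    # Built-in open in binary mode: each line is a byte string,
    # so len(line) is a byte count.
    with open(path, 'rb') as f:
        for line in f:  # iterate the file instead of calling readline
            record_end = record_start + len(line)
            lookup.append((record_start, record_end))
            record_start = record_end
    return lookup

# Hypothetical sample: the first line holds a two-byte UTF-8 character.
raw = u'caf\u00e9,1\nabc,2\n'.encode('utf-8')
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

lookup = build_row_lookup(sample_path)
os.remove(sample_path)
```

Each (start, end) pair can then be used to slice the mmap object in __getitem__.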