[pypy-dev] A simple file reading is 2x slow wrt CPython

Ozan Çağlayan ozancag at gmail.com
Mon Jun 29 17:12:49 CEST 2015


Hello all,

Well, I am searching for my dream scientific language :)

The current codebase that I am working with is related to a language
translation software written in C++. I wanted to re-implement parts of
it in Python and/or Julia to both learn it (as I didn't write the C++
stuff) and maybe to make it available for other people who are
interested.

I saw Pyston last night, then I came back to PyPy.

As a first step, I tried to parse a 300MB structured text file
containing 1.1M lines like these:

0 ||| I love you mother . ||| label=number1 number2 number3 number4
label2=number5 number6 ... number19 ||| number20

Line-by-line access was actually pretty fast *but* trying to store
the lines in a Python list drains the RAM on my 4GB laptop. This is
disappointing. A raw text file (utf-8) of 300MB takes more than 1GB of
memory.
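
To make the difference concrete, here is a minimal sketch (not the attached
code; the function names and the sample data are made up for illustration).
Streaming keeps only the current line alive, while building a list keeps
every line object alive at once:

```python
import io


def count_lines_streaming(lines):
    """Iterate lazily: only the current line is referenced at any moment."""
    count = 0
    for _ in lines:
        count += 1
    return count


def load_all_lines(lines):
    """Materialize everything: each line stays alive as its own str object."""
    return list(lines)


# Tiny in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO(u"0 ||| a ||| x=1 ||| 0.5\n1 ||| b ||| x=2 ||| 0.7\n")
n_streamed = count_lines_streaming(sample)
sample.seek(0)
all_lines = load_all_lines(sample)
```

Every stored line carries some per-object overhead on top of its characters,
and the list adds one reference per line, which is probably part of why the
in-memory size can exceed the size on disk.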

Today I tried PyPy and ran some benchmarks. Each line is parsed as follows:
- Split the line on " ||| "
- Convert the 1st field to int and the 4th field to float.
- Clean up the label= stuff from the 3rd field using re.sub()
- Append a dict (1) or a class instance (2) representing each line to a list.
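
The steps above can be sketched roughly as follows. This is a minimal
illustration, not the attached NBest.py: the names parse_fields,
parse_line_to_dict, Hypothesis and LABEL_RE are hypothetical, the regular
expression is one assumed way to strip the label= prefixes, and the sample
line is invented to match the format shown earlier (labels in the third
" ||| "-separated field).

```python
import re

# Assumption: a label is any run of non-space, non-"=" characters followed by "=".
LABEL_RE = re.compile(r"[^\s=]+=")


def parse_fields(line):
    """Split one line on ' ||| ' and convert the fields."""
    fields = line.rstrip("\n").split(" ||| ")
    index = int(fields[0])                 # 1st field -> int
    text = fields[1]                       # the sentence itself
    scores = LABEL_RE.sub("", fields[2])   # drop the "label=" prefixes
    total = float(fields[3])               # 4th field -> float
    return index, text, scores, total


def parse_line_to_dict(line):
    """Variant (1): represent the line as a dict."""
    index, text, scores, total = parse_fields(line)
    return {"index": index, "text": text, "scores": scores, "total": total}


class Hypothesis(object):
    """Variant (2): represent the line as a class instance."""

    def __init__(self, index, text, scores, total):
        self.index = index
        self.text = text
        self.scores = scores
        self.total = total


def parse_line_to_object(line):
    return Hypothesis(*parse_fields(line))


# Hypothetical sample line in the same shape as the real data.
sample = "0 ||| I love you mother . ||| lm=-1.5 -2.0 tm=0.3 0.4 ||| -12.5\n"
records = [parse_line_to_dict(sample)]
objects = [parse_line_to_object(sample)]
```

In both variants the result of each call is appended to a list, so the whole
file ends up in memory at once.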

# Dict(1):
#   PyPy: ~1.4G RAM, ~12.7 seconds
#   CPython: ~1.2G RAM, 28.7 seconds

# Class(2):
#   PyPy: ~1.2G, ~11.1 seconds
#   CPython: ~1.3G, ~32 seconds

The memory measurements are not precise, as I tracked them visually using top :)
I'm attaching the code. I'm not an optimization guru, and I'm pretty sure that
there are suboptimal parts in it. But the crucial part is memory
usage. Normally those text files are ~1GB on disk, which means that
I can't represent them in memory with Python using this code. This is
bad. Any suggestions?

Thanks!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NBest.py
Type: text/x-python
Size: 2544 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20150629/ec0a5a69/attachment-0001.py>