memory usage

Nagy Gabor linux42 at freemail.c3.hu
Thu May 8 05:59:10 EDT 2003


On 03-May-07 14:16, John Machin wrote:
> Nagy Gabor <linux42 at freemail.c3.hu> wrote in message news:<mailman.1052227330.15773.python-list at python.org>...
> > I wrote a simple datafile parser, and it is quite memory hungry, and I
> > don't know if this is what I should expect, or is there a bug in my code.
> [snip]
> > def ParseFields():
> >   fields = []
> >   for ...:
> >     Data = StringIO.read( length)
> 
> You probably have trailing spaces or leading zeroes here; if so, this
> would be adding to your memory problem.

Yes, I have some, but the vast majority of the memory consumption came not
from the data being stored, but from the objects themselves.

> >     tmp = TD()
> >     tmp.Tag = T(name = 'name')
> >     tmp.Data = Data
> >     fields.append(tmp)
> 
> So for each field, you create a TD instance. One of the attributes of
> this TD instance is a *NEW* T instance. This is *TWO* *NEW* class
> instances per field. I repeat, PER FIELD.

Yes.

> You don't say what, if anything, you are doing with the Flag, Class,
> and Name attributes of the T-instance. Given that you have field.Data,
> what is field.Tag.Value? Does *every* instance of a field need a
> field.Tag.Name? You should exploit the (presumed) homogeneity of your
> data by factoring out the data description to a higher level e.g. one
> per column per table, instead of one per field. E.g. all the
> descriptive info for the 23rd column/field in table/record-type "XYZ"
> can be found in field_description["XYZ"][22] i.e. field_description is
> a dictionary of lists. You would also build an inverted index to get
> the field number (e.g. 22) from the field name (e.g.
> "salary_of_data_modeller")

Since the wasted memory does not come from the data I want to store, I
did not try to make things that complex.
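For the record, the factoring John describes might be sketched like this (the table name and field names are invented for illustration):

```python
# Sketch of the suggested factoring: one shared description list per
# table/record-type, plus an inverted index from field name to number.
field_description = {
    "XYZ": ["employee_id", "department", "salary_of_data_modeller"],
}

# Inverted index, built once: field name -> field number.
field_number = {
    table: {name: i for i, name in enumerate(names)}
    for table, names in field_description.items()
}

print(field_description["XYZ"][2])                     # -> salary_of_data_modeller
print(field_number["XYZ"]["salary_of_data_modeller"])  # -> 2
```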

> You should also consider combining the T and TD classes -- it is not
> apparent from your code if the current separation achieves anything
> positive; negatives include waste of memory and CPU, plus visual and
> mental clutter.

Actually, I have combined the T and TD classes. Creating only one new
class instance per field brought memory usage down to 122M (about half of
what I had before).

I have now eliminated the class instances entirely and store all the data
in tuples, and memory usage has dropped to 43M.
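To illustrate why the tuples are so much cheaper, a rough sketch (exact sizes depend on the Python version; measured with sys.getsizeof):

```python
import sys

class TD:
    pass

# One field stored as an instance: the object plus its attribute dict.
td = TD()
td.Tag = "name"
td.Data = "some data"
instance_size = sys.getsizeof(td) + sys.getsizeof(td.__dict__)

# The same field stored as a bare tuple.
tup = ("name", "some data")
tuple_size = sys.getsizeof(tup)

# The tuple carries no per-object __dict__, so it is considerably smaller.
assert tuple_size < instance_size
```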

Anyway, when I learnt the OO paradigm, I was taught that if I think of
something as a distinct "thing", I should create a class for it.

Right now, T contains information about the encoding/representation of the
field in the source file (which is needed to display or to recode it), and
TD contained the field data (and a T).
To save memory, I now have only one TD object per field, containing
everything, which I don't really like. I think this is visual and mental
clutter. For example, there is data from different sources where T would
not be present; now it will always be there. I don't like this, but if it
cuts the memory requirement in half, OK.

Alas, I cannot use the tuple approach everywhere (which would anyway mean
throwing away the whole OO paradigm): this particular source and format
happened to have a lowest level with 600k fields, but other sources and
formats don't have a common lowest level.

So either I waste twice the required memory by sticking with one class, or
I try to rewrite everything and store the leaf nodes in tuples.
Maybe I'll do that.

> Of course you could avoid the effort involved in a design strategy
> rethink and get some big-enough tactical wins by (1) using __slots__
> [mentioned by others, but they didn't remind you that this works only
> with new-style classes; __slots__ is silently ignored in a classic
> class]

Actually I found that a new-style class instance takes about 1/4 of the
memory of an old-style one (when empty), and this got me down to about 30M.
Using __slots__ in the new-style class did not give me any further memory gain.
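For completeness, here is what the __slots__ variant of a new-style class looks like; it does away with the per-instance __dict__, which in current Pythons normally does shrink instances (the attribute names here are my own):

```python
import sys

class TDPlain(object):              # new-style class, ordinary __dict__
    def __init__(self, tag, data):
        self.Tag = tag
        self.Data = data

class TDSlots(object):              # new-style class with __slots__
    __slots__ = ("Tag", "Data")     # no per-instance __dict__ is created
    def __init__(self, tag, data):
        self.Tag = tag
        self.Data = data

plain = TDPlain("name", "some data")
slim = TDSlots("name", "some data")

assert not hasattr(slim, "__dict__")
assert sys.getsizeof(slim) < sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
```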

> Looking at the other dimension of your MxN problem, why do you think
> you need to keep each row/record in memory?

Because you cannot tell in advance which object a newly read field will be
attached to, so in general I cannot start the writeout until the whole
source has been read.
Some formats are special, of course. The fields are usually sorted, so the
fields belonging to one object come one after another, and then I could
write out subtrees.
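Where a format is sorted like that, the subtree writeout could be streamed with itertools.groupby (a sketch; the field triples are invented):

```python
from itertools import groupby

# Hypothetical sorted field stream: (object_id, tag, data) triples,
# with all fields of one object adjacent in the input.
fields = [
    (1, "name", "alpha"),
    (1, "size", "10"),
    (2, "name", "beta"),
    (2, "size", "20"),
]

written = []
# Each group is one complete object, so it can be written out (and its
# memory released) before the next object's fields are read.
for obj_id, group in groupby(fields, key=lambda f: f[0]):
    written.append((obj_id, [(tag, data) for _, tag, data in group]))

assert written == [
    (1, [("name", "alpha"), ("size", "10")]),
    (2, [("name", "beta"), ("size", "20")]),
]
```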

Regards,
G




