How can I create customized classes that have similar properties as 'str'?

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Mon Nov 26 21:45:51 EST 2007


On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:

> I mentioned trigram counting as an illustrative case. In fact, you'll
> often need to define patterns more complex than that, and tens of
> megabytes of text may generate millions of them, and I've observed they
> quickly  ate up the 8G memory of a workstation in a few minutes.
> Manipulating these patterns can be tricky, you can easily waste a lot of
> memory without taking extra care. I just thought if I define my pattern
> class with this 'atom' property, coding efforts could be easier later.

I'm just not getting the same results as you when I try this. I'm finding 
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of 
how you generate the patterns. What's important is the number of patterns
you produce, and "millions" isn't that huge a number, even without 
interning the strings.
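For the record, interning is trivial to add later if you decide you want
it: intern() is a builtin in Python 2 and lives in the sys module in
Python 3. A quick sketch, using a made-up string just to show the effect:

import sys

text = "frequencyfrequency"
a = sys.intern(text[0:3])    # plain intern(...) on Python 2
b = sys.intern(text[9:12])
assert a is b                # equal slices now share one string object

Without interning, the two slices would be equal but distinct objects, so
duplicate patterns would each cost their own copy.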

Obviously I'm not running your code, but I can build a dict with millions 
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of 
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has 
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64,939,962 patterns found, of which 17,031,467 are unique. There's 
paging, yes, and my PC runs a little slow when I try to switch from one 
running application to another, but nothing unusable. Opening a dozen 
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely 
"millions" of strings, even if you are keeping two, three, ten redundant 
copies. Assuming an average length of twenty bytes per pattern (you said 
trigrams, but I'd rather over-estimate than under), and even assuming 
that only half the 8GB are available to Python, you should be able to 
store something of the order of one hundred million patterns easily:

4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional 
overhead.)
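If you want the real numbers on your machine rather than my ballpark,
newer Pythons (2.6 and up) can tell you directly; the exact figures vary
with Python version and 32- versus 64-bit builds:

import sys

print(sys.getsizeof("x" * 20))   # a 20-char string, header included
print(sys.getsizeof(""))         # the fixed per-string overhead alone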

If you're running into problems with merely millions of patterns, then 
you're using something like two orders of magnitude more memory than you 
should be.

I don't think that the problem lies where you think it does. If you have 
a dict with millions of keys, it doesn't matter how many times each 
pattern exists in the corpus because the key only exists *once* in the 
dict. Duplicate the dict, or even generate it from scratch, and at worst 
you double the memory used by the keys. And probably not even that.
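You can see that with a toy example: no matter how many times a pattern
occurs, the dict holds one key and one integer count for it.

patterns = ["fre", "req", "equ"] * 1000000   # three million occurrences

d = {}
for p in patterns:
    d[p] = d.get(p, 0) + 1

print(len(d))       # 3: only the unique patterns end up as keys
print(d["fre"])     # 1000000: a count, not a million string copies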

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a 
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1


Notice that the real killer in the above algorithm is that you need 
enough storage, not just for the unique patterns, but for EVERY separate 
occurrence of each pattern. Nothing to do with how dicts operate, and 
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory. 
With 200MB of text, you have 209,715,198 trigrams in your list. The 
pointers alone will take almost a gigabyte, assuming 32-bit pointers.
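Spelling that arithmetic out:

text_bytes = 200 * 1024**2        # 200MB of text
trigrams = text_bytes - 3 + 1     # one trigram starts at every position
print(trigrams)                   # 209715198
print(trigrams * 4)               # ~800MB just for the list's pointers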

If this is your approach, interning the strings won't save you. You 
almost certainly should change to a lazy approach, and use a generator to 
make each pattern as needed and then throw it away:

def make_patterns(s, n=3):
    """Yield n-grams."""
    if len(s) <= n:
        yield s
    else:
        for i in range(len(s)-n+1):
            yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word.strip()):
            d[pattern] = d.get(pattern, 0) + 1
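If you're on Python 2.5 or later, collections.defaultdict will supply the
zero for you, which reads a little cleaner than the d.get() dance. Same
algorithm, just a different counting idiom:

from collections import defaultdict

d = defaultdict(int)    # missing keys start at 0
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word):
            d[pattern] += 1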



Obviously I haven't seen your code and don't actually know what you are 
doing. If my 1GB machine can deal with a dict of 17 million unique keys 
from a corpus of 240MB with no special memory management, your 8GB 
machine should too.



-- 
Steven.


