How can I create customized classes that have similar properties as 'str'?

Licheng Fang fanglicheng at gmail.com
Tue Nov 27 08:16:28 EST 2007


On Nov 27, 10:45 am, Steven D'Aprano
<ste... at REMOVE.THIS.cybersource.com.au> wrote:
> On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > I mentioned trigram counting as an illustrative case. In fact, you'll
> > often need to define patterns more complex than that, and tens of
> > megabytes of text may generate millions of them, and I've observed they
> > quickly ate up the 8G memory of a workstation in a few minutes.
> > Manipulating these patterns can be tricky; you can easily waste a lot of
> > memory without taking extra care. I just thought that if I defined my
> > pattern class with this 'atom' property, later coding would be easier.
>
> I'm just not getting the same results as you when I try this. I'm finding
> that with no extra effort at all, it just works.
>
> The size of your corpus is not important. Neither is the complexity of
> how you generate the patterns. What's important is the number of patterns
> you produce, and "millions" isn't that huge a number, even without
> interning the strings.
>
> Obviously I'm not running your code, but I can build a dict with millions
> of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
> memory and not run into any serious problems.
>
> I've just processed roughly 240MB of random emails, generating n-grams up
> to length 5. The emails include binary attachments and HTML etc., so I'm
> getting lots of patterns that don't normally exist in natural languages
> (e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
> only 1GB, and that's being shared with about a dozen other apps (including
> such memory hogs as Firefox).
>
> Results? 64939962 patterns found, of which 17031467 are unique. There's
> paging, yes, and my PC runs a little slow when I try to switch from one
> running application to another, but nothing unusable. Opening a dozen
> YouTube videos at once impacts performance worse.
>
> I can't think what you're doing to use up 8GB of RAM for merely
> "millions" of strings, even if you are keeping two, three, ten redundant
> copies. Assuming an average length of twenty bytes per pattern (you said
> trigrams, but I'd rather over-estimate than under), and even assuming
> that only half the 8GB are available to Python, you should be able to
> store something of the order of one hundred million patterns easily:

My task is to identify sequences of tokens (phrases) in a bilingual
corpus that are possible translations of each other. I need to check all
the subsequences of a sentence to get the candidate phrase pairs, which
makes the problem different from n-gram counting: the number of possible
phrases doesn't grow linearly with n, but approximately with n^2 (n
being the sentence length). My first version of the program consumed
almost twice as much memory as the current one, because I discovered
that in computing different statistics I was regenerating the patterns,
so the counting dictionaries ended up keyed by duplicate copies of equal
patterns (a == b, yet a is not b). Wouldn't it be convenient if I could
restrict the pattern class so that equal patterns are always the same
object? That would let me avoid such subtle but significant bugs.
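
To make concrete what I mean by restricting the pattern class: something
along these lines (just a sketch, not my actual code; the name Pattern
and the WeakValueDictionary pool are only illustrative) would guarantee
that equal patterns are always the same object:

import weakref

class Pattern(str):
    """A str subclass that keeps at most one live instance per value."""

    _pool = weakref.WeakValueDictionary()   # value -> canonical instance

    def __new__(cls, value):
        # Reuse the canonical instance if one already exists; otherwise
        # create it and remember it in the pool.
        try:
            return cls._pool[value]
        except KeyError:
            self = str.__new__(cls, value)
            cls._pool[value] = self
            return self

a = Pattern('fre')
b = Pattern('fre')
assert a == b and a is b    # equal patterns share a single instance

With something like this, counting dictionaries built at different
times would at least share their key objects instead of each holding
its own copies.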

>
> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
>
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)



>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you have
> a dict with millions of keys, it doesn't matter how many times each
> pattern exists in the corpus because the key only exists *once* in the
> dict. Duplicate the dict, or generate it from scratch even, and at worse
> you double the memory used by the keys. And probably not even that.
>
> The only thing I can think of that might explain why you're using so much
> memory is if you are generating *all* the patterns up front, say in a
> list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
>     d[pattern] = d.get(pattern, 0) + 1
>

No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice (once in d.get() and once in the assignment)? Is there a
way to do that when the dictionary values are of a primitive type like
int?
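
One workaround I can think of (just a sketch; maybe there's something
cleaner) is to store a mutable one-element list instead of a bare int,
so each update goes through a single setdefault() call rather than a
get() followed by an assignment:

patterns = ['fre', 'req', 'equ', 'que', 'fre']   # any iterable of patterns
d = {}
for pattern in patterns:
    # setdefault() looks the key up once; the increment then mutates the
    # list in place, with no second dictionary lookup.
    d.setdefault(pattern, [0])[0] += 1
# d == {'fre': [2], 'req': [1], 'equ': [1], 'que': [1]}

The counts end up wrapped in lists and have to be unwrapped at the end,
and a throwaway [0] list gets built on every iteration, so I'm not sure
it's actually a win, especially since, as far as I know, CPython strings
cache their hash value, so the second lookup in the d.get() version
doesn't recompute the hash anyway.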

> Notice that the real killer in the above algorithm is that you need
> enough storage, not just for the unique patterns, but for EVERY separate
> occurrence of each pattern. Nothing to do with how dicts operate, and
> everything to do with the algorithm you (hypothetically) are using.
>
> If that's what you're doing, then no wonder you're running out of memory.
> With 200MB of text, you have 209715198 trigrams in your list. The
> pointers alone will take almost a gigabyte, assuming 32-bit pointers.
>
> If this is your approach, interning the strings won't save you. You
> almost certainly should change to a lazy approach, and use a generator to
> make each pattern as needed and then throw it away:
>
> def make_patterns(s, n=3):
>     """Yield n-grams."""
>     if len(s) <= n:
>         yield s
>     else:
>         for i in range(len(s)-n+1):
>             yield s[i:i+n]
>
> d = {}
> fp = open('corpus', 'r')
> for line in fp:
>     for word in line.split():
>         for pattern in make_patterns(word.strip()):
>             d[pattern] = d.get(pattern, 0) + 1
>
> Obviously I haven't seen your code and don't actually know what you are
> doing. If my 1GB machine can deal with a dict of 17 million unique keys
> from a corpus of 240MB with no special memory management, your 8GB
> machine should too.
>
> --
> Steven.



