How can I create customized classes that have similar properties as 'str'?

Chris Mellon arkanes at gmail.com
Tue Nov 27 12:18:49 EST 2007


On Nov 27, 2007 7:16 AM, Licheng Fang <fanglicheng at gmail.com> wrote:
> On Nov 27, 10:45 am, Steven D'Aprano
>
> <ste... at REMOVE.THIS.cybersource.com.au> wrote:
> > On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > > I mentioned trigram counting as an illustrative case. In fact, you'll
> > > often need to define patterns more complex than that, and tens of
> > > megabytes of text may generate millions of them, and I've observed they
> > > quickly ate up the 8GB of memory on a workstation in a few minutes.
> > > Manipulating these patterns can be tricky; you can easily waste a lot
> > > of memory without taking extra care. I just thought that if I defined
> > > my pattern class with this 'atom' property, later coding would be
> > > easier.
> >
> > I'm just not getting the same results as you when I try this. I'm finding
> > that with no extra effort at all, it just works.
> >
> > The size of your corpus is not important. Neither is the complexity of
> > how you generate the patterns. What's important is the number of patterns
> > you produce, and "millions" isn't that huge a number, even without
> > interning the strings.
> >
> > Obviously I'm not running your code, but I can build a dict with millions
> > of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
> > memory and not run into any serious problems.
> >
> > I've just processed roughly 240MB of random emails, generating n-grams up
> > to length 5. The emails include binary attachments and HTML etc., so I'm
> > getting lots of patterns that don't normally exist in natural languages
> > (e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
> > only 1GB, and that's being shared with about a dozen other apps (including
> > such memory hogs as Firefox).
> >
> > Results? 64939962 patterns found, of which 17031467 are unique. There's
> > paging, yes, and my PC runs a little slow when I try to switch from one
> > running application to another, but nothing unusable. Opening a dozen
> > YouTube videos at once impacts performance worse.
> >
> > I can't think what you're doing to use up 8GB of RAM for merely
> > "millions" of strings, even if you are keeping two, three, ten redundant
> > copies. Assuming an average length of twenty bytes per pattern (you said
> > trigrams, but I'd rather over-estimate than under), and even assuming
> > that only half the 8GB are available to Python, you should be able to
> > store something of the order of one hundred million patterns easily:
>
> My task is identifying sequences of tokens (phrases) that are possible
> translations of each other in a bilingual corpus. I need to check all
> the subsequences of a sentence to get the possible phrase pairs. This
> makes the problem different from n-gram counting in that the number of
> possible phrases doesn't grow linearly with n, but approximately with
> n^2 (n being the sentence length). My first version of the program
> consumed almost twice as much memory as the current one, because I
> discovered that, in gathering different statistics, I was regenerating
> the patterns, so the counting dictionaries ended up with duplicated
> pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
> could restrict the pattern class to never generate identical instances,
> so that I could avoid such subtle but significant bugs?
>
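
(To make the n^2 growth concrete: a sentence of n tokens has n*(n+1)/2
contiguous subsequences. Here's a minimal sketch, not your actual code,
that enumerates them; the function name and the optional length cap are
just illustrative.)

def iter_phrases(tokens, max_len=None):
    """Yield every contiguous subsequence (phrase) of tokens as a tuple."""
    n = len(tokens)
    for start in range(n):
        longest = n - start if max_len is None else min(max_len, n - start)
        for length in range(1, longest + 1):
            yield tuple(tokens[start:start + length])

sentence = "the cat sat on the mat".split()
phrases = list(iter_phrases(sentence))
# 6 tokens -> 6 * 7 / 2 == 21 contiguous subsequences
assert len(phrases) == len(sentence) * (len(sentence) + 1) // 2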

Implement __hash__ and __eq__ on your pattern class. If the same
pattern compares equal and hashes the same then it will be a "matching
key" as far as the dict is concerned and will only be stored once.
This is probably cheaper than explicit interning anyway (you don't
need to search an intern table).
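
For instance, a minimal sketch of such a pattern class (the class name
and the token-tuple attribute are just placeholders, not your actual
design):

class Pattern(object):
    """Wraps a tuple of tokens; equal token sequences compare and hash alike."""

    def __init__(self, tokens):
        self.tokens = tuple(tokens)

    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens

    def __ne__(self, other):            # needed on Python 2.x
        return not self.__eq__(other)

    def __hash__(self):
        return hash(self.tokens)

counts = {}
for p in (Pattern("a b".split()), Pattern("a b".split())):
    counts[p] = counts.get(p, 0) + 1
# counts now holds one key with a count of 2, even though the two Pattern
# objects are distinct instances (a == b, yet a is not b)
assert len(counts) == 1 and counts[Pattern(["a", "b"])] == 2

The dict never stores the second object as a new key; it just increments
the count on the existing one.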


> > The only thing I can think of that might explain why you're using so much
> > memory is if you are generating *all* the patterns up front, say in a
> > list, before adding them to the dict:
> >
> > # Generate one massive list of patterns containing many duplicates
> > patterns = make_patterns(text)
> > # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> > d = {}
> > for pattern in patterns:
> >     d[pattern] = d.get(pattern, 0) + 1
> >
>
> No, I wasn't doing that.
> BTW, do you think the pattern counting code can avoid hashing the
> pattern twice? Is there a way to do that when the dictionary values
> are of a primitive type?
>

Hashing isn't really an expensive operation, and for strings the hash
value is even cached on the object after the first call. If you
implement your own __hash__ method you can do the same, but I wouldn't
bother unless you benchmark it and discover that hashing is a
bottleneck.
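
If you do end up needing it, a minimal sketch of that kind of caching
might look like this (names are illustrative, and it assumes the token
tuple never changes after construction):

class Pattern(object):
    """Same idea as before, but the hash is computed at most once."""

    def __init__(self, tokens):
        self.tokens = tuple(tokens)     # assumed immutable after construction
        self._hash = None               # filled in lazily on first use

    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens

    def __ne__(self, other):            # needed on Python 2.x
        return not self.__eq__(other)

    def __hash__(self):
        if self._hash is None:          # compute only on the first call
            self._hash = hash(self.tokens)
        return self._hash

With that, the two lookups in d[pattern] = d.get(pattern, 0) + 1 still
call __hash__ twice, but only the first call does any real work; the
second just returns the stored value.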


