How can I create customized classes that have similar properties as 'str'?

Sun Nov 25 05:42:36 EST 2007

On Nov 25, 5:59 am, Steven D'Aprano <st... at REMOVE-THIS-
cybersource.com.au> wrote:
> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> > On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> > mail-0306.20.chr0n... at spamgourmet.com> wrote:
> >> Licheng Fang wrote:
> >> > I find myself frequently in need of classes like this for two
> >> > reasons. First, it's efficient in memory.
>
> >> Are you using millions of objects, or MB size objects? Otherwise, this
> >> is no argument.
>
> > Yes, millions.
>
> Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
> megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????
>
> Tens of megabytes is not a lot of data.
>
> If the average word size is ten characters, then one million words takes
> ten million bytes, or a little shy of ten megabytes. Even if you are
> using four-byte characters, you've got 40 MB, still a moderate amount of
> data on a modern system.

I mentioned trigram counting only as an illustrative case. In practice
you often need to define patterns more complex than that. Tens of
megabytes of text can generate millions of them, and I've watched them
eat up the 8 GB of memory on a workstation within a few minutes.
Manipulating these patterns can be tricky: without extra care you can
easily waste a lot of memory. I just thought that if I defined my
pattern class with this 'atom' property, the coding would be easier
later.
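
To make the 'atom' idea concrete, here is a minimal sketch for Python 3.
The class name Pattern, its items attribute and the _pool cache are all
hypothetical names chosen for illustration. It assumes the items are
hashable and that instances are treated as immutable by convention.

```python
import weakref


class Pattern:
    """Hypothetical 'atom' class: one shared instance per distinct value."""

    # '__weakref__' lets the pool refer to instances without keeping them alive.
    __slots__ = ("items", "__weakref__")

    # Maps a tuple of items to the single live instance built from it.
    _pool = weakref.WeakValueDictionary()

    def __new__(cls, *items):
        key = tuple(items)
        obj = cls._pool.get(key)
        if obj is None:
            # First time this value is seen: create and register it.
            obj = super().__new__(cls)
            obj.items = key
            cls._pool[key] = obj
        return obj

    def __repr__(self):
        return "Pattern%r" % (self.items,)


p = Pattern("the", "quick", "fox")
q = Pattern("the", "quick", "fox")
assert p is q          # equal values share one object

counts = {p: 1}
counts[q] += 1         # q is the same key as p
```

Since there is only one instance per value, the default identity-based
hashing and comparison are enough for dict keys. The weak-value pool
means a pattern is dropped once nothing else refers to it.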

>
> > In my natural language processing tasks, I almost always
> > need to define patterns, identify their occurrences in a huge data set, and
> > count them. Say, I have a big text file, consisting of millions of
> > words, and I want to count the frequency of trigrams:
>
> > trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
>
> > I can save the counts in a dict D1. Later, I may want to recount the
> > trigrams, with some minor modifications, say, doing it on every other
> > line of the input file, and the counts are saved in dict D2. Problem is,
> > D1 and D2 have almost the same set of keys (trigrams of the text), yet
> > the keys in D2 are new instances, even though these keys probably have
> > already been inserted into D1. So I end up with unnecessary duplicates
> > of keys. And this can be a great waste of memory with huge input data.
>
> All these keys will almost certainly add up to only a few hundred
> megabytes, which is a reasonable size of data but not excessive. This
> really sounds to me like a case of premature optimization. I think you
> are wasting your time solving a non-problem.
>
> [snip]
>
> > Wow, I didn't know this. But how exactly does Python manage these
> > strings? My interpreter gave me these results:
>
> > >>> a = 'this'
> > >>> b = 'this'
> > >>> a is b
> > True
> > >>> a = 'this is confusing'
> > >>> b = 'this is confusing'
> > >>> a is b
> > False
>
> It's an implementation detail. You shouldn't use identity testing unless
> you actually care that two names refer to the same object, not because
> you want to save a few bytes. That's poor design: it's fragile,
> complicated, and defeats the purpose of using a high-level language like
> Python.
>
> --
> Steven.
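
For the string case quoted above, a small sketch of the difference
between equality and identity. It assumes Python 3, where explicit
interning is spelled sys.intern (it was the builtin intern in Python 2).
Whether two equal strings happen to be the same object without explicit
interning is an implementation detail, so the code does not assert it.

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them.
a = "".join(["this is ", "confusing"])
b = "".join(["this is ", "confusing"])

assert a == b              # equality is what the language guarantees

# 'a is b' may be True or False depending on the implementation.

# Explicit interning returns one shared object for equal strings.
ia = sys.intern(a)
ib = sys.intern(b)
assert ia is ib
```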