How can I create customized classes that have similar properties as 'str'?

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Nov 24 16:59:50 EST 2007


On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:

> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> mail-0306.20.chr0n... at spamgourmet.com> wrote:
>> Licheng Fang wrote:
>> > I find myself frequently in need of classes like this for two
>> > reasons. First, it's efficient in memory.
>>
>> Are you using millions of objects, or MB size objects? Otherwise, this
>> is no argument.
> 
> Yes, millions. 


Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of 
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words take 
ten million bytes, or a little shy of ten megabytes. Even if you are 
using four-byte characters, you've got 40 MB, still a moderate amount of 
data on a modern system.
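
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name raw_text_megabytes is made up for illustration, a megabyte is assumed to be 2**20 bytes, and per-object overhead is ignored.

```python
# Back-of-the-envelope estimate of the raw character data only.
# Per-object overhead (string headers, dict slots) is deliberately ignored.
WORDS = 1_000_000
AVG_CHARS = 10

def raw_text_megabytes(words, avg_chars, bytes_per_char):
    """Size of the character data alone, in MB (assuming 1 MB = 2**20 bytes)."""
    return words * avg_chars * bytes_per_char / 2**20

one_byte = raw_text_megabytes(WORDS, AVG_CHARS, 1)
four_byte = raw_text_megabytes(WORDS, AVG_CHARS, 4)
print(f"{one_byte:.1f} MB, {four_byte:.1f} MB")  # prints "9.5 MB, 38.1 MB"
```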


> In my natural language processing tasks, I almost always
> need to define patterns, identify their occurrences in a huge data set, and
> count them. Say, I have a big text file, consisting of millions of
> words, and I want to count the frequency of trigrams:
> 
> trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
> 
> I can save the counts in a dict D1. Later, I may want to recount the
> trigrams, with some minor modifications, say, doing it on every other
> line of the input file, and the counts are saved in dict D2. Problem is,
> D1 and D2 have almost the same set of keys (trigrams of the text), yet
> the keys in D2 are new instances, even though these keys probably have
> already been inserted into D1. So I end up with unnecessary duplicates
> of keys. And this can be a great waste of memory with huge input data.

All these keys will almost certainly add up to only a few hundred 
megabytes, which is a reasonable size of data but not excessive. This 
really sounds to me like a case of premature optimization. I think you 
are wasting your time solving a non-problem.
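
Rather than guess, it is easy to measure. What follows is a rough sketch, not a benchmark: trigrams() is written to match the example quoted above, approx_key_bytes() is a hypothetical helper, and the sample sentence is invented. Note that sys.getsizeof counts only the tuple objects themselves, not the words inside them or the dict.

```python
import sys
from collections import Counter

def trigrams(items):
    """Return consecutive triples, e.g. [1, 2, 3, 4] -> [(1, 2, 3), (2, 3, 4)]."""
    return list(zip(items, items[1:], items[2:]))

def approx_key_bytes(counts):
    """Rough size of the tuple keys alone (hypothetical helper).

    Ignores the strings inside each tuple and the dict's own storage.
    """
    return sum(sys.getsizeof(key) for key in counts)

# Invented sample text; substitute the real corpus to get a real number.
words = "the cat sat on the mat and the cat sat down".split()
counts = Counter(trigrams(words))

print(counts[("the", "cat", "sat")])   # this trigram occurs twice in the sample
print(len(counts), "distinct keys,", approx_key_bytes(counts), "bytes of tuples")
```

Run on the real data, a number like this says whether the duplicated keys are worth worrying about at all.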



[snip]
> Wow, I didn't know this. But exactly how does Python manage these strings? My
> interpreter gave me these results:
> 
> >>> a = 'this'
> >>> b = 'this'
> >>> a is b
> True
> >>> a = 'this is confusing'
> >>> b = 'this is confusing'
> >>> a is b
> False


Whether two equal strings turn out to be the same object is an 
implementation detail. You should use identity testing only when you 
actually care that two names refer to the same object, not because you 
want to save a few bytes. Relying on it for that is poor design: it's 
fragile, complicated, and defeats the purpose of using a high-level 
language like Python.
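
To make the difference concrete, here is a small sketch. It assumes Python 3, where explicit interning is spelled sys.intern (in Python 2 it is the built-in intern). The strings are built at run time so that they are not simply one shared compile-time constant.

```python
import sys

# Build two equal strings at run time rather than as literal constants.
a = " ".join(["this", "is", "confusing"])
b = " ".join(["this", "is", "confusing"])

# Equality compares contents and is what you almost always want.
print(a == b)    # True

# Whether `a is b` holds is up to the implementation, so it is not tested here.

# sys.intern returns one canonical object per distinct string value,
# so identity is guaranteed only after interning explicitly.
a2 = sys.intern(a)
b2 = sys.intern(b)
print(a2 is b2)  # True
```

Even with explicit interning, compare with == in ordinary code; the shared object is a memory detail, not something program logic should depend on.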




-- 
Steven.


