Writing huge Sets() to disk

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Mon Jan 10 20:38:51 EST 2005


Tim Peters wrote:
> [Tim Peters]
> 
>>>As I mentioned before, if you store keys in sorted text files,
>>>you can do intersection and difference very efficiently just by using
>>>the Unix `comm` utility.
> 
> 
> [Martin MOKREJŠ]
> 
>>Now I got your point. I understand comm(1) is written in C, but it still
>>has to scan file1 once and file2 n times, where n is the number of lines
>>in file1, right? Without any index ... I'll consider it, actually I will test it,
>>thanks!
> 
> 
> `comm` is much more efficient than that.  Note that the input files
> have to be sorted.  Then, in a single pass over both files (not 2
> passes, not 3, not n, just 1), it can compute any subset of these
> three (do `man comm`):
> 
> 1. The words common to both files.
> 2. The words unique to "the first" file.
> 3. The words unique to "the second" file.

I read the manpage actually before answering to the list.

> It's essentially just doing a merge on two sorted lists, and how it
> works should be obvious if you give it some thought.  It takes time
> proportional to the sum of the lengths of the input files, and nothing
> *can* be faster than that.

Well, I think I understand now, because Scott David Daniels posted what is
probably the very same logic in Python code earlier in this thread.
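
Just so I'm sure I follow the idea, here is a minimal, untested sketch of such a
single-pass merge (the function name, the list-based return values and the example
inputs are mine, not Scott's code):

def comm_merge(sorted_a, sorted_b):
    """Single pass over two sorted iterables of words.

    Returns (only_in_a, only_in_b, common), i.e. the three columns
    that comm(1) can print for two sorted files.
    """
    only_a, only_b, common = [], [], []
    iter_a, iter_b = iter(sorted_a), iter(sorted_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            common.append(a)
            a, b = next(iter_a, None), next(iter_b, None)
        elif a < b:
            only_a.append(a)
            a = next(iter_a, None)
        else:
            only_b.append(b)
            b = next(iter_b, None)
    # one side is exhausted; whatever remains on the other side is unique to it
    while a is not None:
        only_a.append(a)
        a = next(iter_a, None)
    while b is not None:
        only_b.append(b)
        b = next(iter_b, None)
    return only_a, only_b, common

# e.g. comm_merge(["atg", "gcc"], ["atg", "ttt"]) -> (['gcc'], ['ttt'], ['atg'])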

>>I was really hoping I'd get an answer on how to alter the indexes for dictionaries
>>in python.
> 
> 
> Sorry, I don't have a guess for what that might mean.

I'm not an expert; MySQL, for example, gives you the ability to index, say, only
the first 10 characters of a text column (typically a varchar).
Just out of curiosity, I'd like to know how to do that in Python.
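
As far as I understand, Python dictionaries are plain hash tables and have no such
option, so the closest I can imagine (an untested sketch; the function name and the
example words are just for illustration) is a second dictionary keyed on the
truncated word:

def build_prefix_index(words, prefix_len=10):
    """Map each prefix of length prefix_len to the full words sharing it.

    A rough emulation of MySQL's "index only the first N characters of a
    varchar column"; the dict itself still hashes the whole (truncated) key.
    """
    index = {}
    for word in words:
        index.setdefault(word[:prefix_len], []).append(word)
    return index

idx = build_prefix_index(["internationalization", "internationalism", "python"])
print(idx["internatio"])   # ['internationalization', 'internationalism']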

When importing data from a flat file into a MySQL table, there's an
option to delay indexing until the very last moment, when all keys are
loaded (it doesn't make sense to re-create the index after each new
row is added to the table). I believe the same cpu/io is wasted
in this case: when I create a dictionary and fill it with data,
I want to create the index afterwards, not after every key/value pair
is recorded.
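
For a plain Python dict there is no separate index to delay (every insert is already
just a hash-table update), but the analogous pattern I have in mind, sketched below
with a made-up input file name, is to load all key/value pairs first and build any
sorted view only once at the end:

import bisect

counts = {}
for line in open("words_a.txt"):            # hypothetical flat file, one word per line
    word = line.strip()
    counts[word] = counts.get(word, 0) + 1

# the "index" is built once, after all key/value pairs are loaded
sorted_words = sorted(counts)

def has_word(word):
    """Binary search in the sorted key list built above."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word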

>>You convinced me not to even try to construct the theoretical dictionary,
>>as it will take ages just to create. Even if I'd manage, I couldn't
>>save it (the theoretical and possibly not even the dict(theoret) - dict(real)
>>result).
> 
> 
> Worse, it didn't sound like a good approach even if you could save it.
> 
> 
>>Still, before I give the whole project, once more:
>>
>>I'll parse some text files, isolate separate words and add them to
>>either Set(), list, dict, flatfile line by line.
>>
>>Depending on the above, I'll compare them and look for those unique
>>to some "language". I need to keep track of frequencies used
>>in every such language,
> 
> 
> Frequencies of what?  "A language" here can contain some words
> multiple times each?

Yes, to compute the frequency of each word used in, say, "language A",
I'll count the number of occurrences and then compute the frequency from that.
The frequency should be recorded as the value in the dictionary,
where the keys are unique and represent the word itself (or its hash,
as recommended already).

Once more: the dictionary will contain every word only once, so it really
will be a unique key.
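
In plain Python that bookkeeping would look something like this (just a sketch; the
example word list and the naive count/total formula stand in for whatever the real
parser and the final algorithm produce):

words = ["atg", "gcc", "atg", "ttt", "atg"]     # stand-in for words parsed from one language

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1      # each distinct word is a unique key

total = float(sum(counts.values()))
frequencies = {}
for word, n in counts.items():
    frequencies[word] = n / total               # e.g. frequencies['atg'] == 0.6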

>>so the dict approach would be the best.  The number stored as a value would
>>be a float ^H^H^H^H^H^H Decimal() type - very small number.
> 
> 
> Integers are the natural type for counting, and require less memory
> than floats or decimals.

I hoped I was clear ... when I said I'll compute the frequency of those words.
The algorithm to compute it will be subject to change during development. ;)


>>Once more, I expect to have between E4 or E5 and E8??? words
>>stored in 20 dictionaries (remember the words of sizes in range 1-20?).
>>Each of those 20 dictionaries should be, I believe, indexed just once.
>>The indexing method should know all entries in a given file are of the same
>>size, e.g. 5 chars, 15 chars, 20 chars etc.
> 
> 
> I think you're making this more complicated than it needs to be.

I hope the algorithm can skip some logic. For example, it could turn off locking
support, index only part of the key, etc.

> 
> 
>>I already did implement the logic to walk through those 20 different
>>dictionaries from language a and from language b and find out those
>>unique to a or common to both of them. Why I went to ask on this list
>>was to make sure I took right approach. Sets seemed to be better solution
>>for the simple comparison (common, unique). To keep track of those
>>very small frequencies, I anyway have to have dictionaries. I say
>>that again, how can I improve speed of dictionary indexing?
>>It doesn't matter here if I have 10E4 keys in dictionary or
>>10E8 keys in a dictionary.
> 
> 
> What reason do you have for believing that the speed of dictionary
> indexing is worth any bother at all to speed up?  Dict lookup is
> generally considered to be extremely fast already.  If you're just
> *guessing* that indexing is the bottleneck, you're probably guessing
> wrong -- run a profiler to find out where the real bottleneck is.

For example, looking up each key and its value only once in the whole for loop tells me
I don't need an index. Yes, I'll do this 4 times for those 4 languages, but
I still think it's faster to live without an index when I can sort the
records. I think I like the sorted text file approach, and the dictionary
approach without an index would be almost the same, especially if I manage
to tell the db layer not to move the cursor randomly but just to walk down the
pre-sorted data.
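
Dumping such a dictionary as a sorted flat file should be cheap, and the result can
then be walked sequentially or handed straight to comm(1); an untested sketch, with a
made-up file name and a tab-separated word/count layout:

def dump_sorted(counts, path):
    """Write one 'word<TAB>count' line per key, in sorted key order."""
    out = open(path, "w")
    for word in sorted(counts):
        out.write("%s\t%d\n" % (word, counts[word]))
    out.close()

dump_sorted({"atg": 3, "gcc": 1}, "lang_a_len3.txt")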

  As an extra, I could record those frequencies within a dictionary.
Actually, I need them; that's why I do all this work.

The idea of using Sets() was just to find the unique/common words more
quickly, because of the possible speed gain. But I don't see a way to solve
this without dictionaries.
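
Though, now that I write it down, the dictionaries alone already answer the
common/unique questions; a small sketch with made-up contents:

dict_a = {"atg": 3, "gcc": 1, "ttt": 2}   # word -> count, language A
dict_b = {"atg": 5, "ccc": 4}             # word -> count, language B

common      = [w for w in dict_a if w in dict_b]      # ['atg']
unique_to_a = [w for w in dict_a if w not in dict_b]  # ['gcc', 'ttt']
unique_to_b = [w for w in dict_b if w not in dict_a]  # ['ccc']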





