Writing huge Sets() to disk

Tim Peters tim.peters at gmail.com
Mon Jan 10 18:36:09 EST 2005


[Tim Peters]
>> As I mentioned before, if you store keys in sorted text files,
>> you can do intersection and difference very efficiently just by using
>> the Unix `comm` utility.

[Martin MOKREJŠ]
> Now I get your point. I understand that comm(1) is written in C, but it still
> has to scan file1 once and file2 n times, where n is the number of lines
> in file1, right? Without any index ... I'll consider it, and actually test it,
> thanks!

`comm` is much more efficient than that.  Note that the input files
have to be sorted.  Then, in a single pass over both files (not 2
passes, not 3, not n, just 1), it can compute any subset of these
three (do `man comm`):

1. The words common to both files.
2. The words unique to "the first" file.
3. The words unique to "the second" file.

It's essentially just doing a merge on two sorted lists, and how it
works should be obvious if you give it some thought.  It takes time
proportional to the sum of the lengths of the input files, and nothing
*can* be faster than that.
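
If you'd rather stay in Python than shell out to comm, here's a rough
sketch of the same single-pass merge, assuming each input file is
already sorted and holds one word per line (the function name and tags
are just for illustration):

    def classify(path_a, path_b):
        # Single pass over two sorted word files, like comm(1):
        # yields (word, tag) with tag 'both', 'a-only', or 'b-only'.
        fa, fb = open(path_a), open(path_b)
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a == b:
                yield a.rstrip(), 'both'
                a, b = fa.readline(), fb.readline()
            elif a < b:
                yield a.rstrip(), 'a-only'
                a = fa.readline()
            else:
                yield b.rstrip(), 'b-only'
                b = fb.readline()
        # Whatever is left in one file is unique to that file.
        while a:
            yield a.rstrip(), 'a-only'
            a = fa.readline()
        while b:
            yield b.rstrip(), 'b-only'
            b = fb.readline()
        fa.close()
        fb.close()

Each input line is looked at exactly once, so the whole thing is linear
in the total size of the two files, same as comm.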

> I was really hoping I'll get an answer how to alter the indexes for dictionaries
> in python.

Sorry, I don't have a guess for what that might mean.

> You convinced me not to even try to construct to theoretical dictionary,
> as it will take ages just to create. Even if I'd manage, I couldn't
> save it (the theoretical and possibly not even the dict(theoret) - dict(real)
> result).

Worse, it didn't sound like a good approach even if you could save it.

> Still, before I give up on the whole project, once more:
> 
> I'll parse some text files, isolate the separate words, and add them to
> either a Set(), list, dict, or flat file, line by line.
>
> Depending on the above, I'll compare them and look for words unique
> to some "language". I need to keep track of the frequencies used
> in every such language,

Frequencies of what?  "A language" here can contain some words
multiple times each?

> so the dict approach would be the best.  The number stored as a value would
> be a float ^H^H^H^H^H^H Decimal() type - a very small number.

Integers are the natural type for counting, and require less memory
than floats or decimals.
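
For example (just a sketch of the idea, with throwaway names): count
into an int-valued dict, and divide only at the point where a relative
frequency is actually wanted:

    def count_words(words):
        # Plain ints as dict values; no floats or Decimals needed to count.
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        return counts

    counts = count_words(['the', 'cat', 'sat', 'the'])
    total = float(sum(counts.values()))
    freq_the = counts['the'] / total   # 0.5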

> Once more, I expect to have between E4 or E5 and E8??? words
> stored in 20 dictionaries (remember, words of sizes in the range 1-20?).
> Each of those 20 dictionaries should be, I believe, indexed just once.
> The indexing method should know all entries in a given file are of the same
> size, i.e. 5 chars, 15 chars, 20 chars, etc.

I think you're making this more complicated than it needs to be.

> I already implemented the logic to walk through those 20 different
> dictionaries from language a and from language b and find those
> unique to a or common to both of them. The reason I came to ask on this
> list was to make sure I took the right approach. Sets seemed to be the
> better solution for the simple comparison (common, unique). To keep track
> of those very small frequencies, I have to have dictionaries anyway. I'll
> say it again: how can I improve the speed of dictionary indexing?
> It doesn't matter here whether I have 10E4 keys in a dictionary or
> 10E8 keys in a dictionary.

What reason do you have for believing that the speed of dictionary
indexing is worth any bother at all to speed up?  Dict lookup is
generally considered to be extremely fast already.  If you're just
*guessing* that indexing is the bottleneck, you're probably guessing
wrong -- run a profiler to find out where the real bottleneck is.
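
For example, something along these lines with the standard profile
module (build_dicts and words.txt are just stand-ins for your own
parsing/counting code and data):

    import profile

    def build_dicts():
        # Stand-in for the real work: read a word list and count.
        counts = {}
        for line in open('words.txt'):      # hypothetical input file
            w = line.strip()
            counts[w] = counts.get(w, 0) + 1
        return counts

    profile.run('build_dicts()')
    # The report shows where the time actually goes -- odds are
    # it isn't dict indexing.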


