Large Amount of Data

Steve Holden steve at holdenweb.com
Sat May 26 12:00:33 EDT 2007


Jack wrote:
> "John Nagle" <nagle at animats.com> wrote in message 
> news:nfR5i.4273$C96.1640 at newssvr23.news.prodigy.net...
>> Jack wrote:
>>> I need to process large amount of data. The data structure fits well
>>> in a dictionary but the amount is large - close to or more than the size
>>> of physical memory. I wonder what will happen if I try to load the data
>>> into a dictionary. Will Python use swap memory or will it fail?
>>>
>>> Thanks.
>>     What are you trying to do?  At one extreme, you're implementing
>> something like a search engine that needs gigabytes of bitmaps to do
>> joins fast as hundreds of thousands of users hit the server, and need
>> to talk seriously about 64-bit address space machines.  At the other,
>> you have no idea how to either use a database or do sequential
>> processing.  Tell us more.
>>
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
> And this is not a one time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...
>
And I think you are wrong. But of course the only way to find out who's 
right and who's wrong is to do some experiments and get some benchmark 
timings.
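
A minimal sketch of such an experiment, assuming SQLite (via the standard
sqlite3 module) stands in for "a database" and that records look like
name -> properties strings. The table name, key format and sizes are all
hypothetical; they are not taken from Jack's data.

```python
import os
import random
import sqlite3
import tempfile
import time


def build_dict(n):
    """Build an in-memory dict of n hypothetical document records."""
    return {f"doc{i}": f"props{i}" for i in range(n)}


def build_db(path, n):
    """Store the same n records in an SQLite table keyed by document name."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, props TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        ((f"doc{i}", f"props{i}") for i in range(n)),
    )
    conn.commit()
    return conn


def make_db_lookup(conn):
    """Return a function that fetches one record by name, or None if absent."""
    def lookup(key):
        row = conn.execute(
            "SELECT props FROM docs WHERE name = ?", (key,)
        ).fetchone()
        return row[0] if row else None
    return lookup


def time_lookups(lookup, keys):
    """Return (elapsed seconds, number of keys found) for the given lookup."""
    start = time.perf_counter()
    hits = sum(1 for key in keys if lookup(key) is not None)
    return time.perf_counter() - start, hits


def run_benchmark(n=100_000, probes=10_000, seed=0):
    """Time the same random lookups against a dict and an SQLite table."""
    rng = random.Random(seed)
    keys = [f"doc{rng.randrange(n)}" for _ in range(probes)]
    table = build_dict(n)
    with tempfile.TemporaryDirectory() as tmp:
        conn = build_db(os.path.join(tmp, "docs.sqlite"), n)
        try:
            results = {
                "dict": time_lookups(table.get, keys),
                "sqlite": time_lookups(make_db_lookup(conn), keys),
            }
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    for name, (seconds, hits) in run_benchmark().items():
        print(f"{name}: {hits} hits in {seconds:.4f}s")
```

At these sizes everything fits comfortably in RAM and in the operating
system's file cache, so the numbers say nothing yet about the
larger-than-memory case. The timings only become relevant once n is
raised towards the real data volume on the target machine.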

All I *would* say is that it's unwise to proceed with a memory-only 
architecture when you only have assumptions about the limitations of 
particular architectures, and your problem might actually grow to exceed 
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to 
take a real nose-dive. Then where do you go? Much better to architect 
the application so that you anticipate exceeding memory limits from the 
start, I'd hazard.
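
One way to leave that door open, sketched here under assumptions: write
the update logic against a generic mapping, so the same code runs on a
plain dict while the data is small and on a disk-backed store later. The
standard shelve module is used as the disk-backed example; the function
names and the record layout (name -> dict of properties) are hypothetical.

```python
import os
import shelve
import tempfile


def merge_properties(store, name, new_props):
    """Merge new_props into the record for name, creating it if absent.

    store may be any mutable mapping: a plain dict or an open shelf.
    """
    props = store.get(name, {})
    props.update(new_props)
    store[name] = props  # reassign so a shelf writes the change to disk
    return props


def demo():
    """Update a record in a shelf, reopen the shelf, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs")
        with shelve.open(path) as store:
            merge_properties(store, "doc1", {"title": "Large Amount of Data"})
            merge_properties(store, "doc1", {"lang": "en"})
        with shelve.open(path) as store:
            return "doc1" in store, store["doc1"]
```

Because merge_properties only relies on get and item assignment, swapping
the backend does not touch the calling code. Whether shelve itself is
fast enough for tens of millions of records is exactly the kind of
question only a benchmark can answer.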

> Let's say, I want to do something a search engine needs to do in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
>
You might be surprised. Google, for example, use a widely-distributed 
and highly-redundant storage format, but they certainly don't keep the 
whole Internet in memory :-)

Perhaps you need to explain the problem in more detail if you still need 
help.

regards
  Steve


-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden



