List or Dictionairy or neither?

Bengt Richter bokr at accessone.com
Wed Aug 22 19:53:31 EDT 2001


On Fri, 17 Aug 2001 12:08:25 +0200, Dietmar Lang <dietmar at wohnheim.fh-wedel.de> wrote:

>
>Greetings, fellow Pythoneers,
>
>I am currently working on an application that works with a simple, but
>relatively large data structure: a large list, with its items being
>strings, 3-4 lines long each.
>
>Now the list is shelved and I am accessing its items pretty randomly.
>I've found the access times to be rather slow. So my questions are:
>
>- Would using a dictionary make access faster?
>
>- Is there a faster method for saving and accessing my stuff?
>
>- Would saving the data as one big text file with separators/indexes and
>a good search algorithm be faster? (Since my strings won't exceed a
>certain length, maybe defining a max size for each item and then seeking
>into the file with, say, 5-line distances?)
>
If you don't want to use Aahz's database suggestion, you could do something
with a file as you mention, but I am wondering what you are using as a key to
selecting a particular string, or if there might be more than one key to
the same string.

One idea would be to write all the strings end to end into a file, and
store their start byte positions and lengths as tuple values for keys in a dictionary,
somethin like

    ### warning -- untested
    stringDict = {}
    f = open('stringfile','wb')	# binary is better for accurate seeking on windows

    # ... loop to write strings to file:
    str = 'example string number three, or whatever'
    stringKey = 3 # or whatever
    stringFilePosition = f.tell()
    stringLength = len(str)
    f.write(str)
    stringDict[stringKey] = (stringFilePosition, stringLength)
    # ...

    f.close()

You could do this using numeric sequential keys, or using anything immutable for keys.

Then to get a string, just do

    ### warning -- untested
    f = open('stringfile','rb')
    stringKey = 5 # whatever you're looking for
    f.seek(stringDict[stringKey][0])
    str = f.read(stringDict[stringKey][1])
    f.close()

(Or keep the file open if you are doing a bunch).
It shouldn't take more than a few milliseconds to get any string.

If you have a really huge set of strings, you could store them in multiple files
and add a file id character to the tuple, to know which file to open. Or you could
have a separate dictionary for each file, where knowing which dictionary to use would
be easy if the key were numeric. If you just have your strings sequentially numbered
and don't use any other key, you could just index into a list or lists, instead of
using a dictionary.

Lots of ways to skin the cat. You could also consider compressing in various ways, if
you have huge amounts of data. But try the dead simple thing first. The above should be close,
and you can try it in a few minutes.

You could also make a handy class to instantiate with a file name, and have methods for
retrieving and/or adding strings with various keys. It could pickle its stringDict under a name
like the stringfile with a different extension or something. If you add a special delimiter
character ('\x00' ?) at the end of strings in your file, you will have a more robust system
for re-indexing without going back to original sources. If you do, don't include it in the
string length in the tuple, so it won't be included in the string you read back.

HTH




More information about the Python-list mailing list