optimizing memory utilization

Bengt Richter bokr at oz.net
Wed Sep 15 15:52:25 EDT 2004


On Wed, 15 Sep 2004 03:44:36 GMT, "Anon" <anon at ymous.com> wrote:

>On Wed, 15 Sep 2004 02:20:38 +0000, Bengt Richter wrote:
>
>> Need example of raw input and desired output ;-)
>The input is:
>- one "csv" file containing a list of albums with their artist and an
>AlbumID - one "csv" file containing a list of track names and associated
>AlbumID - one big file containing fully-qualified paths to mp3 files
Is the latter the 10k file to be searched for matches to patterns from the former two?
If so, you could certainly do better than walking through mp3 paths sliding
a pattern window one character at a time, since windows containing path delimiters
are not likely to match anything. IOW, you probably want to split on the delimiters
and also do some text normalization before trying to match.
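Untested sketch of the splitting part (the path layout and the helper
name are just my guesses about your data):

----< pathfields.py >------------------------------------
import os

def path_fields(mp3_path):
    # Split a fully-qualified mp3 path into candidate fields,
    # instead of sliding a window across the raw string.
    base, _ext = os.path.splitext(mp3_path)
    return [part for part in base.split(os.sep) if part]

# path_fields('/music/Some Artist/Some Album/01 - Track.mp3')
# -> ['music', 'Some Artist', 'Some Album', '01 - Track']
---------------------------------------------------------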

How about some actual raw data? ;-) I.e., a few lines from the raw csv files and
the mp3 path file. Actual samples will probably raise questions about how the search
patterns are formed and how the searched data may have to be modified to match. E.g.,
removing punctuation, collapsing multiple embedded spaces, case sensitivity, spelling, etc.
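Untested sketch of one possible normalization (what to strip is your
call, of course):

----< normalize.py >-------------------------------------
import re

def normalize(s):
    # lowercase, punctuation to spaces, whitespace runs collapsed
    s = s.lower()
    s = re.sub(r'[^\w\s]', ' ', s)
    return re.sub(r'\s+', ' ', s).strip()

# normalize("  Don't  Stop -- Believin'! ") -> 'don t stop believin'
---------------------------------------------------------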

>
>The desired output is:
>an XML file with the "proposed" artist, album, and track name for each
>file.  The file name would be the top-level element for a given XML node
>and then there would be element groupings of "proposed" artist, album, and
>track name for each file.
XML lets you do stuff like <foo fie="fie" fo="fum"/> or <foo><fie>fie</fie><fo>fum</fo></foo>
and variations. You can generate a lot of fluff if you don't watch out. Compare the
economy of Python source vs XML representing Python source ;-) Do you have
a utility that is going to accept the XML as input, and so is defining the format?
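If nothing downstream is dictating the format, ElementTree keeps the
fluff manageable (the standalone elementtree package, or xml.etree in
later Pythons). Untested sketch; the element names are just my guess at
a workable format:

----< proposals2xml.py >---------------------------------
from xml.etree import ElementTree as ET

def proposals_to_xml(proposals):
    # proposals: {filename: [(artist, album, track), ...]}
    root = ET.Element('files')
    for filename, candidates in proposals.items():
        f = ET.SubElement(root, 'file', name=filename)
        for artist, album, track in candidates:
            p = ET.SubElement(f, 'proposed')
            ET.SubElement(p, 'artist').text = artist
            ET.SubElement(p, 'album').text = album
            ET.SubElement(p, 'track').text = track
    return ET.tostring(root)
---------------------------------------------------------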

>
>The last step is to edit the XML file and "select" the desired fields from
>list of candidates for each fields, and then use the XML as a work order
>for tagging (and possibly renaming) all of the identified files.
>
>
>> Patterns starting at any character? What if you have overlapping
>> matches? I.e., if you had a string 'abbcde123', and were looking for
>> 'bb' and 'bc' would it count as two matches?  Would longer matches have
>> priority? E.g., matching 'bbcd' overrides 'bb' and 'cd' as separate
>> matches?
>
>Yes, yes, and yes...
>
You'll need a change to make long patterns come out first, but no big deal.
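Untested sketch of that ordering change, reusing your 'abbcde123'
example (overlaps still count; longer patterns just come out first):

----< longestfirst.py >----------------------------------
def all_matches(text, patterns):
    # report every occurrence, overlaps included, sorted by
    # position and then by length, longest first
    hits = []
    for pat in patterns:
        start = text.find(pat)
        while start >= 0:
            hits.append((start, -len(pat), pat))
            start = text.find(pat, start + 1)
    hits.sort()
    return [(pos, pat) for pos, _neglen, pat in hits]

# all_matches('abbcde123', ['bb', 'bc', 'bbcd'])
# -> [(1, 'bbcd'), (1, 'bb'), (2, 'bc')]
---------------------------------------------------------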
>
>> Anyway, let's see what a brute force simple approach does. E.g, just put
>> the 500,000 small strings in a small tree of dictionaries based on say
>> string length, then the string (you could tier it again with some
>> leading characters and the rest, but I'd try the easy thing first. But
>> anyway, the idea (practically untested) is something like:
>
>I had planned to sort the list based upon string length (and probably
>apply some minimum length requirement too), but the tree of dictionaries
>surely seems a bit more elegant.
>
Before that, though, are you going to modify the strings for consistent
whitespace and punctuation? Or check spellings? (On the other end too, of course.)
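To make that tree of dictionaries concrete (untested sketch; tune the
minimum pattern length to taste):

----< lengthindex.py >-----------------------------------
def build_index(patterns, min_len=4):
    # tier the patterns by length: {length: {pattern: True}}
    index = {}
    for pat in patterns:
        if len(pat) >= min_len:
            index.setdefault(len(pat), {})[pat] = True
    return index

def scan(text, index):
    # slide a window of each indexed length over the text;
    # each window costs one dict lookup
    for length in sorted(index, reverse=True):  # longest first
        bucket = index[length]
        for i in range(len(text) - length + 1):
            if text[i:i + length] in bucket:
                yield i, text[i:i + length]
---------------------------------------------------------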
>
>> This finds repeating and overlapping patterns, since it was easy. That
>> may not be what you want, but you didn't say ;-)
>>
>> ----< s500kin10k.py >---------------------------------------
[...]
>> 
>Gee, thanks.  That's a nice example!
>
You're welcome.

>> I suspect that you won't be that happy iterating through just arrays,
>> unless they are arrays of pattern-matching tables like IIRC flex
>> generates to do parsing. Basically doing tricky parallel pattern
>> matching by keeping track of which patterns are alive as you go from
>> character to character. But 500k patterns are a lot of patterns....
>
>Since I only want to do it once in a rare while, execution time isn't a huge
>deal...  I could readily write this code in C and make my data fit within
>a reasonable amount of RAM, but I'm trying to use the project as an
>opportunity to learn Python.
Sounds like a good way to go. If you are good in C, you can also learn how
to make the best of both worlds by writing your own extension modules in C
that you can import in Python.

Don't forget to look into available modules that may help. E.g., the array
module can make arrays of numbers very efficient. mmap can let you look at
a file as if it were a character array, etc. Also re for regular expressions,
which you may need at some point.
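Untested sketch of the mmap/re combination on the big path file (the
file name and search pattern are made up, of course):

----< grepmmap.py >--------------------------------------
import mmap, re

f = open('mp3paths.txt', 'rb')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# scan the file as one big character array, line by line,
# without slurping it into a string first
for hit in re.finditer(br'(?im)^.*beatles.*$', m):
    print(hit.group())
m.close()
f.close()
---------------------------------------------------------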

>
>
>> See if the above runs you out of memory. Or time ;-) I'd try it on
>> shorter files first. You didn't say what you wanted your final output to
>> look like. Small examples of input => output communicate well ;-)
>
>Well, basically, I'm trying to use the FreeDB database as a list of valid
>artists, albums, and tracks to locate these strings within the file names
>of mp3 files and then use those results to apply ID tags to the mp3 files.
>I'm assuming that I'll have multiple matches for any given file so my plan
>was to output to an XML file, hand 'clean' that, and then use the XML as
>the input for the actual tagging operation.
>
Again, "locate these strings" is easy to say, but each word hides a lot of
meaning. Messing with a small batch of _representative_ actual data will tell you a lot ;-)

Good luck.

Regards,
Bengt Richter


