personal document mgmt system idea

Sandy Norton sandskyfly at hotmail.com
Tue Jan 20 08:38:51 EST 2004


Hi folks,

I have been mulling over an idea for a very simple Python-based
personal document management system. The idea grew out of the
following typical problem:

I accumulate a lot of files (documents, archives, PDFs, images, etc.)
on a daily basis, and storing them in a hierarchical file system is
simple but unsatisfactory:

       - deeply nested hierarchies are a pain to navigate
         and to reorganize
       - different file systems have inconsistent and weak schemes
         for storing metadata, e.g. compare the variety of incompatible
         schemes in Windows alone (Office docs vs. PDFs, etc.).

I would like a personal document management system that:

	- performs adequately for interactive use
	- can accommodate data files of up to 50 MB
	- is simple and easy to use
	- promotes maximum programmability
	- allows for the selective replication (or backup) of data
	  over a network
	- allows for multiple (custom) classification schemes
	- is portable across operating systems

The system should promote the following simple pattern:

    receive file -> drop it into 'special' folder

    after an arbitrary period of doing the above n times -> run application

    for each file in folder:
        if automatic metadata extraction is possible:
            scan file for metadata and populate fields accordingly
            fill in missing metadata
        else:
            enter metadata
        store file

    every now and then:
        run replicator function of application -> will back up data
                                                  over a network
        # this will make specified files available to co-workers
        # accessing a much larger web-based, non-personal version
        # of the document-management system.
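The intake loop above could be sketched roughly as follows (a sketch
only: `extract_metadata` and `scan_special_folder` are hypothetical
names, and a real version would dispatch on file type — PDF info
dictionary, Office document properties, EXIF, etc. — rather than rely
on what the file system alone can report):

<code>

import os
import mimetypes

def extract_metadata(filepath):
    # Best-effort automatic extraction (sketch). Here we only
    # collect what the file system itself can tell us.
    size = os.stat(filepath).st_size
    mimetype, encoding = mimetypes.guess_type(filepath)
    return {
        'name':   os.path.basename(filepath),
        'size':   size,
        'format': mimetype,   # maps onto the Dublin Core 'format' element
    }

def scan_special_folder(folder):
    # Walk the drop folder and build one metadata record per file;
    # missing fields would then be filled in by hand.
    records = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            records.append(extract_metadata(path))
    return records

</code>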

My initial prototyping efforts involved creating a single test table
in MySQL (later to include fields for Dublin Core metadata elements)
with a BLOB field for the data itself. My present dev platform is
Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v0.9.2
and Python 2.3.3. However, I will be testing the same app on Mac OS X
and Linux Mandrake 9.2 as well.
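For the Dublin Core extension mentioned above, the schema might
eventually grow into something like this (a sketch only: the fifteen
Dublin Core elements as VARCHAR columns, prefixed dc_ to sidestep any
keyword clashes, with column sizes chosen arbitrarily):

<code>

# The fifteen elements of the Dublin Core metadata element set.
DUBLIN_CORE = [
    'title', 'creator', 'subject', 'description', 'publisher',
    'contributor', 'date', 'type', 'format', 'identifier',
    'source', 'language', 'relation', 'coverage', 'rights',
]

def create_table_sql(table='tstable'):
    # Build a CREATE TABLE statement with one VARCHAR column per
    # Dublin Core element plus the BLOB payload.
    columns = ',\n    '.join(
        ['dc_%s VARCHAR(255)' % e for e in DUBLIN_CORE])
    return ('CREATE TABLE %s (\n'
            '    id   INTEGER PRIMARY KEY AUTO_INCREMENT,\n'
            '    name VARCHAR(100),\n'
            '    %s,\n'
            '    data BLOB\n'
            ');') % (table, columns)

</code>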

The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case an .avi file of 7.2 MB.

Here's the code:

<code>

import sys, time, os, zlib
import MySQLdb, _mysql


def initDB(db='test'):
    connection = MySQLdb.Connect("localhost", "sa")
    cursor = connection.cursor()
    cursor.execute("use %s;" % db)
    return (connection, cursor)

def close(connection, cursor):
    # close the cursor before its connection
    cursor.close()
    connection.close()
    
def drop_table(cursor):
    try:
        cursor.execute("drop table tstable")
    except MySQLdb.Error:
        # table didn't exist yet; nothing to drop
        pass

def create_table(cursor):
    cursor.execute('''create table tstable 
        (   id      INTEGER PRIMARY KEY AUTO_INCREMENT,
            name    VARCHAR(100),
            data    BLOB
        );''')

def process(data):
    # compress, then escape so the binary can be embedded in the SQL string
    data = zlib.compress(data, 9)
    return _mysql.escape_string(data)

def populate_table(cursor):
    files = [(f, os.path.join('testdocs', f))
             for f in os.listdir('testdocs')]
    for filename, filepath in files:
        t1 = time.time()
        f = open(filepath, 'rb')
        data = process(f.read())
        f.close()
        # IMPORTANT: you have to quote the binary text even after
        # escaping it.
        cursor.execute('''insert into tstable (id, name, data)
                          values (NULL, '%s', '%s')''' % (filename, data))
        print time.time() - t1, 'seconds for', filepath


def main():
    connection, cursor = initDB()
    # rebuild the test table from scratch and load the test documents
    drop_table(cursor)
    create_table(cursor)
    populate_table(cursor)
    close(connection, cursor)
    

if __name__ == "__main__":
    t1 = time.time()
    main()
    print '=> it took a total of', time.time() - t1, 'seconds to complete'

</code>

<traceback>

>pythonw -u "test_blob.py"
0.155999898911 seconds for  testdocs\business plan.doc
0.0160000324249 seconds for  testdocs\concept2businessprocess.pdf
0.0160000324249 seconds for  testdocs\diagram.vsd
0.0149998664856 seconds for  testdocs\logo.jpg
Traceback (most recent call last):
  File "test_blob.py", line 59, in ?
    main ()
  File "test_blob.py", line 53, in main
    populate_table(cursor)
  File "test_blob.py", line 44, in populate_table
    cursor.execute('''insert into tstable (id, name, data) values (NULL, '%s', '%s')''' % (filename, data))
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 95, in execute
    return self._execute(query, args)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
>Exit code: 1

</traceback>

My questions are:

- Is my test code at fault?

- Is this the wrong approach to begin with: i.e. is it a bad idea to
  store the data itself in the database?

- Am I using the wrong database? (or is the connector just buggy?)


Thanks to all.

best regards,

Sandy Norton


