personal document mgmt system idea

John Roth newsgroups at jhrothjr.com
Tue Jan 20 11:17:16 EST 2004


I wouldn't put the individual files in a database - that's what
file systems are for. The exception is small files (and by the
time you say ".doc" in MS Word, it's no longer a small file)
where you can save substantial space by consolidating them.
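
Something along these lines is what I have in mind - the bytes stay in
the file system and only the metadata goes into a table. This is only a
rough, untested sketch; the 'documents' table, the 'docstore' directory
and the column names are my own invention, not anything from Sandy's
setup:

<code>

import os, shutil, time
import MySQLdb

STORE_DIR = 'docstore'        # where the actual files end up living

def store_file(cursor, filepath):
    """Copy one file into the store and record its metadata."""
    name = os.path.basename(filepath)
    dest = os.path.join(STORE_DIR, name)
    shutil.copy2(filepath, dest)               # bytes stay in the file system
    cursor.execute(
        "insert into documents (name, path, size, added)"
        " values (%s, %s, %s, %s)",
        (name, dest, os.path.getsize(dest),
         time.strftime('%Y-%m-%d %H:%M:%S')))  # only metadata goes to MySQL

def main():
    if not os.path.isdir(STORE_DIR):
        os.mkdir(STORE_DIR)
    connection = MySQLdb.Connect("localhost", "sa")
    cursor = connection.cursor()
    cursor.execute("use test")
    cursor.execute("""create table if not exists documents
        (   id    INTEGER PRIMARY KEY AUTO_INCREMENT,
            name  VARCHAR(100),
            path  VARCHAR(255),
            size  INTEGER,
            added DATETIME
        )""")
    for f in os.listdir('testdocs'):
        store_file(cursor, os.path.join('testdocs', f))
    cursor.close()
    connection.close()

if __name__ == "__main__":
    main()

</code>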

John Roth

"Sandy Norton" <sandskyfly at hotmail.com> wrote in message
news:b03e80d.0401200538.10fcf33a at posting.google.com...
> Hi folks,
>
> I have been mulling over an idea for a very simple python-based
> personal document management system. The source of this possible
> solution is the following typical problem:
>
> I accumulate a lot of files (documents, archives, pdfs, images, etc.)
> on a daily basis and storing them in a hierarchical file system is
> simple but unsatisfactory:
>
>        - deeply nested hierarchies are a pain to navigate
>          and to reorganize
>        - different file systems have inconsistent and weak schemes
>          for storing metadata, e.g. the variety of incompatible
>          schemes on Windows alone (Office docs vs. PDFs, etc.).
>
> I would like a personal document management system that:
>
> - has adequate, usable performance
> - can accommodate data files of up to 50MB
> - is simple and easy to use
> - promotes maximum programmability
> - allows for the selective replication (or backup) of data
>   over a network
> - allows for multiple (custom) classification schemes
> - is portable across operating systems
>
> The system should promote the following simple pattern:
>
>     receive file -> drop it into 'special' folder
>
>     after an arbitrary period of doing the above n times -> run application
>
>     for each file in folder:
>         if automatic metadata extraction is possible:
>             scan file for metadata and populate fields accordingly
>             fill in missing metadata
>         else:
>             enter metadata
>         store file
>
>     every now and then:
>         run replicator function of application -> will back up data
>             over a network
>         # this will make specified files available to co-workers
>         # accessing a much larger web-based non-personal version of the
>         # document management system.
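>
> Very roughly, in Python - extract_metadata, enter_metadata and
> store_document below are only placeholders for whatever the app ends
> up providing:
>
> import os
>
> INBOX = 'special_folder'      # the 'special' drop folder
>
> def extract_metadata(path):
>     # placeholder: return whatever can be pulled out automatically,
>     # or None if this file type isn't understood
>     return None
>
> def enter_metadata(path, partial=None):
>     # placeholder: prompt the user for the missing fields
>     return partial or {'title': os.path.basename(path)}
>
> def store_document(path, metadata):
>     # placeholder: push the file and its metadata into the store
>     print 'storing', path, metadata
>
> def process_inbox():
>     for name in os.listdir(INBOX):
>         path = os.path.join(INBOX, name)
>         metadata = extract_metadata(path)
>         if metadata is not None:
>             metadata = enter_metadata(path, metadata)  # fill in the gaps
>         else:
>             metadata = enter_metadata(path)            # enter it all by hand
>         store_document(path, metadata)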
>
> My initial prototyping efforts involved creating a single test table
> in mysql (later to include fields for Dublin Core metadata elements)
> and a BLOB field for the data itself. My present dev platform is
> Windows XP Pro, mysql 4.1.1-alpha, MySQL-python connector v.0.9.2
> and python 2.3.3. However, I will be testing the same app on Mac OS X
> and Linux Mandrake 9.2 as well.
>
> The first problem I've run into is that mysql or the MySQL
> connector crashes when the size of one BLOB reaches a certain point:
> in this case an .avi file of 7.2 MB.
>
> Here's the code:
>
> <code>
>
> import sys, time, os, zlib
> import MySQLdb, _mysql
>
>
> def initDB(db='test'):
>     connection = MySQLdb.Connect("localhost", "sa")
>     cursor = connection.cursor()
>     cursor.execute("use %s;" % db)
>     return (connection, cursor)
>
> def close(connection, cursor):
>     cursor.close()
>     connection.close()
>
> def drop_table(cursor):
>     try:
>         cursor.execute("drop table tstable")
>     except:
>         pass
>
> def create_table(cursor):
>     cursor.execute('''create table tstable
>         (   id      INTEGER PRIMARY KEY AUTO_INCREMENT,
>             name    VARCHAR(100),
>             data    BLOB
>         );''')
>
> def process(data):
>     data =  zlib.compress(data, 9)
>     return _mysql.escape_string(data)
>
> def populate_table(cursor):
>     files = [(f, os.path.join('testdocs', f))
>              for f in os.listdir('testdocs')]
>     for filename, filepath in files:
>         t1 = time.time()
>         data = open(filepath, 'rb').read()
>         data = process(data)
>         # IMPORTANT: you have to quote the binary text even after
>         # escaping it.
>         cursor.execute('''insert into tstable (id, name, data)
>             values (NULL, '%s', '%s')''' % (filename, data))
>         print time.time() - t1, 'seconds for ', filepath
>
>
> def main ():
>     connection, cursor = initDB()
>     # doit
>     drop_table(cursor)
>     create_table(cursor)
>     populate_table(cursor)
>     close(connection, cursor)
>
>
> if __name__ == "__main__":
>     t1 = time.time()
>     main ()
>     print '=> it took total ', time.time() - t1, 'seconds to complete'
>
> </code>
>
> <traceback>
>
> >pythonw -u "test_blob.py"
> 0.155999898911 seconds for  testdocs\business plan.doc
> 0.0160000324249 seconds for  testdocs\concept2businessprocess.pdf
> 0.0160000324249 seconds for  testdocs\diagram.vsd
> 0.0149998664856 seconds for  testdocs\logo.jpg
> Traceback (most recent call last):
>   File "test_blob.py", line 59, in ?
>     main ()
>   File "test_blob.py", line 53, in main
>     populate_table(cursor)
>   File "test_blob.py", line 44, in populate_table
>     cursor.execute('''insert into tstable (id, name, data) values
> (NULL, '%s', '%s')''' % (filename, data))
>   File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py",
> line 95, in execute
>     return self._execute(query, args)
>   File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py",
> line 114, in _execute
>     self.errorhandler(self, exc, value)
>   File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py",
> line 33, in defaulterrorhandler
>     raise errorclass, errorvalue
> _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone
> away')
> >Exit code: 1
>
> </traceback>
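>
> One variation I have not tried yet is to hand the values to the
> cursor separately and let MySQLdb do the quoting and escaping itself
> (same imports as above, no manual escape_string call in that case):
>
> def populate_table_params(cursor):
>     # untested variation: the driver escapes/quotes the values
>     files = [(f, os.path.join('testdocs', f))
>              for f in os.listdir('testdocs')]
>     for filename, filepath in files:
>         data = zlib.compress(open(filepath, 'rb').read(), 9)
>         cursor.execute("insert into tstable (id, name, data)"
>                        " values (NULL, %s, %s)", (filename, data))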
>
> My Questions are:
>
> - Is my test code at fault?
>
> - Is this the wrong approach to begin with: i.e. is it a bad idea to
>   store the data itself in the database?
>
> - Am I using the wrong database? (or is the connector just buggy?)
>
>
> Thanks to all.
>
> best regards,
>
> Sandy Norton
