[Chicago] When to load?

Aaron Elmquist elmq0022 at umn.edu
Mon Feb 1 13:03:44 EST 2016


If simple persistence on disk solves your problem, consider the json
module.  Simple dictionaries map well to JSON objects, and you get the
benefits of JSON: a human-readable text format that other languages and
tools can read.
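
A minimal sketch of the json approach, assuming the dictionaries hold only
strings, numbers, lists, and nested dictionaries. The function names, the
sample terms, and the file name are made up for illustration.

```python
import json
import os
import tempfile


def save_terms(terms, path):
    """Write a dictionary of terms to disk as JSON text."""
    with open(path, "w") as f:
        json.dump(terms, f, indent=2, sort_keys=True)


def load_terms(path):
    """Read the dictionary back from the JSON file."""
    with open(path) as f:
        return json.load(f)


# Hypothetical term dictionary; the real ones would come from the database.
terms = {"amanita": ["cap", "volva"], "boletus": ["pores"]}

path = os.path.join(tempfile.gettempdir(), "terms_example.json")
save_terms(terms, path)
restored = load_terms(path)
```

One caveat: JSON object keys are always strings and tuples come back as
lists, so a dictionary keyed by integers or tuples will not round-trip
unchanged.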

Pickle is great, but I often read about security issues (unpickling data
from an untrusted source can execute arbitrary code), and a pickle written
with a newer protocol cannot be read by older Python versions.  If you do
go the pickle route, maybe look at shelve as well.
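
A rough sketch of what shelve could look like here, assuming each dictionary
is stored under its own key. The key names and sample data are hypothetical.
Note that shelve uses pickle underneath, so the same caution about untrusted
files applies.

```python
import os
import shelve
import tempfile

# A throwaway location for the example; a real program would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "terms_shelf")

# First run: store each dictionary under its own key.
with shelve.open(path) as db:
    db["genera"] = {"amanita": 3, "boletus": 5}
    db["colors"] = {"ochre": 1}

# Later run: open the shelf and load only the dictionary that is needed.
with shelve.open(path) as db:
    genera = db["genera"]
    keys = sorted(db.keys())
```

The shelf behaves like a dictionary on disk, so one large dictionary can be
loaded without reading the others.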

Also, I wonder if you need to pull in a large portion of the database prior
to web scraping.  Could you build updates in memory and push the updates to
the database after you parse the file(s)?
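
To make that idea concrete, here is a sketch that collects new terms in
memory while parsing and writes them to the database in one batch at the end.
It uses an in-memory sqlite3 database as a stand-in; the table name, the
parse function, and the sample texts are all assumptions, not your actual
schema or parser.

```python
import sqlite3


def parse(text, known_terms):
    """Hypothetical parser: return the words not yet in known_terms."""
    return {word.lower() for word in text.split()} - known_terms


# Stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (term TEXT PRIMARY KEY)")

known_terms = set()  # terms seen so far in this run
pending = set()      # terms not yet written to the database

for text in ["Amanita muscaria", "Boletus edulis amanita"]:
    new_terms = parse(text, known_terms)
    known_terms |= new_terms
    pending |= new_terms

# Push all updates in a single transaction after parsing is done.
with conn:
    conn.executemany(
        "INSERT OR IGNORE INTO terms (term) VALUES (?)",
        [(term,) for term in sorted(pending)],
    )

stored = [row[0] for row in conn.execute("SELECT term FROM terms ORDER BY term")]
```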

If you do have a lot of parameters that are annoying to look at, I would
encapsulate them in a single object (a namedtuple from the collections
module would be great for this) and pass the object to the parse routine.
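
Something along these lines, as a sketch only. The field names, the parse
function, and the sample data are invented for the example.

```python
from collections import namedtuple

# One object that bundles every dictionary the parser needs.
ParseContext = namedtuple("ParseContext", ["genera", "colors", "stopwords"])


def parse(text, ctx):
    """Hypothetical parse routine taking the text plus one context object."""
    words = [w.lower() for w in text.split() if w.lower() not in ctx.stopwords]
    return {
        "genera": [w for w in words if w in ctx.genera],
        "colors": [w for w in words if w in ctx.colors],
    }


# Built once in the main program, then reused for every call.
ctx = ParseContext(genera={"amanita"}, colors={"red"}, stopwords={"the", "a"})
result = parse("The red Amanita", ctx)
```

The namedtuple itself is immutable, but the dictionaries or sets it holds are
not, so the parser can still add new terms to them as it goes.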

On Mon, Feb 1, 2016 at 11:25 AM, Joshua Herman <zitterbewegung at gmail.com>
wrote:

> Dear Leon,
> You might want to use pickle or dill (if pickle doesn't work) to
> serialize the dictionaries to disk so that you don't have to do the
> complicated construction of dictionaries. I assume that they are
> computationally expensive to construct so if you serialize them once
> they are constructed you just need the space to store the serialized
> form. Once they are serialized you should only have to generate them
> once and then you just have to load it into your environment.
>
> import pickle
> with open("yourpicklefilenamehere.p", "wb") as f:  # closes the file for you
>     pickle.dump(obj_you_want_to_persist, f)
>
> See https://docs.python.org/2/library/pickle.html# or
> https://docs.python.org/3/library/pickle.html# for more information.
>
> Sincerely,
> Joshua Herman
>
> On Mon, Feb 1, 2016 at 10:13 AM, Leon Shernoff
> <leon at mushroomthejournal.com> wrote:
> > Hello,
> >
> > I have a modularity design question. I am writing a program that, as it goes
> > along, calls a text-parsing routine. In fact, the main program is a scraping
> > program (or pseudo-scraping -- it will also run on a collection of text
> > files) that runs this parsing routine in a loop over many pages/files.
> >
> > The parsing routine calls various other subroutines, so I'd like to put the
> > whole set of them in a separate file that gets imported by the main program.
> > The parsing program uses several dictionaries of terms, and as it processes
> > more and more texts it adds more terms to those dictionaries and they get
> > stored in a database that is read at launch to construct the dictionaries.
> > So the dictionaries are a bit expensive to generate and I'd like to have to
> > construct them only once.
> >
> > So, I'm unclear on the persistence here (experienced developer, pretty new
> > to Python):
> >
> > 1) If I put the database-read dictionary-construction code in the parser's
> > file, will those get run (and the dictionaries reconstructed) each time the
> > main program uses the parser?
> >
> > 2) If so, do I need to construct the dictionaries in the main program and
> > pass them to the parser each time I invoke it? That would make for several
> > parameters, all of which would be the same each time except for the text to
> > be parsed. This may be one of those things that's more annoying to humans
> > than it is to machines; but if the whole point of sequestering the parse
> > routines in a separate file is to make my main program look cleaner and
> > easier to understand, it is kind of backwards to do that and then issue
> > ugly, cluttered calls to those routines. :-)
> >
> > 3) Is there a better way? (or is #1 just not a problem and they only get
> > constructed once) (Please, please...)  :-)
> >
> > --
> > Best regards,
> >     Leon
> >
> > "Creative work defines itself; therefore, confront the work."
> >      -- John Cage
> >
> >
> > Leon Shernoff
> > 1511 E 54th St, Bsmt
> > Chicago, IL  60615
> >
> > (312) 320-2190
> >
> > _______________________________________________
> > Chicago mailing list
> > Chicago at python.org
> > https://mail.python.org/mailman/listinfo/chicago