[TriPython] A question of recursion and nested parsing

Ken MacKenzie ken at mack-z.com
Fri Sep 15 12:32:17 EDT 2017


OK, I have gone through all the responses; let me try to address what I can.

1.  I pasted from emacs into gmail; sorry it formatted poorly.  I will get
the code out into a repo later today to make it easier to grab.

2.  A note about the code: right now it is a simple module that I am
testing in ipython.  So, the bigger picture.  We have a U2 DB on Solaris
here.  U2 is an atrocious select performer, so to serve some RESTful APIs
(using falcon) to support PowerBI reporting, I wrote a "Data Bridge" program
that uses fabric to issue commands against the DB and build XML extracts
that I can pull into SQL Server (or another RDBMS, using SQLAlchemy) to
serve the reporting APIs.  However, the export process on the U2 side can
take an hour for a file with, say, about 1 million records.  That seemed
slow to me, and there was no way to make it go faster: it was maxing out a
single core on the m10 in question, and it is a single-threaded app.  This
becomes a conundrum, as the timing of the extract needs to be after backups
but before business starts.

So I got to figuring: could I just pull the raw data files and parse them
myself?  The data I need is plain text in there; I can bring the files down
with fabric/scp and then work with them locally.  So that is how we got
here.  (A rough sketch of the pull step is below.)
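
Just to make the idea concrete, the pull step I have in mind is roughly
this (Fabric 1.x style; the host name and paths here are placeholders, not
the real ones):

    from fabric.api import env, get

    env.host_string = "user@u2-host"   # placeholder, not the real box

    def pull_raw_file(remote_path, local_path):
        # scp the raw U2 data file down so we can parse it locally
        get(remote_path, local_path)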

3.  What I know about the data.  There is a list of many possible
delimiters, but there are a core few I truly care about.  I am now taking
the range of possible delimiters and cleaning the rest out.  By the time we
get to the recursive function there are 5 levels of delimiters it could go
through in recursion; each level could have a very large branch count, but
the branches get handled serially, so that should be ok.  (A rough sketch
of what I mean is below.)
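
Roughly, the recursion looks like this (the specific delimiter characters
here are just stand-ins, not the real list from the U2 file):

    # Five levels of nesting, outermost delimiter first.  These byte
    # values are placeholders for the delimiters I actually care about.
    DELIMS = ["\xfe", "\xfd", "\xfc", "\xfb", "\xfa"]

    def parse_nested(text, level=0):
        """Recursively split text on each delimiter level in turn."""
        if level >= len(DELIMS):
            return text                           # leaf value, nothing left to split
        parts = text.split(DELIMS[level])
        if len(parts) == 1:
            return parse_nested(text, level + 1)  # delimiter absent at this level
        return [parse_nested(p, level + 1) for p in parts]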

4.  I did change the code (well, added another entry point) to deal with
parsing the file line by line as opposed to a readlines().  No immediate
change in performance, which I expected, but when I get to testing, say,
that 1 million record file it will become relevant.  (The shape of the new
entry point is sketched below.)
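
The new entry point is roughly this shape (names are placeholders; it
reuses the parse_nested sketch from above):

    def parse_file(path):
        results = []
        with open(path, "r") as fh:
            for line in fh:              # stream one line at a time
                results.append(parse_nested(line.rstrip("\n")))
        return results

as opposed to calling fh.readlines() and holding the whole file in memory
before parsing.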

I hope this helps a bit as to the where and why.  This is not exactly
premature optimization, but an attempt to prove whether this different
method can compete with, or exceed, the existing tested performance.

Ken