[TriPython] A question of recursion and nested parsing

Ken MacKenzie ken at mack-z.com
Fri Sep 15 14:35:11 EDT 2017


Side note response I missed:

5.  I am fine with commentary on my variable names, or any other part of
the code.  I feel that if one is not ready for criticism of their code,
they should not put it out publicly.  That is the point of open source
anyway, so let the code have it if you see something.  A small caveat:
the code was meant to be a quick R&D throwaway, not production code.  I
am sure that saying so will curse it into production at some point.  In
any case, I would never leave the code as is.  There isn't a docstring
to be seen, and the only comments in the actual code are previous
attempts I replaced.  I do know I have a history of being lazy with
variable naming, though.

Ken

On Fri, Sep 15, 2017 at 12:32 PM, Ken MacKenzie <ken at mack-z.com> wrote:

> OK, I have gone through all the responses; let me try to address what I can.
>
> 1.  I pasted from emacs into gmail; sorry it formatted poorly.  I will
> get the code out into a repo later today to make it easier to grab.
>
> 2.  A note about the code: right now it is a simple module that I am
> testing in IPython.  So, the bigger picture.  We have a U2 DB on Solaris
> here.  U2 is an atrocious SELECT performer, so to serve the RESTful APIs
> (using Falcon) that support PowerBI reporting, I wrote a "Data Bridge"
> program that uses Fabric to issue commands against the DB and build XML
> extracts, which I can pull into SQL Server (or another RDBMS, using
> SQLAlchemy) to serve the reporting APIs.  However, the export process on
> the U2 side can take an hour for a file with, say, about 1 million
> records.  That seemed slow to me, and there was no way to make it go
> faster, as it was maxing out a single core on the M10 in question and it
> is a single-threaded app.  This becomes a conundrum, as the timing of
> the extract needs to be after backups but before business starts.
>
> So I got to figuring: could I just pull the raw data files and parse
> them myself?  The data I need is plain text in there, and I can bring
> the files down with Fabric/scp and then work with them locally.  That
> is how we got here.
>
> 3.  What I know about the data: there is a long list of possible
> delimiters, but only a core few I truly care about.  I am now taking
> the range of possible delimiters and cleaning the rest out.  By the
> time we get to the recursive function there are 5 levels of delimiters
> it could descend through.  Each level can have a very large branch
> count, but the branches are handled serially, so that should be OK.
>
> 4.  I did change the code (well, added another entry point) to parse
> the file line by line as opposed to using readlines().  No immediate
> change in performance, which I expected, but when I get to testing,
> say, that 1-million-record file, it will become relevant.
>
> I hope this helps a bit as to the where and why.  It is not exactly
> premature optimization, but an attempt to prove whether this different
> method can compete with or exceed the existing, tested performance.
>
> Ken
>


More information about the TriZPUG mailing list