[TriPython] A question of recursion and nested parsing

Ken MacKenzie ken at mack-z.com
Fri Sep 15 14:56:02 EDT 2017


One more update.  After further research I have concluded that the juice is
not worth the squeeze on this method, due to reliability concerns with the
overall data structures.  That said, as a point of academic discussion I
still consider the recursive vs. iterative question for the nested parse
engine worthwhile, but the code will probably end there for now.

On Fri, Sep 15, 2017 at 2:35 PM, Ken MacKenzie <ken at mack-z.com> wrote:

> Side note response I missed:
>
> 5.  I am fine with commentary on my variable names, or any other part of
> the code.  I feel that if one is not ready for criticism of their code,
> they should not put it out publicly; that is the point of open source
> anyway.  So let the code have it if you see something.  A small caveat:
> the code was meant to be a quick R&D throwaway, not production code (and
> I am sure saying that will curse it into production at some point).
> Anyway, I would never leave the code as is.  There isn't a docstring to
> be seen, and the only comments in the actual code are previous attempts
> I replaced.  I do know that I have a history of being lazy with variable
> naming, though.
>
> Ken
>
> On Fri, Sep 15, 2017 at 12:32 PM, Ken MacKenzie <ken at mack-z.com> wrote:
>
>> OK, I have gone through all the responses; let me try to address what I can.
>>
>> 1.  I pasted from Emacs into Gmail; sorry it formatted poorly.  I will
>> get the code into a repo later today to make it easier to grab.
>>
>> 2.  A note about the code: right now it is a simple module that I am
>> testing in IPython.  The bigger picture: we have a U2 DB on Solaris
>> here.  U2 is an atrocious select performer, so to serve some RESTful
>> APIs (using Falcon) that support Power BI reporting, I wrote a "Data
>> Bridge" program that uses fabric to issue commands against the DB and
>> build XML extracts, which I can then pull into SQL Server (or another
>> RDBMS, via SQLAlchemy) to serve the reporting APIs.  However, the export
>> process on the U2 side can take an hour for a file with, say, about 1
>> million records.  That seemed slow to me, and there is no way to make it
>> go faster: it was maxing out a single core on the M10 in question, and
>> it is a single-threaded app.  That becomes a conundrum because the
>> extract needs to run after backups finish but before business starts.
>>
>> So I got to figuring: could I just pull the raw data files and parse
>> them myself?  The data I need is plain text in there, and I can bring
>> the files down with fabric/scp and then work with them locally.  So
>> that is how we got here.
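>>
>> Roughly what I mean by the pull step (an untested sketch using the
>> fabric 1.x api; the host and paths are made up):
>>
>> from fabric.api import env, get
>>
>> env.host_string = "user@solaris-box"        # placeholder host
>> # copy the raw U2 data file down so it can be parsed locally
>> get("/u2/data/SOMEFILE", local_path="./raw/")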
>>
>> 3.  What I know about the data: there is a long list of possible
>> delimiters, but only a core few I truly care about, so I am now taking
>> the rest of the range and cleaning them out.  By the time we get to the
>> recursive function there are 5 levels of delimiters it could recurse
>> through; each level can have a very large branch count, but the branches
>> are handled serially, so that should be OK.
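>>
>> To make that concrete, the recursive splitter is roughly shaped like
>> this (an untested sketch; the delimiter values here are the usual
>> U2/Pick attribute/value/subvalue marks and stand in for whatever the
>> real level list ends up being):
>>
>> ATTR_MARK = chr(254)   # attribute mark
>> VAL_MARK = chr(253)    # value mark
>> SUBV_MARK = chr(252)   # subvalue mark
>> DELIMS = [ATTR_MARK, VAL_MARK, SUBV_MARK]   # one entry per level
>>
>> def nested_split(text, level=0):
>>     """Split on each delimiter level in turn; leaves are strings."""
>>     if level == len(DELIMS):
>>         return text
>>     return [nested_split(part, level + 1)
>>             for part in text.split(DELIMS[level])]
>>
>> and the iterative version of the same thing just carries an explicit
>> stack instead of recursing:
>>
>> def nested_split_iter(text):
>>     """Same nested-list result, built with an explicit stack."""
>>     root = []
>>     stack = [(root, text, 0)]
>>     while stack:
>>         parent, chunk, level = stack.pop()
>>         for part in chunk.split(DELIMS[level]):
>>             if level == len(DELIMS) - 1:
>>                 parent.append(part)            # leaf value
>>             else:
>>                 child = []
>>                 parent.append(child)
>>                 stack.append((child, part, level + 1))
>>     return root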
>>
>> 4.  I did change the code (well, added another entry point) to parse
>> the file line by line instead of via readlines().  No immediate change
>> in performance, which I expected, but it will become relevant when I
>> get to testing, say, that 1 million record file.
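>>
>> The gist of the two entry points, roughly (handle_record is a
>> placeholder for the real per-line work):
>>
>> def parse_readlines(path):
>>     with open(path, encoding="latin-1") as f:
>>         for line in f.readlines():   # whole file list built up front
>>             handle_record(line)
>>
>> def parse_streaming(path):
>>     with open(path, encoding="latin-1") as f:
>>         for line in f:               # one line at a time
>>             handle_record(line)
>>
>> (latin-1 is just to keep the high-byte delimiter characters intact.)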
>>
>> I hope this helps a bit as to the where and why.  It is not exactly
>> premature optimization, but an attempt to prove whether this different
>> method can match or exceed the existing, tested performance.
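>>
>> A simple timing harness along these lines should be enough for that
>> comparison (sketch; parse_file stands in for whichever entry point is
>> being timed):
>>
>> import time
>>
>> start = time.perf_counter()
>> records = parse_file("raw/SOMEFILE")
>> elapsed = time.perf_counter() - start
>> print("{} records in {:.1f}s".format(len(records), elapsed))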
>>
>> Ken
>>
>
>