[TriPython] A question of recursion and nested parsing
Ken MacKenzie
ken at mack-z.com
Fri Sep 15 14:35:11 EDT 2017
A side note response I missed:
5. I am fine with commentary on my variable names, or on any part of the
code. I feel that if someone is not ready for criticism of their code,
they should not put it out publicly; that is the point of open source
anyway. So let the code have it if you see something. One small caveat:
the code was meant to be a quick R&D throwaway, not production code. (I
am sure that saying so will curse it into production at some point.)
Either way, I would never leave the code as is; there isn't a docstring
to be seen, and the only comments in the actual code are previous
attempts I replaced. I do know that I have a history of being lazy with
variable naming, though.
Ken
On Fri, Sep 15, 2017 at 12:32 PM, Ken MacKenzie <ken at mack-z.com> wrote:
> OK, I have gone through all the responses; let me try to address what I can.
>
> 1. I pasted from emacs into gmail; sorry it formatted poorly. I will get
> the code out into a repo later today to make it easier to grab.
>
> 2. A note about the code: right now it is a simple module that I am
> testing in ipython. So, the bigger picture: we have a U2 DB on Solaris
> here. U2 is an atrocious SELECT performer, so to serve some RESTful APIs
> (using falcon) to support PowerBI reporting, I wrote a "Data Bridge"
> program that uses fabric to issue commands against the DB to build XML
> extracts, which I can then pull into SQL Server (or another RDBMS, via
> SQLAlchemy) to serve the reporting APIs. However, the export process on
> the U2 side can take an hour for a file with, say, about 1 million
> records. That seemed slow to me, and there was no way to make it go
> faster: it was maxing out a single core on the M10 in question, and it
> is a single-threaded app. This becomes a conundrum, as the extract has
> to run after backups but before business starts.
>
> So I got to figuring: could I just pull the raw data files and parse
> them myself? The data I need is plain text in there, and I can bring
> the files down with fabric/scp and then work with them directly. So
> that is how we got here.
>
> 3. What I know about the data: there is a long list of possible
> delimiters, but only a core few I truly care about. I am now taking the
> full range of possible delimiters and cleaning them out. By the time we
> reach the recursive function, there are 5 levels of delimiters it could
> recurse through; each level could have a very large branch count, but
> the branches are handled serially, so that should be OK.
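
[Editor's sketch] The recursive walk described above could look roughly
like the following. This is a minimal illustration, not the posted code:
the function name `parse`, the five-entry delimiter list, and the byte
values are all assumptions (U2/Pick-style databases conventionally use
high-byte marks such as chr(254) and chr(253), but the actual delimiters
in the extract may differ). Note that recursion depth is bounded by the
number of delimiter levels, not by the branch count, since branches at
each level are handled serially.

```python
# Hypothetical delimiters, highest to lowest precedence. U2/Pick-style
# files often use chr(254)..chr(250) as attribute/value/subvalue marks;
# treat these values as placeholders.
DELIMS = ["\xfe", "\xfd", "\xfc", "\xfb", "\xfa"]

def parse(record, level=0):
    """Recursively split a record on nested delimiters.

    Each recursion level handles one delimiter. Recursion depth is at
    most len(DELIMS); the branches produced at a level are processed
    one after another, so a wide record does not deepen the stack.
    """
    if level >= len(DELIMS):
        return record          # leaf: no delimiters left to apply
    parts = record.split(DELIMS[level])
    if len(parts) == 1:
        # Delimiter absent at this level; descend without nesting a list.
        return parse(parts[0], level + 1)
    return [parse(part, level + 1) for part in parts]
```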
>
> 4. I did change the code (well, added another entry point) to parse
> the file line by line as opposed to using readlines(). There was no
> immediate change in performance, which I expected, but it will become
> relevant when I get to testing, say, that 1-million-record file.
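
[Editor's sketch] The line-by-line entry point could be as simple as a
generator like the one below. This is a hedged illustration, not the
posted code; the function name `records` and the encoding choice are
assumptions.

```python
def records(path, encoding="latin-1"):
    """Yield one raw record (line) at a time without slurping the file.

    Iterating the file object reads lines lazily through a buffer, so
    peak memory stays flat even for a million-record extract. By
    contrast, readlines() materializes every line in a list up front,
    which is why the difference only shows on large files.
    """
    with open(path, encoding=encoding) as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Because it is a generator, each yielded record can be fed straight to
the recursive parser without changing peak memory use.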
>
> I hope this helps a bit as to the where and why; it is not exactly
> premature optimization, but an attempt to prove whether this different
> method can compete with, or exceed, the existing tested performance.
>
> Ken
>
More information about the TriZPUG mailing list