[TriPython] A question of recursion and nested parsing
Ken MacKenzie
ken at mack-z.com
Fri Sep 15 12:32:17 EDT 2017
OK, I have gone through all the responses; let me try to address what I can:
1. I pasted from emacs into gmail; sorry it formatted poorly. I will get
the code out into a repo later today to make it easier to grab.
2. A note about the code: right now it is a simple module that I am
testing in ipython. The bigger picture: we have a U2 DB on Solaris
here. U2 is an atrocious select performer, so to serve some RESTful APIs
(using falcon) to support PowerBI reporting I wrote a "Data Bridge" program
that uses fabric to issue commands against the DB and build XML extracts,
which I can pull into SQL Server (or another RDBMS, via SQLAlchemy) to serve
the reporting APIs. However, the export process on the U2 side can take
an hour for a file with about 1 million records. That seemed slow to
me, and there was no way to make it go faster: it was maxing out a
single core on the M10 in question, and it is a single-threaded app. This
becomes a conundrum, as the extract has to run after backups
but before business starts.
So I got to figuring: I wonder if I could just pull the raw data files and
parse them myself. The data I need is plain text in there; I can bring
the files down with fabric/scp and then work with them manually. That is how
we got here.
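For context, a minimal sketch of what parsing a raw record by hand might look like. This assumes the conventional U2/Pick mark bytes (0xFE attribute/field mark, 0xFD value mark); the marks and layout in your actual files may differ, so verify before relying on this:

```python
# Sketch: split a raw U2/Pick-style record into fields and values.
# AM/VM values are the conventional mark bytes, assumed here, not confirmed
# against the actual files on the Solaris box.
AM = "\xfe"  # attribute (field) mark
VM = "\xfd"  # value mark

def parse_record(raw):
    """Return a list of fields; a multivalued field becomes a sublist."""
    fields = []
    for field in raw.split(AM):
        if VM in field:
            fields.append(field.split(VM))  # multivalued field
        else:
            fields.append(field)
    return fields

record = "1001" + AM + "Smith" + AM + "red" + VM + "blue"
print(parse_record(record))  # -> ['1001', 'Smith', ['red', 'blue']]
```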
3. What I know about the data: there is a long list of possible
delimiters, but only a core few I truly care about. I am now taking the
full range of possible delimiters and cleaning them out. By the time we get
to the recursive function there are 5 levels of delimiters it could descend
through in recursion. Each level could have a very large branch count, but
the branches get handled serially, so that should be OK.
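The recursive descent through delimiter levels could be sketched like this. The delimiter characters below are placeholders (not necessarily the real marks), and the recursion depth is bounded by the number of levels (5), so only branch count, not depth, grows with the data:

```python
# Sketch: recurse through an ordered list of delimiter levels.
# Each call splits on the current level's delimiter and recurses on the
# pieces with the remaining levels; branches are processed serially.
DELIMS = ["\xfe", "\xfd", "\xfc", "\xfb", "\xfa"]  # 5 hypothetical levels

def split_levels(text, delims):
    if not delims:
        return text
    head, rest = delims[0], delims[1:]
    if head not in text:
        # Nothing to split at this level; try the deeper levels.
        return split_levels(text, rest) if rest else text
    return [split_levels(part, rest) for part in text.split(head)]

data = "a\xfeb\xfdc\xfed"
print(split_levels(data, DELIMS))  # -> ['a', ['b', 'c'], 'd']
```

Because the recursion only ever goes as deep as the delimiter list, there is no risk of hitting Python's recursion limit here, however wide each level fans out.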
4. I did change the code (well, added another entry point) to parse the
file line by line instead of using readlines(). No immediate change in
performance, which I expected, but when I get to testing that 1-million-record
file it will become relevant.
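The difference between the two entry points looks roughly like this. It only shows up at scale because readlines() materializes every line in memory before the loop starts, while iterating the file object streams one buffered line at a time (function names here are illustrative, not the actual module's):

```python
import os
import tempfile

def parse_all_at_once(path, handle_line):
    with open(path) as f:
        for line in f.readlines():  # whole file held in memory first
            handle_line(line.rstrip("\n"))

def parse_streaming(path, handle_line):
    with open(path) as f:
        for line in f:  # one buffered line at a time
            handle_line(line.rstrip("\n"))

# Quick check that both entry points produce identical results.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("rec1\nrec2\nrec3\n")
    path = tmp.name
a, b = [], []
parse_all_at_once(path, a.append)
parse_streaming(path, b.append)
os.unlink(path)
print(a == b, a)  # -> True ['rec1', 'rec2', 'rec3']
```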
I hope this helps a bit as to the where and why. It is not exactly
premature optimization, but an attempt to prove whether this different
method can compete with or exceed the existing, tested performance.
Ken
More information about the TriZPUG mailing list