[Tutor] formatting xml (again)

Tue Dec 27 16:47:32 EST 2016

* richard kappler <richkappler at gmail.com> [2016-12-27 16:05]:
> The input is consistent in that it all has stx at the beginning of each
> 'event.' I'm leaning towards regex. When you say:
> 
> " find stx, stuff lines until I see the next stx, then dump and continue"
> 
> Might I trouble you for an example of how you do that? I can find stx, I
> can find etx using something along the lines of :
> 
> a = [m.start() for m in re.finditer(r"<devicename>", line)]
> 
> but then I get a little lost, mostly because I have some lines that have
> "data data [\x03][\x02] data" and then to the next line. More succinctly,
> the stx aren't always at the beginning of the line, etx not always at the
> end. No problem, I can find them, but then I'm guessing I would have to
> write to a buffer starting with stx, keep writing to the buffer until I get
> to etx, write the buffer to file (or send it over the socket, either way is
> fine) then continue on. The fact that 'events' span multiple lines is
> challenging me.

Well, that shows that in the context of line-based data, it is not
consistent.  That's the main issue.  If you knew that every event
started on a new line, then you could fairly easily:

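if '\x02' in line:
    output = line.strip()
    line = next(infile)        # infile: the open input file (name assumed)
    while '\x02' not in line:  # collect lines until the next stx
        output = output + line.strip()
        line = next(infile)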

etc.

Unfortunately, we don't have that kind of line-based consistency.  You
are either going to have to treat it more like a binary stream of data,
triggering on stx and etx on a character-by-character basis, or you are
going to have to test for both stx and etx on each line and do different
things based on the combination you find.  Some possible options for
a single line appear to be:

[\x02]
[\x02] data
[\x02] data [\x03]
[\x02] data [\x03][\x02]
[\x03]
data [\x03]
data [\x03][\x02]
data [\x03][\x02] data

etc

That's assuming something really ugly like this couldn't happen on a
single line (but somehow I think it probably can):
data [\x03][\x02] data [\x03][\x02]

I think you are stuck reading it as a character stream rather than as a
line-based text file, since the event boundaries don't line up with the
line breaks.
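
A rough sketch of the character-stream idea, in case it helps.  The
names here (iter_events, chunk_size, the sample string) are made up for
illustration, and it assumes stx/etx are the single characters \x02 and
\x03 and that dropping the line breaks inside an event is acceptable:

```python
import io

STX = '\x02'
ETX = '\x03'

def iter_events(stream, chunk_size=4096):
    """Yield the text between each stx/etx pair, ignoring line breaks."""
    buf = []
    inside = False
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if ch == STX:
                buf = []          # start collecting a fresh event
                inside = True
            elif ch == ETX:
                if inside:
                    yield ''.join(buf)
                inside = False
            elif inside and ch not in '\r\n':
                buf.append(ch)

# Made-up sample: a trailing fragment, two complete events (one spanning
# a line break), and an unfinished event at the end.
sample = io.StringIO('data1\x03\x02<a>one\n</a>\x03\x02<b>two</b>\x03\x02<c>par')
events = list(iter_events(sample))
```

Anything before the first stx, and any event that never reaches its etx,
is silently dropped here; whether that is what you want depends on the
real data.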

Another possibility (I suppose) would be to read per line and split on
the \x02 yourself (I'm assuming that's actually a single character, hex 02).
That would artificially create "record" data that you could manipulate
and combine partial segments into complete xml records to parse.  Might
be faster, might not, probably would get complicated pretty quickly but
could be an option.
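
Something along these lines is what I have in mind for the
split-per-line idea.  It is only a sketch: records_from_lines and the
sample lines are invented for illustration, and it again assumes stx and
etx are the single characters \x02 and \x03:

```python
STX = '\x02'
ETX = '\x03'

def records_from_lines(lines):
    """Split each line on stx and stitch partial segments into records."""
    pending = None                    # None until the first stx is seen
    for line in lines:
        parts = line.rstrip('\r\n').split(STX)
        # parts[0] continues whatever record is already open
        if pending is not None:
            pending += parts[0]
        for part in parts[1:]:
            # each further part follows an stx, so the open record is done
            if pending is not None:
                yield pending.split(ETX)[0]
            pending = part
    # emit the last record only if its etx actually arrived
    if pending is not None and ETX in pending:
        yield pending.split(ETX)[0]

# Made-up input where events straddle line breaks
sample_lines = [
    'data1\x03\x02<a>one\n',
    '</a>\x03\x02<b>two</b>\x03\x02<c>par\n',
    'tial</c>\x03\n',
]
records = list(records_from_lines(sample_lines))
```

Each record is cut at its first etx, so anything between an etx and the
next stx is discarded.  A record that is interrupted by a new stx before
its etx shows up is still emitted as-is, which may or may not be what
you want.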

Without seeing actual data, it's tough to speculate what the best approach
would be.

-- 
David Rock
david at graniteweb.com