HTML Parser which allows low-keyed local changes (upon serialization)

Robert no-spam at non-existing.invalid
Mon Feb 1 08:54:59 EST 2010


Robert wrote:
> Stefan Behnel wrote:
>> Robert, 31.01.2010 20:57:
>>> I tried lxml, but after walking and making changes in the element tree,
>>> I'm forced to do a full serialization of the whole document
>>> (etree.tostring(tree)) - which destroys the "human edited" format of the
>>> original HTML code. makes it rather unreadable.
>>
>> What do you mean? Could you give an example? lxml certainly does not
>> destroy anything it parsed, unless you tell it to do so.
>>
> 
> Of course it does not destroy anything during parsing. (?)
> 
> I mean: I want to walk through the parsed HTML tree with a Python 
> script and modify things here and there (automatic alt attributes from 
> a DB or similar, link corrections, text sections/translated sentences 
> ... as a result of HTML code and content checks).
> 
> Then I want to output the changed tree - but as close to the original 
> format as possible. No changes to my whitespace indentation, etc. 
> Only local changes, where tags were really changed.
> 
> That is similar to what a good HTML editor does: after you make 
> small changes, it doesn't reformat and re-emit your whole code layout 
> from the tree/attribute logic alone. You get local changes only.
> But a simple HTML editor like the one in Mozilla SeaMonkey outputs 
> entirely new HTML generated from the logical tree only (in its own 
> (ugly) style), destroying my whitespace layout and much more - it 
> forgets everything about the original layout.
> 
> Such a "good HTML editor" must somehow track the original positions of 
> the tags in the file. And with each logical change in the tree it must 
> track the resulting file position changes/offsets. That seems to be 
> missing in lxml and BeautifulSoup, which I have tried so far.
> 
> This is a frequent need of mine. Does nobody else have it?
> 
> It seems I need to write my own parser or patch BS to do that extra 
> tracking?
> 

Basic feature(s) such a parser would perhaps need:

* Can it tell, for each tag object in the parsed tree, at what 
original file position (start:end) it resided? Even more basic: 
can it tell me the line number (e.g. for warning/analysis reports)?

* (Optional) Do the tree objects automatically track whether they 
were changed? (For convenience; a copy of the tree could serve 
this purpose otherwise.)
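
To make the first point concrete, here is a rough sketch of the kind of position tracking I mean. It is not lxml or BeautifulSoup; it uses only the standard library's html.parser (getpos() and get_starttag_text()). The class name PositionRecorder and the tuple layout are made up for illustration, and only start tags are recorded.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical helper: records where each start tag sits in the source."""

    def __init__(self, source):
        super().__init__()
        # Absolute offset at which each line begins (index 0 == line 1).
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        # Entries are (tag, line, start_offset, end_offset).
        self.tags = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        # getpos() gives (1-based line, 0-based column) of the tag start.
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        # get_starttag_text() is the tag exactly as written in the file.
        end = start + len(self.get_starttag_text())
        self.tags.append((tag, line, start, end))


if __name__ == "__main__":
    src = '<p>\n  <img src="a.png">\n</p>\n'
    for tag, line, start, end in PositionRecorder(src).tags:
        print(tag, "line", line, "span", (start, end), repr(src[start:end]))
```

With start:end known, src[start:end] always gives back the original, untouched text of the tag, which is what a warning report or a local rewrite needs.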

Creating an output with only local changes would be rather 
simple from that ...
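
A minimal sketch of what "rather simple" could look like, assuming the parser can report offsets: collect (start, end, replacement) edits while walking the tags, then splice them into the original text from back to front so earlier offsets stay valid. Again this uses only the stdlib html.parser; AltFixer, apply_edits and add_missing_alt are hypothetical names, and the example only handles the "missing alt on img" case.

```python
import html
from html.parser import HTMLParser


class AltFixer(HTMLParser):
    """Hypothetical: collects edits that add alt="" to <img> tags lacking it."""

    def __init__(self, source, default_alt=""):
        super().__init__()
        # Absolute offset at which each line begins (index 0 == line 1).
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.default_alt = default_alt
        self.edits = []  # (start_offset, end_offset, replacement_text)
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag != "img" or any(name == "alt" for name, _ in attrs):
            return
        raw = self.get_starttag_text()  # the tag exactly as written
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        # Insert just before the closing ">" or "/>" (and any space before it).
        close = "/>" if raw.endswith("/>") else ">"
        n = len(raw[:-len(close)].rstrip())
        attr = ' alt="%s"' % html.escape(self.default_alt, quote=True)
        self.edits.append((start + n, start + n, attr))


def apply_edits(source, edits):
    """Splice edits into source, last edit first, leaving all else untouched."""
    out = source
    for start, end, text in sorted(edits, reverse=True):
        out = out[:start] + text + out[end:]
    return out


def add_missing_alt(source, default_alt=""):
    return apply_edits(source, AltFixer(source, default_alt).edits)


if __name__ == "__main__":
    src = '<p>\n   <img src="a.png">\n   <img src="b.png" alt="b" />\n</p>\n'
    print(add_missing_alt(src))
```

Everything outside the edited spans is copied byte for byte, so indentation, attribute order and quoting of untouched tags survive. Overlapping edits are not handled here.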


Robert


