HTML Parser which allows low-keyed local changes (upon serialization)

Mon Feb 1 12:16:20 EST 2010

Stefan Behnel wrote:
> Robert, 01.02.2010 14:36:
>> Stefan Behnel wrote:
>>> Robert, 31.01.2010 20:57:
>>>> I tried lxml, but after walking and making changes in the element tree,
>>>> I'm forced to do a full serialization of the whole document
>>>> (etree.tostring(tree)), which destroys the "human edited" format of the
>>>> original HTML code and makes it rather unreadable.
>>> What do you mean? Could you give an example? lxml certainly does not
>>> destroy anything it parsed, unless you tell it to do so.
>> Of course it does not destroy anything during parsing (?).
> 
> I meant "parsed" in the sense of "has parsed and is now working on".
> 
> 
>> I mean: I want to walk through the parsed HTML tree with a Python script
>> and modify things here and there (automatic alt attributes from a DB or
>> similar, link corrections, text sections, translated sentences, ...)
>> based on HTML code and content checks.
> 
> Sure, perfectly valid use case.
> 
> 
>> Then I want to output the changed tree, but as close to the original
>> format as possible. No changes to my whitespace, indentation,
>> etc. Only local changes, where tags were really changed.
> 
> That's up to you. If you only apply local changes that do not change any
> surrounding whitespace, you'll be fine.
> 
> 
>> That's similar to what a good HTML editor does: after you make
>> little changes, it doesn't reformat/re-spit-out your whole code layout
>> from tree/attribute logic only. You have local changes only.
> 
> HTML editors don't work that way. They always "re-spit-out" the whole code
> when you click on "save". They certainly don't track the original file
> position of tags. What they preserve is the content, including whitespace
> (or not, if they reformat the code, but that's usually an *option*).
> 
> 
>> Such a "good HTML editor" must somehow track the original positions of
>> the tags in the file. And during each logical change in the tree it must
>> track the file position changes/offsets.
> 
> Sorry, but that's nonsense. The file position of a tag is determined by
> whitespace, i.e. line endings and indentation. lxml does not alter that,
> unless you tell it to do so.
> 
> Since you keep claiming that it *does* alter it, please come up with a
> reproducible example that shows a) what you do in your code, b) what your
> input is and c) what unexpected output it creates. Do not forget to include
> the version number of lxml and libxml2 that you are using, as well as a
> comment on /how/ the output differs from what you expected.
> 
> My stab in the dark is that you forgot to copy the tail text of elements
> that you replace by new content, and that you didn't properly indent new
> content that you added. But that's just that, a stab in the dark. You
> didn't provide enough information for even an educated guess.
> 
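
To make the "tail text" guess above concrete, here is a minimal sketch.
It uses the standard library's xml.etree.ElementTree rather than lxml
itself (lxml.etree follows the same text/tail model). The helper name
replace_keeping_tail and the sample markup are made up for illustration;
this is not code from either of us.

```python
import xml.etree.ElementTree as ET


def replace_keeping_tail(parent, old, new):
    """Swap `old` for `new` under `parent`, carrying over the tail text.

    The tail is the text (often just a newline plus indentation) that
    follows an element's end tag. It is stored on the element itself,
    so it is lost if the element is replaced without copying it.
    """
    new.tail = old.tail
    index = list(parent).index(old)
    parent.remove(old)
    parent.insert(index, new)


# Hypothetical sample input with hand-made indentation.
source = "<p>\n  <b>old</b>\n  trailing text\n</p>"
root = ET.fromstring(source)

replacement = ET.Element("i")
replacement.text = "new"
replace_keeping_tail(root, root.find("b"), replacement)

result = ET.tostring(root, encoding="unicode")
```

With the tail copied, the whitespace and text after the replaced element
survive serialization. Without the `new.tail = old.tail` line, the
newline, the indentation and "trailing text" would be dropped.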

I think you confused the logical level of what I meant by "file
position". Of course it is not (necessarily) about writing back to
the same open file (OS level), but about the whole serialization
string, wherever it is finally written to. I typically write the
auto-converted HTML files to a second test folder first and want to
use "diff -u ..." to see, in human-readable form, what changed. That
again is only reasonable if the original layout is preserved as well
as possible.

Take lxml and BeautifulSoup, for example: load and parse an HTML file
into a tree, immediately serialize the tree without changes, and you
see big differences between the original and the serialized file with
almost any file.

The main issue: those libs do not seem to track any info about the
original string/file positions of the objects they parse. They just
forget the past. Thus, it seems, they cannot in principle do what I
want ...

Or does anybody see attributes of the tree objects which I
overlooked? Or a lib which can do, or at least better enable, this
source-back-connected editing?
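
To illustrate what I mean by source-back-connected editing, here is a
rough sketch using only the standard library's html.parser, which
reports the position of each tag via getpos() and its raw text via
get_starttag_text(). It splices changes into the original string
instead of re-serializing a tree. The class and function names
(_AltPatcher, add_missing_alt) and the sample input are made up for
illustration, and it only handles one kind of change (adding a missing
alt attribute to img tags), so take it as a sketch and not a finished
tool.

```python
from html import escape
from html.parser import HTMLParser


class _AltPatcher(HTMLParser):
    """Collect string insertions for <img> tags that lack an alt attribute."""

    def __init__(self, source, default_alt):
        super().__init__()
        self.default_alt = default_alt
        self.insertions = []  # list of (absolute offset, text to insert)
        # getpos() reports (line, column), counting lines by "\n" only,
        # so build the line-start table the same way.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag != "img" or any(name == "alt" for name, _ in attrs):
            return
        lineno, column = self.getpos()        # position of the "<"
        start = self.line_starts[lineno - 1] + column
        raw = self.get_starttag_text()        # tag text exactly as written
        closer = "/>" if raw.endswith("/>") else ">"
        head = raw[: -len(closer)].rstrip()   # everything before the closer
        text = ' alt="%s"' % escape(self.default_alt, quote=True)
        self.insertions.append((start + len(head), text))


def add_missing_alt(source, default_alt=""):
    """Return `source` with alt attributes added, all other bytes untouched."""
    patcher = _AltPatcher(source, default_alt)
    patcher.feed(source)
    patcher.close()
    result = source
    # Apply from the end so earlier offsets stay valid.
    for offset, text in sorted(patcher.insertions, reverse=True):
        result = result[:offset] + text + result[offset:]
    return result


# Hypothetical sample input with deliberately irregular formatting.
sample = (
    '<p>\n'
    '   <IMG  src="a.png">\n'
    '\t<img src=b.png alt="x" />\n'
    '  <img src="c.png" />\n'
    '</p>\n'
)
patched = add_missing_alt(sample)
```

Because the edits are plain string insertions at recorded offsets, tag
case, attribute quoting, and indentation stay exactly as written, so a
"diff -u" between the original and the result shows only the lines that
really changed.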

Robert