Seek HTML cleanup utilities

Jon Roland jon.roland at constitution.org
Sun Nov 21 11:26:48 EST 2004


I have a number of changes I would like to make to HTML files that are
not currently supported by HTML Tidy. Most of them arise from OCR
errors, and many from the way my OCR program, FineReader, saves to
HTML. I have begun to write stream-editing scripts in Python, but
wonder whether someone else has already done so. It would save me a
lot of time to use or modify already-written utilities, and I would
appreciate pointers to any that are available. Please respond by
email.

Some of the kinds of cleanup I want to be able to do include:

1. Removal of empty tag pairs.

2. Trimming/moving whitespace around tags:
    a. Removal of whitespace following a <p> and preceding
a </p>.
    b. Moving whitespace that follows a start tag to precede
it, and whitespace that precedes an end tag to follow it.

3. Moving certain punctuation -- comma, period,
semi-colon, etc. -- outside of certain end tags, such
as </i>, </b>, etc.

4. Removal of certain attributes:
    a. In <font> tag, face="Times New Roman" (or
whatever) so that it will be viewed with default font face.
    b. In <font> tag, size="2" (or whatever) so that it
will be viewed with default font size.

5. Changing of certain attributes:
    a. In <font> tag, absolute size="4" to relative
size="+1" (or whatever).

6. Changing of certain tags:
    a. <em> to <i>.
    b. <strong> to <b>.

7. Removal of certain tags, such as <p>, from around
all the contents of table cells.

8. For all tables, removal of empty topmost and
bottommost rows, leftmost and rightmost columns.

I could go on, but this provides a sample.
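
To make the request concrete, here is a rough sketch of how items 1,
2a, 3, 4a and 6 might be handled with regular-expression passes in
Python. It is only an illustration under stated assumptions, not a
finished utility: the function names, the list of tags treated as
"empty", and the set of punctuation characters are all hypothetical
choices, and regular expressions will stumble on HTML that has ">"
inside attribute values, comments, or script blocks.

```python
import re

# Assumed list of inline tags that are safe to drop when they hold only whitespace.
EMPTY_TAGS = "i|b|em|strong|font|span|u"

EMPTY_PAIR = re.compile(r"<(%s)\b[^>]*>(\s*)</\1\s*>" % EMPTY_TAGS, re.IGNORECASE)
FONT_TAG = re.compile(r"(<font\b)([^>]*)(>)", re.IGNORECASE)
FACE_ATTR = re.compile(r"""\s+face\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)
# Entities are matched first so that the ";" ending "&amp;" is never moved.
PUNCT_END = re.compile(r"(&#?\w+;)|([,.;:]+)(</(?:i|b)\s*>)", re.IGNORECASE)


def rename_tags(html):
    """Item 6: turn <em>/<strong> into <i>/<b>, keeping any attributes."""
    html = re.sub(r"<(/?)em\b", r"<\1i", html, flags=re.IGNORECASE)
    return re.sub(r"<(/?)strong\b", r"<\1b", html, flags=re.IGNORECASE)


def strip_font_face(html):
    """Item 4a: drop the face attribute from every <font> start tag."""
    return FONT_TAG.sub(
        lambda m: m.group(1) + FACE_ATTR.sub("", m.group(2)) + m.group(3), html
    )


def trim_paragraphs(html):
    """Item 2a: remove whitespace just after <p> and just before </p>."""
    html = re.sub(r"(<p\b[^>]*>)\s+", r"\1", html, flags=re.IGNORECASE)
    return re.sub(r"\s+(</p\s*>)", r"\1", html, flags=re.IGNORECASE)


def move_punctuation_out(html):
    """Item 3: move trailing punctuation from inside </i> or </b> to after it."""
    def swap(m):
        if m.group(1):                    # a character entity: leave it alone
            return m.group(1)
        return m.group(3) + m.group(2)    # end tag first, then the punctuation
    return PUNCT_END.sub(swap, html)


def remove_empty_pairs(html):
    """Item 1: delete empty pairs, keeping any whitespace they enclosed.

    Repeats until nothing changes so that nested empties such as
    <b><i></i></b> disappear as well.
    """
    while True:
        cleaned = EMPTY_PAIR.sub(r"\2", html)
        if cleaned == html:
            return cleaned
        html = cleaned


def cleanup(html):
    """Run the passes in an order where earlier steps feed later ones."""
    html = rename_tags(html)
    html = strip_font_face(html)
    html = trim_paragraphs(html)
    html = move_punctuation_out(html)
    return remove_empty_pairs(html)
```

Renaming tags first means the punctuation pass only has to know about
</i> and </b>, and removing empty pairs last catches pairs that
earlier passes emptied. Items 7 and 8 depend on table structure, so
they would probably need a real HTML parser rather than regular
expressions.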

Please visit my website at http://www.constitution.org to see what
kinds of HTML documents I am producing.


