Newbie regular expression and whitespace question

Thu Sep 22 16:20:01 EDT 2005

"googleboy" <mynews44 at yahoo.com> wrote in message
news:1127419129.027712.252910 at g47g2000cwa.googlegroups.com...
> Hi.
>
> I am trying to collapse an html table into a single line.  Basically,
> anytime I see ">" & "<" with nothing but whitespace between them,  I'd
> like to remove all the whitespace, including newlines. I've read the
> how-to and I have tried a bunch of things,  but nothing seems to work
> for me:
>
> --
>
> table = open(r'D:\path\to\tabletest.txt', 'rb')
> strTable = table.read()
>
> #Below find the different sort of things I have tried, one at a time:
>
> strTable = strTable.replace(">\s<", "><") #I got this from the module
> docs
>
> strTable = strTable.replace(">.<", "><")
>
> strTable = ">\s+<".join(strTable)
>
> strTable = ">\s<".join(strTable)
>
> print strTable
>
> --
>
> The table in question looks like this:
>
> <table width="80%"  border="0">
>   <tr>
>     <td> </td>
>     <td colspan="2">Introduction</td>
>     <td><div align="right">3</div></td>
>   </tr>
>   <tr>
>     <td> </td>
>   </tr>
>   <tr>
>     <td><i>ONE</i></td>
>     <td colspan="2">Childraising for Parrots</td>
>     <td><div align="right">11</div></td>
>   </tr>
> </table>
>
>
>
> For extra kudos (and I confess I have been so stuck on the above
> problem I haven't put much thought into how to do this one) I'd like to
> be able to measure the number of characters between the <p> & </p>
> tags, and then insert a newline character at the end of the next word
> after an arbitrary number of characters.....   I am reading in to a
> script a bunch of paragraphs formatted for a webpage, but they're all
> on one big long line and I would like to split them for readability.
>
> TIA
>
> Googleboy
>
If you're absolutely stuck on using RE's, then others will have to step
forward.  Meanwhile, here's a pyparsing solution (get pyparsing at
http://pyparsing.sourceforge.net):

---------------
from pyparsing import *

LT = Literal("<")
GT = Literal(">")

collapsableSpace = GT + LT    # matches with or without intervening
whitespace
collapsableSpace.setParseAction( replaceWith("><") )

print collapsableSpace.transformString(data)
---------------

The reason this works is that pyparsing implicitly skips over whitespace
while looking for matches of collapsable space (a '>' followed by a '<').
When found, the parse action is triggered, which in this case, replaces
whatever was matched with the string "><".  Finally, the input data (in this
case your HTML table, stored in the string variable, data) is passed to
transformString, which scans for matches of the collapsableSpace expression,
runs the parse action when they are found, and returns the final transformed
string.

As for word wrapping within <p>...</p> tags, there are at least two recipes
in the Python Cookbook for word wrapping.  Be careful, though, as many HTML
pages are very bad about omitting the trailing </p> tags.

-- Paul