Excess whitespace in my soup

John Machin sjmachin at lexicon.net
Sat Jan 19 06:38:49 EST 2008


I'm trying to recover the original data from some HTML written by a
well-known application.

Here are three original data items, in Python repr() format, with
spaces changed to tildes for clarity:

u'Saturday,~19~January~2008'
u'Line1\nLine2\nLine3'
u'foonly~frabjous\xa0farnarklingliness'

Here is the HTML, with spaces changed to tildes, angle brackets
changed to square brackets, the \r\n omitted from the end of each
line, and a large number of attributes stripped from the [td] tags.

~~[td]Saturday,~19
~~January~2008[/td]
~~[td]Line1[br]
~~~~Line2[br]
~~~~Line3[/td]
~~[td]foonly
~~frabjous&nbsp;farnarklingliness[/td]

Here are the results of feeding it to ElementSoup:

>>> import ElementSoup as ES
>>> elem = ES.parse('ws_soup1.htm')
>>> from pprint import pprint as pp
>>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
[snip]
 (u'td', u'Saturday, 19\n  January 2008', u'\n'),
 (u'td', u'Line1', u'\n'),
 (u'br', None, u'\n    Line2'),
 (u'br', None, u'\n    Line3'),
 (u'td', u'foonly\n  frabjous\xa0farnarklingliness', u'\n')]

I'm happy enough with reassembling the second item. The problem is
in reliably and correctly collapsing the whitespace in each of the
above five elements. The standard Python idiom of
u' '.join(text.split()) won't work: the text is Unicode, so split()
treats u'\xa0' as whitespace, and the no-break space would be
converted to an ordinary space.
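
To make the problem concrete, here is a minimal sketch of the kind of
function I have in mind. It rests on an assumption: that only space,
tab, LF, FF and CR should collapse, and that u'\xa0' must survive.
The names HTML_WS and html_collapse_whitespace are made up for
illustration and do not come from any library.

```python
import re

# Assumption: only these five characters are collapsible whitespace
# (space, tab, LF, FF, CR). u'\xa0' is deliberately left out.
HTML_WS = u' \t\n\x0c\r'
_HTML_WS_RUN = re.compile(u'[' + HTML_WS + u']+')


def html_collapse_whitespace(text):
    """Hypothetical helper: collapse runs of HTML whitespace to one
    space and trim both ends, leaving no-break spaces untouched."""
    if text is None:  # e.text and e.tail can be None
        return u''
    return _HTML_WS_RUN.sub(u' ', text).strip(HTML_WS)


item = u'foonly\n  frabjous\xa0farnarklingliness'

# The usual idiom: split() also splits on u'\xa0', so it is lost.
naive = u' '.join(item.split())

# The sketch: the newline and indent collapse, u'\xa0' is kept.
fixed = html_collapse_whitespace(item)
```

Trimming both ends suits the [td] and [br] pieces above, but it would
also eat a significant space between inline elements, so it is not a
general answer.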

Should whitespace collapsing be done earlier? Note that BeautifulSoup
leaves it as &nbsp; -- ES does the conversion to \xa0 ...

Does anyone know of an html_collapse_whitespace() for Python? Am I
missing something obvious?

Thanks in advance,
John


