Excess whitespace in my soup

Sat Jan 19 07:12:57 EST 2008

Not sure if this is sufficient for what you need, but how about

import re
s = re.sub(u'[\\s\xa0]+', u' ', s)

That should replace each run of one or more whitespace or \xa0
characters with a single space.
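
As a minimal sketch of the above, run against the third item from the
quoted message: the helper names below are hypothetical, and the second
variant is only an assumption about what may be wanted if the \xa0
should survive rather than become a plain space.

```python
import re

# Hypothetical helper names, for illustration only.

def collapse_all(s):
    # Treat the no-break space as whitespace: each run of whitespace
    # and/or u'\xa0' becomes one ordinary space.
    return re.sub(u'[\\s\xa0]+', u' ', s)

def collapse_keep_nbsp(s):
    # Assumption: only ASCII whitespace should collapse, so that
    # u'\xa0' is left untouched.
    return re.sub(u'[ \t\n\r\f\v]+', u' ', s)

text = u'foonly\n  frabjous\xa0farnarklingliness'
print(repr(collapse_all(text)))
print(repr(collapse_keep_nbsp(text)))
```

Neither variant strips leading or trailing whitespace, so a tail such as
u'\n    Line2' still comes back with one leading space.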

Remco

On Jan 19, 2008 12:38 PM, John Machin <sjmachin at lexicon.net> wrote:

> I'm trying to recover the original data from some HTML written by a
> well-known application.
>
> Here are three original data items, in Python repr() format, with
> spaces changed to tildes for clarity:
>
> u'Saturday,~19~January~2008'
> u'Line1\nLine2\nLine3'
> u'foonly~frabjous\xa0farnarklingliness'
>
> Here is the HTML, with spaces changed to tildes, angle brackets
> changed to square brackets, \r\n omitted from the end of each line,
> and a large number of attributes stripped from the [td] tags.
>
> ~~[td]Saturday,~19
> ~~January~2008[/td]
> ~~[td]Line1[br]
> ~~~~Line2[br]
> ~~~~Line3[/td]
> ~~[td]foonly
> ~~frabjous&nbsp;farnarklingliness[/td]
>
> Here are the results of feeding it to ElementSoup:
>
> >>> import ElementSoup as ES
> >>> elem = ES.parse('ws_soup1.htm')
> >>> from pprint import pprint as pp
> >>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
> [snip]
>  (u'td', u'Saturday, 19\n  January 2008', u'\n'),
>  (u'td', u'Line1', u'\n'),
>  (u'br', None, u'\n    Line2'),
>  (u'br', None, u'\n    Line3'),
>  (u'td', u'foonly\n  frabjous\xa0farnarklingliness', u'\n')]
>
> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace and
> would be converted to a space.
>
> Should whitespace collapsing be done earlier? Note that BeautifulSoup
> leaves it as &nbsp; -- ES does the conversion to \xa0 ...
>
> Does anyone know of an html_collapse_whitespace() for Python? Am I
> missing something obvious?
>
> Thanks in advance,
> John