Excess whitespace in my soup
Remco Gerlich
remco at gerlich.nl
Sat Jan 19 07:12:57 EST 2008
Not sure if this is sufficient for what you need, but how about:
import re
re.sub(u'[\s\xa0]+', ' ', s)
That should replace all occurrences of one or more whitespace or \xa0
characters with a single space.
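
A minimal sketch of that substitution applied to the sample strings from
the quoted message, in case it helps. The function names collapse_all and
collapse_ascii are hypothetical, made up for illustration. The second
variant is an assumption about what may be wanted if \xa0 has to survive:
it lists the ASCII whitespace characters explicitly instead of using \s.

```python
import re

# Raw-string patterns: the re module itself interprets \s and \xa0.
ALL_WS = re.compile(r'[\s\xa0]+')          # whitespace and no-break space
ASCII_WS = re.compile(r'[ \t\n\r\f\v]+')   # ASCII whitespace only


def collapse_all(s):
    """Replace each run of whitespace and/or \\xa0 with a single space."""
    return ALL_WS.sub(u' ', s)


def collapse_ascii(s):
    """Collapse runs of ASCII whitespace only, leaving \\xa0 untouched."""
    return ASCII_WS.sub(u' ', s)


sample = u'foonly\n frabjous\xa0farnarklingliness'
both = (collapse_all(sample), collapse_ascii(sample))
```

With the third sample item, collapse_all turns the \xa0 into a plain
space, while collapse_ascii keeps it and only folds the newline and
indentation into one space.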
Remco
On Jan 19, 2008 12:38 PM, John Machin <sjmachin at lexicon.net> wrote:
> I'm trying to recover the original data from some HTML written by a
> well-known application.
>
> Here are three original data items, in Python repr() format, with
> spaces changed to tildes for clarity:
>
> u'Saturday,~19~January~2008'
> u'Line1\nLine2\nLine3'
> u'foonly~frabjous\xa0farnarklingliness'
>
> Here is the HTML, with spaces changed to tildes, angle brackets changed
> to square brackets, \r\n omitted from the end of each line, and a large
> number of attributes stripped from the [td] tags.
>
> ~~[td]Saturday,~19
> ~~January~2008[/td]
> ~~[td]Line1[br]
> ~~~~Line2[br]
> ~~~~Line3[/td]
> ~~[td]foonly
> ~~frabjous&nbsp;farnarklingliness[/td]
>
> Here are the results of feeding it to ElementSoup:
>
> >>> import ElementSoup as ES
> >>> elem = ES.parse('ws_soup1.htm')
> >>> from pprint import pprint as pp
> >>> pp([(e.tag, e.text, e.tail) for e in elem.getiterator()])
> [snip]
> (u'td', u'Saturday, 19\n January 2008', u'\n'),
> (u'td', u'Line1', u'\n'),
> (u'br', None, u'\n Line2'),
> (u'br', None, u'\n Line3'),
> (u'td', u'foonly\n frabjous\xa0farnarklingliness', u'\n')]
>
> I'm happy enough with reassembling the second item. The problem is in
> reliably and correctly collapsing the whitespace in each of the above
> five elements. The standard Python idiom of u' '.join(text.split())
> won't work because the text is Unicode and u'\xa0' is whitespace and
> would be converted to a space.
>
> Should whitespace collapsing be done earlier? Note that BeautifulSoup
> leaves it as &nbsp; -- ES does the conversion to \xa0 ...
>
> Does anyone know of an html_collapse_whitespace() for Python? Am I
> missing something obvious?
>
> Thanks in advance,
> John
> --
> http://mail.python.org/mailman/listinfo/python-list
>