Clean "Durty" strings

irstas at gmail.com
Mon Apr 2 12:17:29 EDT 2007


On Apr 2, 4:05 pm, "Diez B. Roggisch" <d... at nospam.web.de> wrote:
> > If the OP is constrained to standard libraries, then it may be a
> > question of defining what should be done more clearly. The extraneous
> > spaces can be removed by tokenizing the string and rejoining the
> > tokens. Replacing portions of a string with equivalents is standard
> > stuff. It might be preferable to create a function that will accept
> > lists of from and to strings and translate the entire string by
> > successively applying the replacements. From what I've seen so far,
> > that would be all the OP needs for this task. It might take a half-
> > dozen lines of code, plus the from/to table definition.
>
> The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
> code to clean up. Because your simple replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.
>
> Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s). For whitespace, re.sub(r'\s+', ' ', s).
For named character entities such as &eacute;,
re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s),
and for numeric references,
re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s).
That's pretty much it.
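Put together, the whole cleanup might look like the sketch below. Note this is updated to Python 3 names (chr instead of unichr, html.entities instead of htmlentitydefs); the sample input string is my own, not from the original post:

```python
import re
from html.entities import name2codepoint

def clean(s):
    """Strip tags, decode entities, and collapse whitespace."""
    # Remove anything that looks like an HTML tag, e.g. <br> or <b>.
    s = re.sub(r'<[^>]+>', '', s)
    # Decode numeric character references, e.g. &#233; -> 'e' with accent.
    s = re.sub(r'&#(\d+);', lambda mo: chr(int(mo.group(1))), s)
    # Decode named entities, e.g. &eacute; (raises KeyError on unknown names).
    s = re.sub(r'&(\w+);',
               lambda mo: chr(name2codepoint[mo.group(1)]), s)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', s).strip()

print(clean('<br>caf&eacute;  <b>bar</b> &#33;'))
```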

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])
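If third-party libraries are out (as the "standard libraries" constraint upthread suggests), the standard library's html.parser can do the tag stripping and entity decoding in one pass. A sketch, not the OP's code, using Python 3 module names:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text content, ignoring all tags.

    With convert_charrefs=True (the default), entities such as
    &eacute; and &#233; arrive already decoded in handle_data.
    """
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse whitespace, mirroring the re.sub(r'\s+', ' ', s) step.
        return re.sub(r'\s+', ' ', ''.join(self.parts)).strip()

parser = TextExtractor()
# The nasty-to-parse example from upthread, plus an entity:
parser.feed('<br>foo <br> bar </br> caf&eacute;')
parser.close()  # flush any buffered trailing text
print(parser.text())
```

Unlike the regex approach, this handles tags split across feed() calls and does not choke on '>' inside attribute values.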




More information about the Python-list mailing list