Find relative url in mixed text/html

Sat Nov 28 00:11:58 EST 2015

Rob Hills <rhills at medimorphosis.com.au> writes:
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing lead me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have.  If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...

Beautiful Soup can deal with badly formed HTML pretty well, or at least
it could in earlier versions.  It gives you several different parsing
options to choose from now.  I think the default is lxml which is fast
but maybe more strict.  Check what the others are and see if a loose
slow one is still there.  It really is pretty slow so plan on a big
computation task if you're converting a large forum.

phpBB gets a bad rap that's maybe well-deserved but I don't know what to
suggest instead.