Find relative url in mixed text/html

Rob Hills rhills at medimorphosis.com.au
Sat Nov 28 11:25:07 EST 2015


Hi Paul,

On 28/11/15 13:11, Paul Rubin wrote:
> Rob Hills <rhills at medimorphosis.com.au> writes:
>> Note, in the beginning of this project, I looked at using "Beautiful
>> Soup" but my reading and limited testing lead me to believe that it is
>> designed for well-formed HTML/XML and therefore was unsuitable for the
>> text/html soup I have.  If that belief is incorrect, I'd be grateful for
>> general tips about using Beautiful Soup in this scenario...
> Beautiful Soup can deal with badly formed HTML pretty well, or at least
> it could in earlier versions.  It gives you several different parsing
> options to choose from now.  I think the default is lxml which is fast
> but maybe more strict.  Check what the others are and see if a loose
> slow one is still there.  It really is pretty slow so plan on a big
> computation task if you're converting a large forum.

I've had another look at Beautiful Soup and while it doesn't really help
me much with urls (relative or absolute) embedded within text, it seems
to do a good job of separating out links from the rest, so that could be
useful in itself.

WRT time, I'm converting about 65MB of data which currently takes 14
seconds (on a 3yo laptop with a SSD running Ubuntu), which I reckon is
pretty amazing performance for Python3, especially given my relatively
crude coding skills.  It'll be interesting to see if using Beautiful
Soup adds significantly to that.

> phpBB gets a bad rap that's maybe well-deserved but I don't know what to
> suggest instead.

I did start to investigate Python-based alternatives; I've not heard
much good said about php, but I probably move in the wrong circles. 
However, our hosting service doesn't support Python so I stopped
hunting.  Plus there is a significant group of forum members who hold
very strong opinions about the functionality they want and it took a lot
of work to get them to agree on something!

All that said, I'd be interested to see specific (and hopefully
unbiased) info about phpBB's failings...

Cheers,

-- 
Rob Hills
Waikiki, Western Australia




More information about the Python-list mailing list