Find relative url in mixed text/html

Sat Nov 28 02:07:32 EST 2015

On 28/11/15 03:35, Rob Hills wrote:
> Hi,
>
> For my sins I am migrating a volunteer association forum from one
> platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
> through the process.
>
> Posts to our original forum comprise a soup of plain text, HTML and
> BBCodes.  A post *may* include links done as either standard HTML
> links ( <a href=... ), BBCode links ( [url]http://... [/url] ) or
> sometimes just text: ( http://blah.blah.com.au or even just
> www.blah.blah.com.au ).
>
> In my conversion process, I am trying to identify cross-links (links
> from one post on the forum to another) so I can convert them to links
> that will work in the new forum.
>
> My current code uses a Regular Expression (yes, I read the recent posts
> on this forum about regex and HTML!) to pull out "absolute" links (
> starting with http:// ) and then I use Python to identify and convert
> the specific links I am interested in.  However, the forum also contains
> "cross-links" done using relative links and I'm unsure how best to
> proceed with that one.  Googling so far has not been helpful, but that
> might be me using the wrong search terms.
>
> Some examples of what I am talking about are:
>
>      Post fragment containing an "Absolute" cross-link:
>
>      <br />ive made a new thread:
>      <br />http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
>      <br />
>
>      converts to:
>
>      <br />
>      <br />ive made a new thread:
>      <br />/viewtopic.php?t=316&p=1958#1958
>
>      Post fragment containing a "Relative" cross-link:
>
>      <font size="3"><u>Battery Management System</u></font><br /><a href="/forum_posts.asp?TID=980&PID=15479#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
>      Needs converting to:
>
>      <font size="3"><u>Battery Management System</u></font><br /><a href="/viewtopic.php?p=15479&t=980#15479" target="_blank" rel="nofollow">Veroboard prototype</a><br />
>
> So, my question is:  What is the best way to extract a list of "relative
> links" from mixed text/html that I can then walk through to identify the
> specific ones I want to convert?
>
> Note, in the beginning of this project, I looked at using "Beautiful
> Soup" but my reading and limited testing led me to believe that it is
> designed for well-formed HTML/XML and therefore was unsuitable for the
> text/html soup I have.  If that belief is incorrect, I'd be grateful for
> general tips about using Beautiful Soup in this scenario...
>
> TIA,
>

Hi Rob

Is it safe to assume that all the cross-links, absolute or relative,
take one of the following forms?

	http://www.aeva.asn.au/forums/forum_posts.asp
	www.aeva.asn.au/forums/forum_posts.asp
	/forums/forum_posts.asp
	/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of
old-style URLs to new-style ones regardless of the context where they
appear, why would a regex fail to meet your needs?
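
As a rough sketch of that regex approach (untested against the real
forum data), something like the following could rewrite the links in
place. It assumes the four URL forms listed above, that TID always comes
first and PID is optional, and that the new links take the t=...&p=...
order shown in your first example. The names OLD_LINK, _rewrite and
convert_links are made up for illustration.

```python
import re

# Matches the old WebWiz post URLs in any of the four assumed forms:
# with or without scheme/host, with or without the /forums prefix.
OLD_LINK = re.compile(
    r"""
    (?:(?:https?://)?www\.aeva\.asn\.au)?    # optional scheme and host
    (?:/forums)?                             # optional path prefix
    /forum_posts\.asp
    \?TID=(?P<tid>\d+)                       # topic id (assumed to come first)
    (?:(?P<sep>&(?:amp;)?)PID=(?P<pid>\d+))? # optional post id, & or &amp;
    (?:\#(?P<anchor>\d+))?                   # optional fragment
    """,
    re.VERBOSE | re.IGNORECASE,
)


def _rewrite(match):
    """Build the phpBB-style URL from the captured ids."""
    new = "/viewtopic.php?t=" + match.group("tid")
    if match.group("pid"):
        # Keep whichever separator the post used (& or &amp;).
        new += match.group("sep") + "p=" + match.group("pid")
    if match.group("anchor"):
        new += "#" + match.group("anchor")
    return new


def convert_links(text):
    """Replace every old-style cross-link in text, whatever surrounds it."""
    return OLD_LINK.sub(_rewrite, text)


if __name__ == "__main__":
    post = ('<br />http://www.aeva.asn.au/forums/forum_posts.asp'
            '?TID=316&PID=1958#1958\n'
            '<a href="/forum_posts.asp?TID=980&PID=15479#15479">x</a>')
    print(convert_links(post))
```

Because the pattern does not care whether the link sits in an href, a
[url] tag or bare text, the same substitution covers all three cases.
One caveat: with the host part optional, a link to some other site whose
path happens to be /forum_posts.asp would also be rewritten, so it is
worth eyeballing the matches (OLD_LINK.finditer) before converting.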