[Mailman-Users] Migrating from YahooGroups to Mailman

Tue Jul 31 22:46:19 CEST 2001

On 31 July 2001, Bradford Shaw said:
> I would also be interested in this. We just moved a large list off of 
> yahoogroups to its own server and Yahoo refuses to cooperate in letting us 
> have the archived messages (over 2 years worth at approx 1500 per month). 

I've just poked around groups.yahoo.com a bit, and it looks like this is
doable (but painful).  From any message, you can hit the "View source"
link.  This takes you to a page like

  http://groups.yahoo.com/group/NucNews/message/3859?source=1

(message 3859 of the "NucNews" list).

The good news:

  * the URL is dead easy to generate, eg. here's Python code to suck
    the entire archives for a list, one HTML file per message:

      import urllib

      url_template = "http://groups.yahoo.com/group/%s/message/%d?source=1"

      group_name = "NucNews"
      num_messages = 3887      # total message count; every message page shows it
      # message numbers start at 1, so fetch 1..num_messages inclusive
      for msg_num in xrange(1, num_messages + 1):
          url = url_template % (group_name, msg_num)
          msg_filename = "msg-%04d.html" % msg_num
          urllib.urlretrieve(url, msg_filename)

    (UNTESTED -- YMMV)

The bad news:

  * the HTML you download will need serious massaging before it's genuinely
    plain text (ie. valid RFC 822 messages).  Eg. here's a sample from
    message 1 of NucNews:

"""
<!-- start of guts !-->
<!-- Layout !-->
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<pre>From <a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1 at xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999
X-Digest-Num: 0
Message-ID: &lt;<a href="/group/NucNews/post?protectID=204183107153178091074082017036098126254083020093090065230045073141210143030150043098201196026">62814.0.1.959296473 at e...</a>&gt;
Date: Tue, 12 Jan 1999 17:08:50 -0500
[...]
"""

Note how anything that looks like an email address (including the ones
in the return-path and message-id headers) is turned into a hyperlink.
You'll need to strip out these <a href> tags (see Python's htmllib,
although you can probably kludge it with a regex) and just preserve the
content of the tag, eg. "prop1 at xxxxx.xxxx".
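
For what it's worth, here is a rough sketch of the regex kludge, written
for Python 3 rather than the Python of the download loop above.  The
names ANCHOR_RE and strip_anchors are made up for illustration, and the
sketch assumes the only markup inside the message text is simple,
non-nested <a ...>...</a> pairs like the ones in the sample; it has only
been tried on that sample line, not on a real archive.

```python
import re
from html import unescape

# Matches one <a ...>...</a> pair and captures only the link text.
# Assumes anchors are not nested (true of the sample above).
ANCHOR_RE = re.compile(r"<a\s[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

def strip_anchors(html_text):
    """Replace each <a href=...>text</a> with its text, then decode entities."""
    without_links = ANCHOR_RE.sub(r"\1", html_text)
    # Decode &lt; / &gt; / &amp; only after the tags are gone, so an escaped
    # "&lt;a ...&gt;" in a message body is not mistaken for real markup.
    return unescape(without_links)

sample = ('From <a href="/group/NucNews/post?protectID='
          '197212253115067135062046036199121208067038025008">'
          'prop1 at xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999')
print(strip_anchors(sample))
# prints: From prop1 at xxxxx.xxxx Tue Jan 12 14:08:50 1999
```

Doing the entity decoding last is deliberate: it keeps text that was
already escaped in the original message from being eaten by the regex.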

Even after doing this, you still don't have a valid RFC 822 message,
eg. here's another excerpt from msg 1 of NucNews:

  From: Peace Through Reason &lt;<a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1 at xxxxx.xxxx</a>

De-HTML-ified, that becomes:

  From: Peace Through Reason <prop1 at xxxxx.xxxx

which, even if you ignore the mangled email address, is still missing a
trailing angle-bracket.  Someone goofed in converting this message to
HTML, and now you lose!  You'll probably have to add some "Fix Yahoo!
bogosity" heuristics to your script.  Bummer.
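
One such heuristic might look like the sketch below (Python 3 again).
It only handles the one breakage shown above, a missing ">" at the end
of an address header, and both the function name close_angle_bracket and
the list of headers it touches are my own guesses, not anything Yahoo
documents.  It expects a single header line with the trailing newline
already removed, and it should only be run over the header section, not
the message body.

```python
# Headers that normally carry "Name <address>" values.  This list is an
# assumption; extend it if other headers turn out to be damaged too.
ADDRESS_HEADERS = ("from", "to", "cc", "reply-to", "sender")

def close_angle_bracket(header_line):
    """Append '>' to an address header that has more '<' than '>'."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() not in ADDRESS_HEADERS:
        # Not an address header (or not a header at all): leave it alone.
        return header_line
    if value.count("<") > value.count(">"):
        return header_line + ">"
    return header_line

broken = "From: Peace Through Reason <prop1 at xxxxx.xxxx"
print(close_angle_bracket(broken))
# prints: From: Peace Through Reason <prop1 at xxxxx.xxxx>
```

Restricting the fix to a few named headers keeps it from "repairing" a
Subject line that merely happens to contain a "<".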

Executive summary: your situation isn't completely hopeless, but it sure
does suck.  Bummer.

        Greg
-- 
Greg Ward - software developer                gward at mems-exchange.org
MEMS Exchange                            http://www.mems-exchange.org